Introduction

In machine learning, classifying new objects on the basis of similar instances is one of the crucial tasks. The classification task becomes more complicated when one class contains far fewer instances than the others [1]. The class imbalance problem is simply an unequal distribution of data among the classes: most of the data samples belong to one class (or a few classes), and the remaining samples belong to the others. In a binary class imbalance problem, one class contains the vast majority of the data samples while the other contains only a few [2]. The class with the larger number of samples is called the majority class, and the class with the smaller number of samples is called the minority class [3, 4].

In the field of machine learning, learning from imbalanced data is one of the most challenging tasks for classification algorithms. Data imbalance arises in almost every domain; it is a common problem across fields such as the medical domain [5, 6], the marketing domain, image classification [7], agriculture, the big data domain [8,9,10], IoT [11,12,13], and so on [14,15,16]. Class imbalance is therefore one of the critical issues in machine learning. If a classification algorithm is biased towards the majority class, its accuracy suffers badly: when a new sample comes for classification, it tends to be classified into the majority class because the classifier has low prediction accuracy on the minority class. This situation is highly inappropriate and a serious matter of concern [17].

Nowadays, a drastic change in air pollution levels can be observed [18]. The pollution level in metropolitan cities is rising, which is not a good sign. To make the environment healthier and more comfortable, air pollution should be kept to a minimum. Various factors contribute to air pollution [19,20,21,22]; some participate directly and some indirectly. These pollutants come from many sources, such as industry, transportation services, daily traffic, thermal power plants, household appliances, and garbage from industries, hospitals, and homes. A high level of air pollution can harm humans, animals, and plants alike [23]. Consequently, many new cases of respiratory illness have been observed, reflecting the impact of bad air quality on human beings. Air pollution also affects crop quality and overall crop production. Thus, to reduce its effects, we have to classify the pollution level correctly in real time. Over time, many researchers have contributed approaches that were accurate to some extent [24,25,26,27,28], but due to the imbalanced nature of the data, these models did not predict the classes correctly [29,30,31,32].

Building a classifier from imbalanced datasets is a difficult task. When classifying imbalanced data, the minority class always suffers because the classification model is biased towards the majority class [33, 34]; as a result, any new sample tends to be classified into the majority class. This inherent need and widespread interest motivate researchers to address class imbalance. Over time, many researchers have offered valuable solutions to this problem, which were beneficial and capable of solving it to some extent by improving classifier performance. However, most of the solutions were proposed for binary class imbalance problems and are not suitable for the multi-class case. These limitations motivate us to address the multi-class imbalance problem and to contribute a solution capable of solving it. Our contributions are:

  • The solution is designed to be well suited for both binary class and multi-class imbalance problems.

  • The solution is based on algorithmic modification rather than data resampling in the preprocessing phase.

  • In our solution, a new kernel selection function is proposed.

In this paper, a scalable kernel-based SVM (Support Vector Machine) classification algorithm is proposed that can deal with the multi-class data imbalance problem. First, an approximate hyperplane is obtained using the standard SVM algorithm. Then, the weighting factor and the parameter function for every support vector are calculated at each iteration; the values of these parameters are computed using the chi-square test. Next, the new kernel function, or kernel transformation function, is calculated. With the help of this kernel transformation function, the uneven class boundaries are expanded and the skewness of the data is compensated. The proposed algorithm can therefore correct the approximate hyperplane and resolve the performance degradation issue. In this study, we also discuss the impact of air pollution on human health.

The rest of this paper is organized as follows. Related work is reviewed in “Related work”. The datasets, the working of the proposed algorithm with its mathematical foundation, and the ten performance evaluation metrics are described in “Materials and methods”. The results of the standard methods, the existing literature methods, and the proposed classification algorithm are presented in “Results”. “Discussion” provides a comprehensive discussion of the classification results and of the effect of bad air quality on health. Concluding remarks and future scope are given in “Conclusion”.

Related work

Building a classifier from imbalanced datasets is a difficult task: the minority class always suffers because the classification model is biased towards the majority class [33, 34], and any new sample tends to be classified into the majority class. Over time, many strategies have been devised to overcome class imbalance. These strategies work either at the algorithm level or at the data level.

The data-level approach is based on resampling. Many classification algorithms, such as SVM, naïve Bayes, C4.5, and AdaBoost, use resampling techniques to deal with the data imbalance problem. Resampling consists of two subtasks, under-sampling and over-sampling [35, 36]. Under-sampling removes samples from the dataset, while over-sampling generates new synthetic data. Two effective under-sampling methods, BalanceCascade and EasyEnsemble, were proposed by Liu et al. [37]. In BalanceCascade, the samples that are correctly classified at each step are removed and do not participate in the further classification task. In EasyEnsemble, the majority class is divided into several subsets, which are then used as inputs for the learner. SMOTE (Synthetic Minority Over-sampling Technique) is an intelligent over-sampling approach [36] that generates synthetic samples for the minority classes. An adaptive over-sampling method based on data density was proposed by Wang et al. [38], and a binary class over-sampling approach based on the probability density function was proposed by Geo et al. [39]. Gu et al. [40] discussed data mining approaches for imbalanced datasets.
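
As a minimal sketch of data-level resampling (assuming the third-party imbalanced-learn package rather than any of the methods cited above; the toy dataset merely stands in for real data):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# A toy 90/10 imbalanced binary dataset standing in for a real one.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))

# Over-sampling: SMOTE synthesizes minority samples by interpolating
# between a minority point and one of its nearest minority neighbours.
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)
print("after SMOTE:", Counter(y_over))

# Under-sampling: randomly discard majority samples instead.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("after under-sampling:", Counter(y_under))
```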

Algorithm-level approaches bias the learning process so as to reduce the dominance of the majority class and improve classifier performance. They mainly consist of modifications to the algorithms, cost-sensitive learning, ensemble learning, and active learning.

The cost-sensitive learning approach is based on an asymmetric cost-assignment policy that minimizes the cost of misclassified samples. Cost minimization penalizes the misclassification of each class with a penalty, but assigning the desired penalty at every class level is an arduous task [41, 42]. Sun et al. [41] introduced three cost-sensitive boosting algorithms for classifying imbalanced datasets within the AdaBoost framework. A cost-sensitive SVM (support vector machine) was proposed by Wang [43] to deal with the data imbalance problem.
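
A common, algorithm-agnostic form of cost-sensitive learning, sketched here with scikit-learn's SVC rather than the cited methods, assigns asymmetric misclassification costs through per-class weights:

```python
from sklearn.svm import SVC

# 'balanced' weights class j by N / (C * n_j), i.e. inversely proportional
# to its frequency; an explicit dict such as {0: 1, 1: 10} also works.
clf = SVC(kernel="rbf", class_weight="balanced")
# clf.fit(X_train, y_train)  # X_train, y_train are placeholders
```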

To deal with the data imbalance problem, some researchers have made modifications at the algorithm level, which can be done by optimizing the classifier itself. A fuzzy-based SVM was proposed by Batuwita and Palade [44] to handle imbalanced data in the presence of noise and outliers. Cano et al. [45] proposed imbalanced data classification based on weighted gravitation. Wu and Chang [46, 47] proposed adjusted class-boundary alignment to improve SVM performance.

The ensemble learning approach is designed to increase the accuracy of the classification algorithm. Several classifiers are used to train the model, and their decision outputs are combined into a single prediction used for decision-making [3]. Bagging and boosting are the vital machine learning algorithms in the ensemble learning paradigm [3]. An active sample selection technique was used by Oh et al. [48] to resolve the data imbalance problem. Liu et al. [49] integrated sampling techniques (both under-sampling and over-sampling) with the SVM to improve classifier performance.

The active learning approach is a special case of machine learning in which new data points are labeled with the desired outputs by interactively querying a user [50]. CBAL, a certainty-based active learning algorithm, was proposed by Fu and Lee in 2013 [51] to solve the issue of data imbalance. Based on the existing literature, the classification algorithms used to deal with the data imbalance problem are summarized in Table 1.

Table 1 Classification algorithms to deal with data imbalance problem

Materials and methods

In this section, we describe the materials and methods used in the experimental analysis. The section consists of three subsections: dataset description, the proposed algorithm, and statistical measures. The first subsection discusses the sensor-based CPCB dataset of Delhi. The second presents the proposed scalable kernel-based SVM classification algorithm with its mathematical foundation. The third briefly describes the performance evaluation metrics.

Data

For this study, we have taken the sensor-based CPCB data of Delhi, the most polluted city in India. We chose this benchmark data because the CPCB (Central Pollution Control Board) continuously monitors air quality with more than 200 base stations across approximately 20 states, and all the data from these stations are openly accessible on the CPCB website. In Delhi, 37 base stations monitor data continuously (24 × 7).

As we know, India has the second largest population in the world after China [52, 53]. Massive population growth is one of the key reasons for increasing pollution levels. Delhi is the capital and an industrial hub of India, so its population density is higher than that of other cities. As a result, pollution caused by industrial waste and vehicles is the main reason for Delhi's rising pollution level [54, 55]. High discharge of various gases, i.e., NO2, NH3, NO, CO2, O3, and CO, together with additional factors such as wind direction, wind speed, temperature, and relative humidity, makes the air of Delhi heavily polluted and toxic. Toxic and other harmful particles dissolve in the air, so living in such a polluted environment may cause severe diseases; in the most severe cases, even death is possible. We therefore have to take preventive measures to enhance quality of life by reducing pollution levels for human well-being.

For the experimental analysis, the dataset of the capital Delhi from the Indian Central Pollution Control Board (CPCB) has been taken [56]. The dataset was collected by sensor-based devices placed at multiple locations in Delhi, as shown in Fig. 1; the figure is plotted using the longitude and latitude of the data collection points in the Delhi region, with the 37 data collection centers marked by red circles. We have taken the data from January 1, 2019, to October 1, 2020, recorded twenty-four times a day, i.e., on an hourly basis. The CPCB air quality dataset is enriched with numerous relevant features that can play an essential role in the air quality classification task: PM10 (Concentration of Inhalable Particles), SO2 (Sulfur Dioxide), PM2.5 (Fine Particulate Matter), O3 (Ozone), NOx (Nitrogen Oxides), NO2 (Nitrogen Dioxide), NO (Nitrogen Monoxide), NH3 (Ammonia), CO (Carbon Monoxide), AQI (Air Quality Index), WD (Wind Direction), C6H6 (Benzene), WS (Wind Speed), RH (Relative Humidity), SR (Solar Radiation), BP (Bar Pressure), and AT (Absolute Temperature). The dataset taken for the classification task contains 16 columns and 332,880 rows in total, i.e., 16 columns and 8760 rows at each of the 37 base stations taken into consideration.

Fig. 1 Air quality data collection centers in the Delhi region

In this research work, the classification task has been performed on the CPCB air quality dataset, which contains various attributes. Only those attributes responsible for high air pollution levels have been taken into consideration: the concentration of inhalable particles (PM10), sulfur dioxide (SO2), fine particulate matter (PM2.5), ozone (O3), nitrogen oxides (NOx), nitrogen dioxide (NO2), nitrogen monoxide (NO), ammonia (NH3), carbon monoxide (CO), the Air Quality Index (AQI), and so on.

Table 2 presents the dataset features that participate in the classification task. Each feature is described by several parameters: the variable name with its abbreviation, the nature of the data, the measuring unit, the period of data collection, the variable type, and the data extraction source.

Table 2 Substantial features of the dataset: a quick look

Table 3 describes the dataset features used in the classification task in terms of several parameters: the variable name with its mean value, the measuring unit, the standard deviation, and the actual and prescribed ranges of each variable.

Table 3 Variable description

Table 4 describes the data that resulted from preprocessing and was taken for the experimental analysis. The preprocessed data contains six classes, 270,596 samples, and ten attributes per sample. The class-wise distribution of the dataset is 13,452, 47,910, 93,167, 55,045, 30,421, and 30,601 samples for classes one to six, giving a class imbalance ratio of 6.92.

Table 4 Preprocessed dataset description
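
As a quick sanity check of Table 4 (not part of the original pipeline), the stated class counts reproduce both the total sample size and the imbalance ratio:

```python
counts = [13452, 47910, 93167, 55045, 30421, 30601]  # classes one to six
print(sum(counts))                 # 270596 samples in total
print(max(counts) / min(counts))   # 6.9259..., the 6.92 ratio stated above
```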

Table 5 describes the Air Quality Index (AQI), giving the AQI range, the corresponding AQI label, and the class level. The labeling is divided into six parts covering the range from 0 to more than 400 [56]. The mapping between the CPCB dataset and the AQI ranges is also established here.

Table 5 Air quality description

Proposed methodology

The primary aim of the proposed algorithm is to deal with the data imbalance problem efficiently. The proposed algorithm is based on the concept of the adjusting kernel scaling (AKS) method [57] and is designed to deal with multi-class imbalanced datasets. In this paper, we propose an SVM classifier integrated with the adjusting kernel scaling method. A detailed discussion of the proposed algorithm follows in this section.

Basic support vector machine algorithm (SVM)

Support Vector Machine (SVM) is a widely used and well-known machine learning algorithm for data classification, proposed by Vapnik et al. [58] in 1995. The primary aim of the algorithm is to map the input data into a high-dimensional space with the help of a kernel function so that the classes become linearly separable [58,59,60]. In the case of a binary class problem, the maximum-margin separating hyperplane is given by:

$$w.x+b=0$$
(1)

Based on the optimal pair \(({w}_{0},{b}_{0})\), the decision function of the SVM is represented by:

$$f\left(x\right)=\sum _{j\in SV}{\lambda }_{j}{y}_{j } \langle x.{x}_{j } \rangle +b$$
(2)

where \({\lambda }_{j}\) is the Lagrange multiplier associated with support vector \({x}_{j}\), \({y}_{j}\) is its class label, and \(SV\) denotes the set of support vector indices.

Figure 2 shows the hyperplane with maximum separating margin and support vectors in the SVM algorithm paradigm.

Fig. 2 Hyperplane with support vectors in the SVM algorithm paradigm

For a higher-dimensional feature space, the inner product \( \langle x.{x}_{j} \rangle \) is replaced by a kernel function \(K({x}_{j},{x}_{i})\), which computes the inner product of the samples under an implicit feature map \(\phi\):

$$K({x}_{j},{x}_{i})= \langle \phi ({x}_{j}).\phi ({x}_{i}) \rangle $$
(3)

Kernel function selection

In this step, the kernel function chosen from the standard SVM is used to approximately compute the positions of the class boundaries. Initially, the dataset \(P\) is split into the class subsets \({P}^{1},{P}^{2},{P}^{3},\dots ,{P}^{C}\), and then the kernel transformation function defined in the equation below is applied.

$$f\left(x\right)= \left\{\begin{array}{l}{e}^{-{z}_{1}{h\left(x\right)}^{2}},\quad if\ x\in {P}^{1} \\ {e}^{-{z}_{2}{h\left(x\right)}^{2}},\quad if\ x\in {P}^{2} \\ \qquad \vdots \\ {e}^{-{z}_{C}{h\left(x\right)}^{2}},\quad if\ x\in {P}^{C}\end{array}\right.$$
(4)

where \(h(x)=\sum _{j\in SV}{\lambda }_{j}{y}_{j} \langle x.{x}_{j} \rangle +b\) (with \({\lambda }_{j}\) the Lagrange multiplier of support vector \({x}_{j}\)), \({P}^{j}\) is the jth class subset of the training set, and the value of the parameter \({z}_{j}\), for \(j=1,2,\dots ,C\), is computed from the chi-square test \({(\chi }^{2})\) as explained in “Computing the parameter \({z}_{j}\)”.
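
A minimal sketch of Eq. (4), assuming the h(x) values come from an already-trained SVM's decision function and that z holds one \(z_j\) per class (the names are illustrative, not from the original implementation):

```python
import numpy as np

def conformal_factor(h_values, class_indices, z):
    """f(x) = exp(-z_j * h(x)^2) for samples belonging to class subset P^j."""
    h = np.asarray(h_values, dtype=float)
    j = np.asarray(class_indices, dtype=int)
    return np.exp(-np.asarray(z)[j] * h ** 2)

# Two classes with z = [0.3, 0.7]:
print(conformal_factor([1.2, -0.4], [0, 1], [0.3, 0.7]))
```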

Chi-square test

The chi-square test \({(\chi }^{2})\) is an important statistical test applied to sets of categorical features to determine the frequency-distribution-based association among groups of categorical features; in other words, it evaluates the correlation among the groups. Here, the chi-square test is used to establish the relationship between the sample count of each category and the parameter \(z_{j}\). The chi-square test \({(\chi }^{2})\) is formulated as:

$${\chi }^{2}= \sum \frac{{{(f}_{0}-{f}_{e})}^{2}}{{f}_{e}}$$
(5)

where \(f_{o}\) and \(f_{e}\) denote the observed and expected frequencies, respectively.
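
Equation (5) can be computed directly and cross-checked with SciPy; the sketch below uses the class counts of Table 4 as observed frequencies against a uniform expected distribution:

```python
import numpy as np
from scipy.stats import chisquare

f_obs = np.array([13452, 47910, 93167, 55045, 30421, 30601], dtype=float)
f_exp = np.full(6, f_obs.sum() / 6)          # uniform (optimal) distribution

chi2 = ((f_obs - f_exp) ** 2 / f_exp).sum()  # Eq. (5)
print(chi2)
print(chisquare(f_obs, f_exp).statistic)     # same value via scipy.stats
```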

Computing the weighting factor

Setting the weighting factor is an important and difficult issue when dealing with the class imbalance problem: assigning the weights appropriate for overcoming the imbalance is complex. A simple way to deal with it is to give less weight to the majority class and more weight to the minority class while satisfying the condition \({z}_{j}\in (0,1)\).

The following weighting-factor formulation has been used in the proposed algorithm to deal with the multi-class imbalance problem; in other words, it is the method used to compensate for the uneven data distribution:

$${w}_{j}= \frac{N/{n}_{j}}{\sum _{j=1}^{C}N/{n}_{j}}$$
(6)

where \(N\) and \(C\) denote the training sample size and the number of categories, respectively, and \({n}_{j}\) is the sample size of category \(j\), with \(j=1,2,\dots ,C\).
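
Equation (6) is straightforward to compute; using the Table 4 class counts, the sketch below shows that the smallest (minority) class receives the largest weight:

```python
import numpy as np

n = np.array([13452, 47910, 93167, 55045, 30421, 30601], dtype=float)
N = n.sum()
w = (N / n) / (N / n).sum()  # Eq. (6): w_j in (0, 1), summing to one
print(w)                     # class one (the minority) gets the largest w_j
```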

Computing the parameter \({\boldsymbol{z}}_{\boldsymbol{j}}\)

Let \(P\) be the dataset, containing \(N\) samples in \(C\) categories. The value of the parameter \({z}_{j}\) is calculated from the weighting factor of Eq. (6) and the chi-square test of Eq. (5). The chi-square value \({(\chi }^{2})\) under the optimal (uniform) distribution is:

$${\chi }^{2}= \sum _{j=1}^{C}\frac{{{(n}_{j}-N/C)}^{2}}{N/C}$$
(7)

where \({n}_{j}\) is the number of samples in the jth category and \(j=1,2,\dots ,C\).

Let \({X}_{j}=\frac{{({n}_{j}-N/C)}^{2}}{N/C}\).

Then,

$${\chi }^{2}= \sum _{j=1}^{C}{X}_{j}$$
(8)

So, the parameter \({z}_{j}\) can be defined as

$${z}_{j}= {w}_{j}\times \frac{{X}_{j}}{{\chi }^{2}}$$
(9)

Substituting the value of \({\chi }^{2}\) from Eq. (8):

$${z}_{j}= {w}_{j}\times \frac{{X}_{j}}{\sum _{j=1}^{C}{X}_{j}}$$
(10)
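
Putting Eqs. (7) to (10) together, the per-class terms \(X_j\), the chi-square value, and the parameters \(z_j\) follow directly from the class counts and the weights of Eq. (6):

```python
import numpy as np

n = np.array([13452, 47910, 93167, 55045, 30421, 30601], dtype=float)
N, C = n.sum(), len(n)
w = (N / n) / (N / n).sum()      # Eq. (6)

X = (n - N / C) ** 2 / (N / C)   # per-class terms X_j, Eq. (7)
z = w * X / X.sum()              # Eq. (10): z_j = w_j * X_j / chi^2
print(z)
```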

Description of the proposed algorithm

The flow chart of the proposed algorithm is shown in Fig. 3. First, the CPCB air quality dataset is cleaned, and the preprocessed data is fed to the classification algorithm to obtain the initial partition. In the second step, the weighting factor \({w}_{j}\) and the parameter \({z}_{j}\) are calculated for every support vector in each iteration; the values of these parameters are computed using the chi-square test. In the next step, the kernel transformation function is calculated, and finally the classification model is retrained using the newly computed kernel matrix \({K}_{mt}\).

Fig. 3 Flow chart of the proposed algorithm

The proposed classification algorithm consists of 11 steps, which are described in Algorithm 1.

Algorithm 1
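
The following is a simplified, single-pass sketch of Algorithm 1 under stated assumptions (a binary problem, an RBF base kernel, and one conformal scaling step \(K'(a,b)=f(a)f(b)K(a,b)\)); the full 11-step algorithm iterates the scaling per support vector, so details may differ:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def aks_svm_sketch(X, y, z, gamma=0.1):
    """One conformal-scaling pass: K'(a, b) = f(a) * f(b) * K(a, b)."""
    y = np.asarray(y)
    K = rbf_kernel(X, X, gamma=gamma)           # base kernel matrix
    base = SVC(kernel="precomputed").fit(K, y)  # step 1: approximate hyperplane
    h = base.decision_function(K)               # h(x) as used in Eq. (4)
    f = np.exp(-np.asarray(z)[y] * h ** 2)      # per-sample factor f(x)
    K_new = np.outer(f, f) * K                  # new kernel matrix K_mt
    return SVC(kernel="precomputed").fit(K_new, y)  # retrain on K_mt

# Toy usage; z would come from Eq. (10), and y must hold 0/1 class indices.
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)
model = aks_svm_sketch(X, y, z=np.array([0.4, 0.6]))
```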

Statistical analysis

In this section, the various statistical measures used to evaluate the performance of the algorithms are discussed. Statistical analysis is essential for picking the best algorithm on the basis of performance. In this paper, ten statistical measures have been chosen to compare the proposed algorithm with the existing ones: accuracy, precision, recall, F1-score, TNR, NPV, FNR, FPR, FDR, and FOR [61,62,63,64]. With the help of these ten evaluation measures, we can determine which algorithm performs the classification task most effectively and efficiently.

Accuracy

Accuracy, with respect to the classification task, is the percentage of instances that are correctly classified; in other words, it is the percentage ratio of correctly predicted instances over the entire test set. Accuracy is defined in the equation below [61].

$$Accuracy=\frac{(TP+TN)}{(TP+TN+FP+FN)} \times 100 \%$$
(11)

where \(TP\) and \(FP\) are the numbers of true positives and false positives, respectively, and \(FN\) and \(TN\) are the numbers of false negatives and true negatives, respectively.

Precision

Precision, with respect to the classification task, quantifies how many of the predicted positive instances actually belong to the positive class; in other words, it is the ratio of true positives over the total number of true positives and false positives. Precision is defined in the equation below [61, 63].

$$Precision=\frac{(TP)}{(TP+FP)}$$
(12)

where \(TP\) and \(FP\) are the numbers of true positives and false positives, respectively.

Recall

Recall, with respect to the classification task, quantifies how many of all the positive instances in the dataset are correctly predicted as positive; in other words, it is the ratio of true positives over the total number of true positives and false negatives. Recall is defined in the equation below [63].

$$Recall=\frac{(TP)}{(TP+FN)}$$
(13)

where \(TP\) and \(FN\) are the numbers of true positives and false negatives, respectively.

F1-score

The F1-score is also known as the F-measure or F-score. With respect to the classification task, it quantifies the balance between recall and precision; in other words, it is twice the product of precision and recall divided by their sum. The F1-score is defined in the equation below [62].

$$f1=\frac{2\times \left(precision \times recall\right)}{precision+recall}$$
(14)

True negative rate (TNR)

The TNR, with respect to the classification task, quantifies the specificity, or true negative rate; in other words, it is the ratio of true negatives over the total number of true negatives and false positives. The TNR is defined in the equation below [61, 64].

$$TNR =\frac{(TN)}{(TN+FP)}$$
(15)

where \(TN\) and \(FP\) are the numbers of true negatives and false positives, respectively.

Negative predictive value (NPV)

The NPV, with respect to the classification task, quantifies the negative predictive value; in other words, it is the ratio of true negatives over the total number of true negatives and false negatives. The NPV is defined in the equation below [61, 64].

$$NPV=\frac{(TN)}{(TN+FN)}$$
(16)

where \(TN\) and \(FN\) are the numbers of true negatives and false negatives, respectively.

False negative rate (FNR)

The FNR, with respect to the classification task, quantifies the miss rate; in other words, it is the ratio of false negatives over the total number of false negatives and true positives. The FNR is defined in the equation below [61, 64].

$$FNR =\frac{(FN)}{(FN+TP)}$$
(17)

where \(TP\) and \(FN\) are the numbers of true positives and false negatives, respectively.

False positive rate (FPR)

The FPR, with respect to the classification task, quantifies the fall-out; in other words, it is the ratio of false positives over the total number of false positives and true negatives. The FPR is defined in the equation below [61, 64].

$$FPR=\frac{(FP)}{(FP+TN)}$$
(18)

where \(FP\) and \(TN\) are the numbers of false positives and true negatives, respectively.

False discovery rate (FDR)

The FDR is the ratio of false positives over the total number of false positives and true positives. The FDR is defined in the equation below [61, 64].

$$FDR=\frac{(FP)}{(FP+TP)}$$
(19)

where \(FP\) and \(TP\) are the numbers of false positives and true positives, respectively.

False omission rate (FOR)

The FOR is the ratio of false negatives over the total number of false negatives and true negatives. The FOR is defined in the equation below [61, 64].

$$FOR=\frac{(FN)}{(FN+TN)}$$
(20)

where \(TN\) and \(FN\) are the numbers of true negatives and false negatives, respectively.
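
All ten measures follow from a single confusion matrix; the short check below uses made-up binary counts, following Eqs. (11) to (20):

```python
TP, FP, FN, TN = 80, 10, 5, 105  # made-up binary confusion-matrix counts

accuracy  = (TP + TN) / (TP + TN + FP + FN) * 100          # Eq. (11)
precision = TP / (TP + FP)                                 # Eq. (12)
recall    = TP / (TP + FN)                                 # Eq. (13)
f1        = 2 * precision * recall / (precision + recall)  # Eq. (14)
tnr = TN / (TN + FP)   # Eq. (15), specificity
npv = TN / (TN + FN)   # Eq. (16)
fnr = FN / (FN + TP)   # Eq. (17), miss rate
fpr = FP / (FP + TN)   # Eq. (18), fall-out
fdr = FP / (FP + TP)   # Eq. (19)
fomr = FN / (FN + TN)  # Eq. (20); 'for' is a Python keyword, hence fomr
print(accuracy, precision, recall, f1, tnr, npv, fnr, fpr, fdr, fomr)
```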

Results

In this section, we present the classification results of the following algorithms: the AdaBoost algorithm (ADB) [65,66,67], the Multilayer Perceptron algorithm (MLP) [68,69,70], the Gaussian NB algorithm (GNB) [71,72,73], the standard Support Vector Machine algorithm (SVM) [58,59,60], existing literature methods, and the proposed scalable kernel-based SVM algorithm.

Model comparison

Identifying the best classification model capable of dealing with class imbalance problems is a complex task. The CPCB air quality dataset has been taken for the experimental analysis. In Fig. 4, the x-axis denotes the various classes and the y-axis the number of data samples in each class. Figure 4 makes clear that our dataset has an uneven class distribution, i.e., it is imbalanced, which makes such a situation more challenging for traditional classification models. The class-wise distribution of the dataset is: 13,452 samples in the first class, 47,910 in the second, 93,167 in the third, 55,045 in the fourth, 30,421 in the fifth, and 30,601 in the last class. The dataset has a class imbalance ratio of 6.92.

Fig. 4 Class-wise distribution of the CPCB dataset

The primary aim of this research work is to find the best classification model for dealing with the class imbalance problem. Over time, many researchers have offered valuable solutions to this problem; however, most were proposed for binary class imbalance problems and are not suitable for the multi-class case. These limitations motivated us to modify the algorithm so that it can efficiently deal with both multi-class and binary class imbalance problems without compromising performance. This classification will also be helpful in building solutions for proficient healthcare.

For the experimental evaluation, four well-established traditional classification algorithms and existing literature methods have been compared with our proposed algorithm to determine its suitability, correctness, and efficiency. The performance of all the classification algorithms has been measured with ten performance validation measures, using a tenfold cross-validation policy.

Figure 5 gives an overview of the algorithms used in the classification task and compared against our proposed classifier: ADB (AdaBoost algorithm), MLP (Multilayer Perceptron algorithm), GNB (Gaussian NB algorithm), the standard SVM (Support Vector Machine algorithm), existing literature methods, and the proposed scalable kernel-based SVM algorithm.

Fig. 5 Classification models for experimental evaluation

Performance evaluation of classification algorithms

The performance evaluation of the classification algorithms has been divided into two parts. In the first part, the CPCB air quality dataset of the whole Delhi region, gathered from the 37 distributed base stations, has been taken, and all the data is served to the classification task as a single file. The classification results of all algorithms are reported in terms of precision, recall, F1-score, TNR, NPV, FNR, FPR, FDR, FOR, and accuracy, with validation based on classification accuracy. Since our dataset has an imbalanced class distribution, the performance of the classification algorithms may suffer. All standard models perform well except the AdaBoost classifier (ADB), which achieves the lowest accuracy, 59.72%, among all the classifiers. The standard SVM, MLP, and Gaussian NB classifiers perform quite well under the imbalanced class distribution but still fall behind our proposed SVM classifier, which achieves the highest accuracy, 99.66%, among all the models. The detailed classification results are shown in Table 6.

Table 6 Performance evaluation of classification algorithms I

We have also compared the proposed algorithm against existing literature methods on the CPCB dataset coming from the 37 places in Delhi. The proposed algorithm achieves the highest accuracy, 99.66%, among the existing literature methods, and it deals with the class imbalance problem without compromising performance. The comparison of the existing literature methods with the proposed classification algorithm is presented in Table 7.

Table 7 Performance evaluation of existing literature methods vs proposed classification algorithm

In the second part of the performance evaluation, we have taken the individual data of each CPCB base station, collected at 37 places in Delhi. The 37 data files have been used as input datasets for the classification task with the various classification algorithms. The acronyms used in Table 8 are defined in Appendix 1. Our proposed algorithm performs exceptionally well in this rigorous analysis on all the datasets A1 to A37, achieving the highest average accuracy, 99.72% (averaged over A1 to A37), among all the algorithms, and it deals with the class imbalance problem without compromising performance. The detailed results are shown in Table 8.

Table 8 Performance evaluation of classification algorithms II

Discussion

Numerous associated factors may play a crucial role in affecting air quality; some participate directly and some indirectly in polluting the air. Pollutants that are air-soluble are hazardous to human health. Poor diffusion conditions are one of the crucial factors that increase pollutant levels; diffusion is the movement of air particles from a region of high concentration to a region of low concentration. Before the classification task, the dataset is preprocessed; preprocessing drops missing values and unusual objects from the dataset. The dataset consists of numerous relevant features: PM10 (Concentration of Inhalable Particles), SO2 (Sulfur Dioxide), PM2.5 (Fine Particulate Matter), O3 (Ozone), NOx (Nitrogen Oxides), NO2 (Nitrogen Dioxide), NO (Nitrogen Monoxide), NH3 (Ammonia), CO (Carbon Monoxide), AQI (Air Quality Index), WD (Wind Direction), C6H6 (Benzene), WS (Wind Speed), RH (Relative Humidity), SR (Solar Radiation), BP (Bar Pressure), and AT (Absolute Temperature). The correlation on the preprocessed data is calculated to find the relationship between the class and the contributing factors.

Figure 6 shows the relationship between the class and the contributing factors. With the help of the correlations, we can easily find which factors are highly correlated with the class.

Fig. 6 Correlation coefficients of the contributing factors
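
As a hedged sketch of this correlation step (the toy DataFrame and its column names merely stand in for the preprocessed CPCB data):

```python
import pandas as pd

# Toy frame standing in for the preprocessed data; real columns differ.
df = pd.DataFrame({
    "PM2.5": [60, 120, 250, 40], "NO2": [30, 55, 80, 20],
    "CO": [0.8, 1.4, 2.1, 0.5], "class": [2, 3, 5, 1],
})
# Correlation of every factor with the class label, strongest first.
print(df.corr(numeric_only=True)["class"].sort_values(ascending=False))
```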

Performance evaluation of classification algorithms

For the experimental analysis, the dataset of the capital Delhi from the Indian Central Pollution Control Board (CPCB) has been taken. The data from January 1, 2019, to October 1, 2020, has been used for training and testing, with a tenfold cross-validation policy. Cross-validation is a technique for assessing models by partitioning the given data sample into training and testing sets: the training set is used to train the model, and the testing set to evaluate it. In k-fold cross-validation, the data sample is randomly divided into k subsamples of equal size; k−1 subsamples are used for training the model and a single subsample for validation. This procedure is repeated k times (folds), so each subsample is used exactly once for validation, and a single estimate is produced by averaging the results of all k folds. The algorithms used in the classification task are ADB (AdaBoost algorithm), MLP (Multilayer Perceptron algorithm), GNB (Gaussian NB algorithm), the standard SVM (Support Vector Machine algorithm), existing literature methods, and the proposed scalable kernel-based SVM algorithm.
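
A sketch of this tenfold protocol with scikit-learn, on a toy dataset standing in for the CPCB data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(), X, y, cv=cv, scoring="accuracy")
print(scores.mean())  # the single averaged estimate reported per model
```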

Figure 7 presents the experimental results (statistical-measure-based, and existing literature methods versus the proposed algorithm) of the various classification algorithms on the CPCB dataset of the whole Delhi region. The figure makes clear that our proposed algorithm, with the highest accuracy of 99.66%, outperforms all the other classification algorithms and existing literature methods, including the traditional SVM. The results thus also show that our proposed algorithm deals with class imbalance problems without compromising performance.

Fig. 7 Results of the classification algorithms: a statistical measures based I, b statistical measures based II, c existing literature methods vs proposed algorithm

Figure 8 plots, as a bar graph, the accuracy-based classification results of the various classification algorithms on the CPCB datasets A1, A10, A20, A30, and A37 of the Delhi region. The figure shows that our proposed algorithm achieves the highest accuracy across all these areas, again outperforming the other classification algorithms, including the traditional SVM. The results thus confirm that our proposed algorithm deals with class imbalance problems while enhancing performance.

Fig. 8 Accuracy-based results of the classification algorithms II

Effect on healthcare

Bad air quality can impact individuals' health and quality of life, causing problems ranging from minor to severe. It may affect an individual's cardiovascular (circulatory) system, respiratory system, excretory system (kidneys and urinary tract), nervous system, endocrine system, digestive system, lymphatic system, integumentary system (skin), and ophthalmic system.

Table 9 shows the AQI ranges with their associated labels and the possible impact of each level on health [56]. The AQI scale is divided into six ranges, starting at 0–50 and ending at greater than 400.

Table 9 Air quality index range with possible health impact

The consequences of high AQI levels for an individual's health are described in Table 10, divided into three parts: short-term impact, long-term impact, and severe impact. High AQI levels may cause severe problems for people who already suffer from respiratory diseases; such people require intensive care, and precautions must be taken to minimize the impact on their health [74,75,76,77].

Table 10 The effect of high air quality index level on person’s health

Conclusion

In numerous classification problems, we face the class imbalance issue. This research focuses on dealing with imbalanced class distributions so that the classification algorithm does not compromise its performance. The proposed algorithm is based on the concept of the adjusting kernel scaling (AKS) method and deals with multi-class imbalanced datasets. A scalable kernel-based SVM classification algorithm has been proposed and presented in this study, in which the kernel function is selected based on the weighting criteria and the chi-square test. Using this kernel transformation function, the uneven class boundaries are expanded and the skewness of the data is compensated. For the experimental evaluation, we compared the accuracy-based classification results of the various classification algorithms on the CPCB dataset of Delhi to evaluate the performance of our proposed algorithm against the others. Our proposed algorithm achieves the highest accuracy, 99.66%, among all the classification algorithms, performing better than both the traditional SVM algorithm and the existing literature methods. These results show that our proposed algorithm deals efficiently with class imbalance while enhancing performance. In this study, we have also discussed the effect of air pollution on human health, which can only be assessed properly if the data are correctly classified. Accurate air quality classification by our proposed algorithm would therefore be useful for improving the existing preventive policies and would also help enhance the capabilities of effective emergency response in the event of the worst pollution.

In the future, this algorithm will be compared with recent variants of SVM. The proposed algorithm will also be tested on other datasets, and we will try to improve its computational methods as well.