Introduction

In machine learning, classifying new objects on the basis of similar instances is one of the crucial tasks. The classification task becomes more complicated when one class contains far fewer instances than the others [1]. The class imbalance problem is simply an unequal distribution of data among the classes: most of the data samples belong to one class (or a few classes), and the remaining samples belong to the others. In a binary class imbalance problem, one class contains the vast majority of the data samples while the other contains only a few [2]. The class with the larger number of samples is called the majority class, and the class with the smaller number of samples is called the minority class [3, 4].

In the field of machine learning, learning from imbalanced data is one of the most challenging tasks for classification algorithms. Data imbalance arises in almost every domain; it is a common problem across fields such as the medical domain [5, 6], the marketing domain, image classification [7], agriculture, the big data domain [8,9,10], IoT [11,12,13], and so on [14,15,16]. Class imbalance is therefore one of the critical issues in machine learning. If a classification algorithm is biased towards the majority class, its accuracy suffers badly: when a new sample comes for classification, it tends to be classified into the majority class because the classifier has low prediction accuracy on the minority class. This situation is highly inappropriate and a serious matter of concern [17].

Nowadays, a drastic change in air pollution levels can be observed [18]. The pollution level in metropolitan cities is rising, which is not a good sign. To make the environment healthier and more comfortable, air pollution should be kept to a minimum. Various factors contribute to air pollution [19,20,21,22]; some participate directly and some indirectly. These pollutants come from many sources, such as industry, transportation services, daily traffic, thermal power plants, household appliances, and garbage from industries, hospitals, and homes. A high level of air pollution can harm humans, animals, and plants alike [23]. Consequently, many new cases of respiratory illness have been observed, reflecting the impact of bad air quality on human beings. Air pollution also affects crop quality and overall crop production. Thus, to reduce its effects, we have to classify the pollution level correctly in real time. Over time, many researchers have contributed approaches that were accurate to some extent [24,25,26,27,28], but due to the imbalanced nature of the data, these models did not predict the classes correctly [29,30,31,32].

Building a classifier from imbalanced datasets is a difficult task. When classifying imbalanced data, the minority class always suffers because the classification model is biased towards the majority class [33, 34]; as a result, any new sample tends to be classified into the majority class. This inherent need and widespread interest motivate researchers to address class imbalance. Over time, many researchers have offered valuable solutions to this problem, which were beneficial and capable of solving it to some extent by improving classifier performance. However, most of the solutions were proposed for binary class imbalance problems and are not suitable for the multi-class case. These limitations motivate us to address the multi-class imbalance problem and to contribute a solution capable of solving it. Our contributions are:

  • The solution is designed to be well suited for both binary class and multi-class imbalance problems.

  • The solution is based on algorithmic modification rather than data resampling in the preprocessing phase.

  • In our solution, a new kernel selection function is proposed.

In this paper, a scalable kernel-based SVM (Support Vector Machine) classification algorithm is proposed that can deal with the multi-class data imbalance problem. First, an approximate hyperplane is obtained using the standard SVM algorithm. Then, the weighting factor and the parameter function for every support vector are calculated at each iteration; the values of these parameters are computed using the chi-square test. Next, the new kernel function, or kernel transformation function, is calculated. With the help of this kernel transformation function, the uneven class boundaries are expanded and the skewness of the data is compensated. The proposed algorithm can therefore correct the approximate hyperplane and resolve the performance degradation issue. In this study, we also discuss the impact of air pollution on human health.

The rest of this paper is organized as follows. Related work is reviewed in “Related work”. The datasets, the working of the proposed algorithm with its mathematical foundation, and the ten performance evaluation metrics are described in “Materials and methods”. The results of the standard methods, the existing literature methods, and the proposed classification algorithm are presented in “Results”. “Discussion” provides a comprehensive discussion of the classification results and of the effect of bad air quality on health. Concluding remarks and future scope are given in “Conclusion”.

Related work

Building a classifier from imbalanced datasets is a difficult task: the minority class always suffers because the classification model is biased towards the majority class [33, 34], and any new sample tends to be classified into the majority class. Over time, many strategies have been devised to overcome class imbalance. These strategies work either at the algorithm level or at the data level.

The data-level approach is based on resampling. Many classification algorithms, such as SVM, naïve Bayes, C4.5, and AdaBoost, use resampling techniques to deal with the data imbalance problem. Resampling consists of two subtasks, under-sampling and over-sampling [35, 36]. Under-sampling removes samples from the dataset, while over-sampling generates new synthetic data. Two effective under-sampling methods, BalanceCascade and EasyEnsemble, were proposed by Liu et al. [37]. In BalanceCascade, the samples that are correctly classified at each step are removed and do not participate in the further classification task. In EasyEnsemble, the majority class is divided into several subsets, which are then used as inputs for the learner. SMOTE (Synthetic Minority Over-sampling Technique) is an intelligent over-sampling approach [36] that generates synthetic samples for the minority classes. An adaptive over-sampling method based on data density was proposed by Wang et al. [38], and a binary class over-sampling approach based on the probability density function was proposed by Geo et al. [39]. Gu et al. [40] discussed data mining approaches for imbalanced datasets.
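
As a minimal sketch of data-level resampling (assuming the third-party imbalanced-learn package rather than any of the methods cited above; the toy dataset merely stands in for real data):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# A toy 90/10 imbalanced binary dataset standing in for a real one.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))

# Over-sampling: SMOTE synthesizes minority samples by interpolating
# between a minority point and one of its nearest minority neighbours.
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)
print("after SMOTE:", Counter(y_over))

# Under-sampling: randomly discard majority samples instead.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("after under-sampling:", Counter(y_under))
```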

Algorithm-level approaches bias the learning process so as to reduce the dominance of the majority class and improve classifier performance. They mainly consist of modifications to the algorithms, cost-sensitive learning, ensemble learning, and active learning.

The cost-sensitive learning approach is based on an asymmetric cost-assignment policy that minimizes the cost of misclassified samples. Cost minimization penalizes the misclassification of each class with a penalty, but assigning the desired penalty at every class level is an arduous task [41, 42]. Sun et al. [41] introduced three cost-sensitive boosting algorithms for classifying imbalanced datasets within the AdaBoost framework. A cost-sensitive SVM (support vector machine) was proposed by Wang [43] to deal with the data imbalance problem.
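
A common, algorithm-agnostic form of cost-sensitive learning, sketched here with scikit-learn's SVC rather than the cited methods, assigns asymmetric misclassification costs through per-class weights:

```python
from sklearn.svm import SVC

# 'balanced' weights class j by N / (C * n_j), i.e. inversely proportional
# to its frequency; an explicit dict such as {0: 1, 1: 10} also works.
clf = SVC(kernel="rbf", class_weight="balanced")
# clf.fit(X_train, y_train)  # X_train, y_train are placeholders
```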

To deal with the data imbalance problem, some researchers have made modifications at the algorithm level, which can be done by optimizing the classifier itself. A fuzzy-based SVM was proposed by Batuwita and Palade [44] to handle imbalanced data in the presence of noise and outliers. Cano et al. [45] proposed imbalanced data classification based on weighted gravitation. Wu and Chang [46, 47] proposed adjusted class-boundary alignment to improve SVM performance.

The ensemble learning approach is designed to increase the accuracy of the classification algorithm. Several classifiers are used to train the model, and their decision outputs are combined into a single prediction used for decision-making [3]. Bagging and boosting are the vital machine learning algorithms in the ensemble learning paradigm [3]. An active sample selection technique was used by Oh et al. [48] to resolve the data imbalance problem. Liu et al. [49] integrated sampling techniques (both under-sampling and over-sampling) with the SVM to improve classifier performance.

The active learning approach is a special case of machine learning in which new data points are labeled with the desired outputs by interactively querying a user [50]. CBAL, a certainty-based active learning algorithm, was proposed by Fu and Lee in 2013 [51] to solve the issue of data imbalance. Based on the existing literature, the classification algorithms used to deal with the data imbalance problem are summarized in Table 1.

Table 1 Classification algorithms to deal with data imbalance problem

Materials and methods

In this section, we describe the materials and methods used in the experimental analysis. The section consists of three subsections: dataset description, the proposed algorithm, and statistical measures. The first subsection discusses the sensor-based CPCB dataset of Delhi. The second presents the proposed scalable kernel-based SVM classification algorithm with its mathematical foundation. The third briefly describes the performance evaluation metrics.

Data

For this study, we have taken the sensor-based CPCB data of Delhi, the most polluted city in India. We chose this benchmark data because the CPCB (Central Pollution Control Board) continuously monitors air quality with more than 200 base stations across approximately 20 states, and all the data from these stations are openly accessible on the CPCB website. In Delhi, 37 base stations monitor data continuously (24 × 7).

As we know, India has the second largest population in the world after China [52, 53]. Massive population growth is one of the key reasons for increasing pollution levels. Delhi is the capital and an industrial hub of India, so its population density is higher than that of other cities. As a result, pollution caused by industrial waste and vehicles is the main reason for Delhi's rising pollution level [54, 55]. High discharge of various gases, i.e., NO2, NH3, NO, CO2, O3, and CO, together with additional factors such as wind direction, wind speed, temperature, and relative humidity, makes the air of Delhi heavily polluted and toxic. Toxic and other harmful particles dissolve in the air, so living in such a polluted environment may cause severe diseases; in the most severe cases, even death is possible. We therefore have to take preventive measures to enhance quality of life by reducing pollution levels for human well-being.

For the experimental analysis, the dataset of the capital Delhi from the Indian Central Pollution Control Board (CPCB) has been taken [56]. The dataset was collected by sensor-based devices placed at multiple locations in Delhi, as shown in Fig. 1; the figure is plotted using the longitude and latitude of the data collection points in the Delhi region, with the 37 data collection centers marked by red circles. We have taken the data from January 1, 2019, to October 1, 2020, recorded twenty-four times a day, i.e., on an hourly basis. The CPCB air quality dataset is enriched with numerous relevant features that can play an essential role in the air quality classification task: PM10 (Concentration of Inhalable Particles), SO2 (Sulfur Dioxide), PM2.5 (Fine Particulate Matter), O3 (Ozone), NOx (Nitrogen Oxides), NO2 (Nitrogen Dioxide), NO (Nitrogen Monoxide), NH3 (Ammonia), CO (Carbon Monoxide), AQI (Air Quality Index), WD (Wind Direction), C6H6 (Benzene), WS (Wind Speed), RH (Relative Humidity), SR (Solar Radiation), BP (Bar Pressure), and AT (Absolute Temperature). The dataset taken for the classification task contains 16 columns and 332,880 rows in total, i.e., 16 columns and 8760 rows at each of the 37 base stations taken into consideration.

Fig. 1 Air quality data collection centers in the Delhi region

In this research work, the classification task has been performed on the CPCB air quality dataset, which contains various attributes. Only those attributes responsible for high air pollution levels have been taken into consideration: the concentration of inhalable particles (PM10), sulfur dioxide (SO2), fine particulate matter (PM2.5), ozone (O3), nitrogen oxides (NOx), nitrogen dioxide (NO2), nitrogen monoxide (NO), ammonia (NH3), carbon monoxide (CO), the Air Quality Index (AQI), and so on.

Table 2 presents the dataset features that participate in the classification task. Each feature is described by several parameters: the variable name with its abbreviation, the nature of the data, the measuring unit, the period of data collection, the variable type, and the data extraction source.

Table 2 Substantial features of the dataset: a quick look

Table 3 describes the dataset features used in the classification task in terms of several parameters: the variable name with its mean value, the measuring unit, the standard deviation, and the actual and prescribed ranges of each variable.

Table 3 Variable description

Table 4 describes the data that resulted from preprocessing and was taken for the experimental analysis. The preprocessed data contains six classes, 270,596 samples, and ten attributes per sample. The class-wise distribution of the dataset is 13,452, 47,910, 93,167, 55,045, 30,421, and 30,601 samples for classes one to six, giving a class imbalance ratio of 6.92.

Table 4 Preprocessed dataset description
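
As a quick sanity check of Table 4 (not part of the original pipeline), the stated class counts reproduce both the total sample size and the imbalance ratio:

```python
counts = [13452, 47910, 93167, 55045, 30421, 30601]  # classes one to six
print(sum(counts))                 # 270596 samples in total
print(max(counts) / min(counts))   # 6.9259..., the 6.92 ratio stated above
```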

Table 5 describes the Air Quality Index (AQI), giving the AQI range, the corresponding AQI label, and the class level. The labeling is divided into six parts covering the range from 0 to more than 400 [56]. The mapping between the CPCB dataset and the AQI ranges is also established here.

Table 5 Air quality description

Proposed methodology

The primary aim of the proposed algorithm is to deal with the data imbalance problem efficiently. The proposed algorithm is based on the concept of the adjusting kernel scaling (AKS) method [57] and is designed to deal with multi-class imbalanced datasets. In this paper, we propose an SVM classifier integrated with the adjusting kernel scaling method. A detailed discussion of the proposed algorithm follows in this section.

Basic support vector machine algorithm (SVM)

Support Vector Machine (SVM) is a widely used and well-known machine learning algorithm for data classification, proposed by Vapnik et al. [58] in 1995. The primary aim of the algorithm is to map the input data into a high-dimensional space with the help of a kernel function so that the classes become linearly separable [58,59,60]. In the case of a binary class problem, the maximum-margin separating hyperplane is given by:

$$w.x+b=0$$
(1)

Based on the optimal pair \(({w}_{0},{b}_{0})\), the decision function of the SVM is represented by:

$$f\left(x\right)=\sum _{j\in SV}{\lambda }_{j}{y}_{j } \langle x.{x}_{j } \rangle +b$$
(2)

where \({\lambda }_{j}\) is the Lagrange multiplier associated with support vector \({x}_{j}\), \({y}_{j}\) is its class label, and \(SV\) denotes the set of support vector indices.

Figure 2 shows the hyperplane with maximum separating margin and support vectors in the SVM algorithm paradigm.

Fig. 2 Hyperplane with support vectors in the SVM algorithm paradigm

For a higher-dimensional feature space, the inner product \( \langle x.{x}_{j} \rangle \) is replaced by a kernel function \(K({x}_{j},{x}_{i})\), which computes the inner product of the samples under an implicit feature map \(\phi\):

$$K({x}_{j},{x}_{i})= \langle \phi ({x}_{j}).\phi ({x}_{i}) \rangle $$
(3)

Kernel function selection

In this step, the kernel function chosen from the standard SVM is used to approximately compute the positions of the class boundaries. Initially, the dataset \(P\) is split into the class subsets \({P}^{1},{P}^{2},{P}^{3},\dots ,{P}^{C}\), and then the kernel transformation function defined in the equation below is applied.

$$f\left(x\right)= \left\{\begin{array}{l}{e}^{-{z}_{1}{h\left(x\right)}^{2}},\quad if\ x\in {P}^{1} \\ {e}^{-{z}_{2}{h\left(x\right)}^{2}},\quad if\ x\in {P}^{2} \\ \qquad \vdots \\ {e}^{-{z}_{C}{h\left(x\right)}^{2}},\quad if\ x\in {P}^{C}\end{array}\right.$$
(4)

where \(h(x)=\sum _{j\in SV}{\lambda }_{j}{y}_{j} \langle x.{x}_{j} \rangle +b\) (with \({\lambda }_{j}\) the Lagrange multiplier of support vector \({x}_{j}\)), \({P}^{j}\) is the jth class subset of the training set, and the value of the parameter \({z}_{j}\), for \(j=1,2,\dots ,C\), is computed from the chi-square test \({(\chi }^{2})\) as explained in “Computing the parameter \({z}_{j}\)”.
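
A minimal sketch of Eq. (4), assuming the h(x) values come from an already-trained SVM's decision function and that z holds one \(z_j\) per class (the names are illustrative, not from the original implementation):

```python
import numpy as np

def conformal_factor(h_values, class_indices, z):
    """f(x) = exp(-z_j * h(x)^2) for samples belonging to class subset P^j."""
    h = np.asarray(h_values, dtype=float)
    j = np.asarray(class_indices, dtype=int)
    return np.exp(-np.asarray(z)[j] * h ** 2)

# Two classes with z = [0.3, 0.7]:
print(conformal_factor([1.2, -0.4], [0, 1], [0.3, 0.7]))
```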

Chi-square test

The chi-square test \({(\chi }^{2})\) is an important statistical test applied to sets of categorical features to determine the frequency-distribution-based association among groups of categorical features; in other words, it evaluates the correlation among the groups. Here, the chi-square test is used to establish the relationship between the sample count of each category and the parameter \(z_{j}\). The chi-square test \({(\chi }^{2})\) is formulated as:

$${\chi }^{2}= \sum \frac{{{(f}_{0}-{f}_{e})}^{2}}{{f}_{e}}$$
(5)

where \(f_{o}\) and \(f_{e}\) denote the observed and expected frequencies, respectively.
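
Equation (5) can be computed directly and cross-checked with SciPy; the sketch below uses the class counts of Table 4 as observed frequencies against a uniform expected distribution:

```python
import numpy as np
from scipy.stats import chisquare

f_obs = np.array([13452, 47910, 93167, 55045, 30421, 30601], dtype=float)
f_exp = np.full(6, f_obs.sum() / 6)          # uniform (optimal) distribution

chi2 = ((f_obs - f_exp) ** 2 / f_exp).sum()  # Eq. (5)
print(chi2)
print(chisquare(f_obs, f_exp).statistic)     # same value via scipy.stats
```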

Computing the weighting factor

Setting the weighting factor is an important and difficult issue when dealing with the class imbalance problem: assigning the weights appropriate for overcoming the imbalance is complex. A simple way to deal with it is to give less weight to the majority class and more weight to the minority class while satisfying the condition \({z}_{j}\in (0,1)\).

The following weighting-factor formulation has been used in the proposed algorithm to deal with the multi-class imbalance problem; in other words, it is the method used to compensate for the uneven data distribution:

$${w}_{j}= \frac{N/{n}_{j}}{\sum _{j=1}^{C}N/{n}_{j}}$$
(6)

where \(N\) and \(C\) denote the training sample size and the number of categories, respectively, and \({n}_{j}\) is the sample size of category \(j\), with \(j=1,2,\dots ,C\).
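
Equation (6) is straightforward to compute; using the Table 4 class counts, the sketch below shows that the smallest (minority) class receives the largest weight:

```python
import numpy as np

n = np.array([13452, 47910, 93167, 55045, 30421, 30601], dtype=float)
N = n.sum()
w = (N / n) / (N / n).sum()  # Eq. (6): w_j in (0, 1), summing to one
print(w)                     # class one (the minority) gets the largest w_j
```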

Computing the parameter \({\boldsymbol{z}}_{\boldsymbol{j}}\)

Let \(P\) be the dataset, containing \(N\) samples in \(C\) categories. The value of the parameter \({z}_{j}\) is calculated from the weighting factor of Eq. (6) and the chi-square test of Eq. (5). The chi-square value \({(\chi }^{2})\) under the optimal (uniform) distribution is:

$${\chi }^{2}= \sum _{j=1}^{C}\frac{{{(n}_{j}-N/C)}^{2}}{N/C}$$
(7)

where \({n}_{j}\) is the number of samples in the jth category and \(j=1,2,\dots ,C\).

Let \({X}_{j}=\frac{{({n}_{j}-N/C)}^{2}}{N/C}\).

Then,

$${\chi }^{2}= \sum _{j=1}^{C}{X}_{j}$$
(8)

So, the parameter \({z}_{j}\) can be defined as

$${z}_{j}= {w}_{j}\times \frac{{X}_{j}}{{\chi }^{2}}$$
(9)

Substituting the value of \({\chi }^{2}\) from Eq. (8):

$${z}_{j}= {w}_{j}\times \frac{{X}_{j}}{\sum _{j=1}^{C}{X}_{j}}$$
(10)
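
Putting Eqs. (7) to (10) together, the per-class terms \(X_j\), the chi-square value, and the parameters \(z_j\) follow directly from the class counts and the weights of Eq. (6):

```python
import numpy as np

n = np.array([13452, 47910, 93167, 55045, 30421, 30601], dtype=float)
N, C = n.sum(), len(n)
w = (N / n) / (N / n).sum()      # Eq. (6)

X = (n - N / C) ** 2 / (N / C)   # per-class terms X_j, Eq. (7)
z = w * X / X.sum()              # Eq. (10): z_j = w_j * X_j / chi^2
print(z)
```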

Description of the proposed algorithm

The flow chart of the proposed algorithm is shown in Fig. 3. First, the CPCB air quality dataset is cleaned, and the preprocessed data is fed to the classification algorithm to obtain the initial partition. In the second step, the weighting factor \({w}_{j}\) and the parameter \({z}_{j}\) are calculated for every support vector in each iteration; the values of these parameters are computed using the chi-square test. In the next step, the kernel transformation function is calculated, and finally the classification model is retrained using the newly computed kernel matrix \({K}_{mt}\).

Fig. 3 Flow chart of the proposed algorithm

The proposed classification algorithm consists of 11 steps, which are described in Algorithm 1.

Algorithm 1
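
The following is a simplified, single-pass sketch of Algorithm 1 under stated assumptions (a binary problem, an RBF base kernel, and one conformal scaling step \(K'(a,b)=f(a)f(b)K(a,b)\)); the full 11-step algorithm iterates the scaling per support vector, so details may differ:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def aks_svm_sketch(X, y, z, gamma=0.1):
    """One conformal-scaling pass: K'(a, b) = f(a) * f(b) * K(a, b)."""
    y = np.asarray(y)
    K = rbf_kernel(X, X, gamma=gamma)           # base kernel matrix
    base = SVC(kernel="precomputed").fit(K, y)  # step 1: approximate hyperplane
    h = base.decision_function(K)               # h(x) as used in Eq. (4)
    f = np.exp(-np.asarray(z)[y] * h ** 2)      # per-sample factor f(x)
    K_new = np.outer(f, f) * K                  # new kernel matrix K_mt
    return SVC(kernel="precomputed").fit(K_new, y)  # retrain on K_mt

# Toy usage; z would come from Eq. (10), and y must hold 0/1 class indices.
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)
model = aks_svm_sketch(X, y, z=np.array([0.4, 0.6]))
```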

Statistical analysis

In this section, the various statistical measures used to evaluate the performance of the algorithms are discussed. Statistical analysis is essential for picking the best algorithm on the basis of performance. In this paper, ten statistical measures have been chosen to compare the proposed algorithm with the existing ones: accuracy, precision, recall, F1-score, TNR, NPV, FNR, FPR, FDR, and FOR [61,62,63,64]. With the help of these ten evaluation measures, we can determine which algorithm performs the classification task most effectively and efficiently.

Accuracy

Accuracy, with respect to the classification task, is the percentage of instances that are correctly classified; in other words, it is the percentage ratio of correctly predicted instances over the entire test set. Accuracy is defined in the equation below [61].

$$Accuracy=\frac{(TP+TN)}{(TP+TN+FP+FN)} \times 100 \%$$
(11)

where \(TP\) and \(FP\) are the numbers of true positives and false positives, respectively, and \(FN\) and \(TN\) are the numbers of false negatives and true negatives, respectively.

Precision

Precision, with respect to the classification task, quantifies how many of the predicted positive instances actually belong to the positive class; in other words, it is the ratio of true positives over the total number of true positives and false positives. Precision is defined in the equation below [61, 63].

$$Precision=\frac{(TP)}{(TP+FP)}$$
(12)

where \(TP\) and \(FP\) are the numbers of true positives and false positives, respectively.

Recall

Recall, with respect to the classification task, quantifies how many of all the positive instances in the dataset are correctly predicted as positive; in other words, it is the ratio of true positives over the total number of true positives and false negatives. Recall is defined in the equation below [63].

$$Recall=\frac{(TP)}{(TP+FN)}$$
(13)

where \(TP\) and \(FN\) are the numbers of true positives and false negatives, respectively.

F1-score

The F1-score is also known as the F-measure or F-score. With respect to the classification task, it quantifies the balance between recall and precision; in other words, it is twice the product of precision and recall divided by their sum. The F1-score is defined in the equation below [62].

$$f1=\frac{2\times \left(precision \times recall\right)}{precision+recall}$$
(14)

True negative rate (TNR)

The TNR, with respect to the classification task, quantifies the specificity, or true negative rate; in other words, it is the ratio of true negatives over the total number of true negatives and false positives. The TNR is defined in the equation below [61, 64].

$$TNR =\frac{(TN)}{(TN+FP)}$$
(15)

where \(TN\) and \(FP\) are the numbers of true negatives and false positives, respectively.

Negative predictive value (NPV)

The NPV, with respect to the classification task, quantifies the negative predictive value; in other words, it is the ratio of true negatives over the total number of true negatives and false negatives. The NPV is defined in the equation below [61, 64].

$$NPV=\frac{(TN)}{(TN+FN)}$$
(16)

where \(TN\) and \(FN\) are the numbers of true negatives and false negatives, respectively.

False negative rate (FNR)

The FNR, with respect to the classification task, quantifies the miss rate; in other words, it is the ratio of false negatives over the total number of false negatives and true positives. The FNR is defined in the equation below [61, 64].

$$FNR =\frac{(FN)}{(FN+TP)}$$
(17)

where \(TP\) and \(FN\) are the numbers of true positives and false negatives, respectively.

False positive rate (FPR)

The FPR, with respect to the classification task, quantifies the fall-out; in other words, it is the ratio of false positives over the total number of false positives and true negatives. The FPR is defined in the equation below [61, 64].

$$FPR=\frac{(FP)}{(FP+TN)}$$
(18)

where \(FP\) and \(TN\) are the numbers of false positives and true negatives, respectively.

False discovery rate (FDR)

The FDR is the ratio of false positives over the total number of false positives and true positives. The FDR is defined in the equation below [61, 64].

$$FDR=\frac{(FP)}{(FP+TP)}$$
(19)

where \(FP\) and \(TP\) are the numbers of false positives and true positives, respectively.

False omission rate (FOR)

The FOR is the ratio of false negatives over the total number of false negatives and true negatives. The FOR is defined in the equation below [61, 64].

$$FOR=\frac{(FN)}{(FN+TN)}$$
(20)

where \(TN\) and \(FN\) are the numbers of true negatives and false negatives, respectively.
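
All ten measures follow from a single confusion matrix; the short check below uses made-up binary counts, following Eqs. (11) to (20):

```python
TP, FP, FN, TN = 80, 10, 5, 105  # made-up binary confusion-matrix counts

accuracy  = (TP + TN) / (TP + TN + FP + FN) * 100          # Eq. (11)
precision = TP / (TP + FP)                                 # Eq. (12)
recall    = TP / (TP + FN)                                 # Eq. (13)
f1        = 2 * precision * recall / (precision + recall)  # Eq. (14)
tnr = TN / (TN + FP)   # Eq. (15), specificity
npv = TN / (TN + FN)   # Eq. (16)
fnr = FN / (FN + TP)   # Eq. (17), miss rate
fpr = FP / (FP + TN)   # Eq. (18), fall-out
fdr = FP / (FP + TP)   # Eq. (19)
fomr = FN / (FN + TN)  # Eq. (20); 'for' is a Python keyword, hence fomr
print(accuracy, precision, recall, f1, tnr, npv, fnr, fpr, fdr, fomr)
```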

Results

In this section, we present the classification results of the following algorithms: the AdaBoost algorithm (ADB) [65,66,67], the Multilayer Perceptron algorithm (MLP) [68,69,70], the Gaussian NB algorithm (GNB) [71,72,73], the standard Support Vector Machine algorithm (SVM) [58,59,60], existing literature methods, and the proposed scalable kernel-based SVM algorithm.

Model comparison

Identifying the best classification model capable of dealing with class imbalance problems is a complex task. The CPCB air quality dataset has been taken for the experimental analysis. In Fig. 4, the x-axis denotes the various classes and the y-axis the number of data samples in each class. Figure 4 makes clear that our dataset has an uneven class distribution, i.e., it is imbalanced, which makes such a situation more challenging for traditional classification models. The class-wise distribution of the dataset is: 13,452 samples in the first class, 47,910 in the second, 93,167 in the third, 55,045 in the fourth, 30,421 in the fifth, and 30,601 in the last class. The dataset has a class imbalance ratio of 6.92.

Fig. 4 Class-wise distribution of the CPCB dataset

The primary aim of this research work is to find the best classification model for dealing with the class imbalance problem. Over time, many researchers have offered valuable solutions to this problem; however, most were proposed for binary class imbalance problems and are not suitable for the multi-class case. These limitations motivated us to modify the algorithm so that it can efficiently deal with both multi-class and binary class imbalance problems without compromising performance. This classification will also be helpful in building solutions for proficient healthcare.

For the experimental evaluation, four well-established traditional classification algorithms and existing literature methods have been compared with our proposed algorithm to determine its suitability, correctness, and efficiency. The performance of all the classification algorithms has been measured with ten performance validation measures, using a tenfold cross-validation policy.

Figure 5 gives an overview of the algorithms used in the classification task and compared against our proposed classifier: ADB (AdaBoost algorithm), MLP (Multilayer Perceptron algorithm), GNB (Gaussian NB algorithm), the standard SVM (Support Vector Machine algorithm), existing literature methods, and the proposed scalable kernel-based SVM algorithm.

Fig. 5 Classification models for experimental evaluation

Performance evaluation of classification algorithms

The performance evaluation of the classification algorithms has been divided into two parts. In the first part, the CPCB air quality dataset of the whole Delhi region, gathered from the 37 distributed base stations, has been taken, and all the data is served to the classification task as a single file. The classification results of all algorithms are reported in terms of precision, recall, F1-score, TNR, NPV, FNR, FPR, FDR, FOR, and accuracy, with validation based on classification accuracy. Since our dataset has an imbalanced class distribution, the performance of the classification algorithms may suffer. All standard models perform well except the AdaBoost classifier (ADB), which achieves the lowest accuracy, 59.72%, among all the classifiers. The standard SVM, MLP, and Gaussian NB classifiers perform quite well under the imbalanced class distribution but still fall behind our proposed SVM classifier, which achieves the highest accuracy, 99.66%, among all the models. The detailed classification results are shown in Table 6.

Table 6 Performance evaluation of classification algorithms I

We have also compared the proposed algorithm against existing literature methods on the CPCB dataset coming from the 37 places in Delhi. The proposed algorithm achieves the highest accuracy, 99.66%, among the existing literature methods, and it deals with the class imbalance problem without compromising performance. The comparison of the existing literature methods with the proposed classification algorithm is presented in Table 7.

Table 7 Performance evaluation of existing literature methods vs proposed classification algorithm

In the second part of the performance evaluation, we have taken the individual data of each CPCB base station, collected at 37 places in Delhi. The 37 data files have been used as input datasets for the classification task with the various classification algorithms. The acronyms used in Table 8 are defined in Appendix 1. Our proposed algorithm performs exceptionally well in this rigorous analysis on all the datasets A1 to A37, achieving the highest average accuracy, 99.72% (averaged over A1 to A37), among all the algorithms, and it deals with the class imbalance problem without compromising performance. The detailed results are shown in Table 8.

Table 8 Performance evaluation of classification algorithms II

Discussion

Numerous associated factors may play a crucial role in affecting air quality; some participate directly and some indirectly in polluting the air. Pollutants that are air-soluble are hazardous to human health. Poor diffusion conditions are one of the crucial factors that increase pollutant levels; diffusion is the movement of air particles from a region of high concentration to a region of low concentration. Before the classification task, the dataset is preprocessed; preprocessing drops missing values and unusual objects from the dataset. The dataset consists of numerous relevant features: PM10 (Concentration of Inhalable Particles), SO2 (Sulfur Dioxide), PM2.5 (Fine Particulate Matter), O3 (Ozone), NOx (Nitrogen Oxides), NO2 (Nitrogen Dioxide), NO (Nitrogen Monoxide), NH3 (Ammonia), CO (Carbon Monoxide), AQI (Air Quality Index), WD (Wind Direction), C6H6 (Benzene), WS (Wind Speed), RH (Relative Humidity), SR (Solar Radiation), BP (Bar Pressure), and AT (Absolute Temperature). The correlation on the preprocessed data is calculated to find the relationship between the class and the contributing factors.

Figure 6 shows the relationship between the class and the contributing factors. With the help of the correlations, we can easily find which factors are highly correlated with the class.

Fig. 6 Correlation coefficients of the contributing factors
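
As a hedged sketch of this correlation step (the toy DataFrame and its column names merely stand in for the preprocessed CPCB data):

```python
import pandas as pd

# Toy frame standing in for the preprocessed data; real columns differ.
df = pd.DataFrame({
    "PM2.5": [60, 120, 250, 40], "NO2": [30, 55, 80, 20],
    "CO": [0.8, 1.4, 2.1, 0.5], "class": [2, 3, 5, 1],
})
# Correlation of every factor with the class label, strongest first.
print(df.corr(numeric_only=True)["class"].sort_values(ascending=False))
```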

Performance evaluation of classification algorithms

For the experimental analysis, the dataset of the capital Delhi from the Indian Central Pollution Control Board (CPCB) has been taken. The data from January 1, 2019, to October 1, 2020, has been used for training and testing, with a tenfold cross-validation policy. Cross-validation is a technique for assessing models by partitioning the given data sample into training and testing sets: the training set is used to train the model, and the testing set to evaluate it. In k-fold cross-validation, the data sample is randomly divided into k subsamples of equal size; k−1 subsamples are used for training the model and a single subsample for validation. This procedure is repeated k times (folds), so each subsample is used exactly once for validation, and a single estimate is produced by averaging the results of all k folds. The algorithms used in the classification task are ADB (AdaBoost algorithm), MLP (Multilayer Perceptron algorithm), GNB (Gaussian NB algorithm), the standard SVM (Support Vector Machine algorithm), existing literature methods, and the proposed scalable kernel-based SVM algorithm.
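
A sketch of this tenfold protocol with scikit-learn, on a toy dataset standing in for the CPCB data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(), X, y, cv=cv, scoring="accuracy")
print(scores.mean())  # the single averaged estimate reported per model
```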

Figure 7 presents the experimental results (statistical-measure-based, and existing literature methods versus the proposed algorithm) of the various classification algorithms on the CPCB dataset of the whole Delhi region. The figure makes clear that our proposed algorithm, with the highest accuracy of 99.66%, outperforms all the other classification algorithms and existing literature methods, including the traditional SVM. The results thus also show that our proposed algorithm deals with class imbalance problems without compromising performance.

Fig. 7 Results of the classification algorithms: a statistical measures based I, b statistical measures based II, c existing literature methods vs proposed algorithm

Figure 8 plots, as a bar graph, the accuracy-based classification results of the various classification algorithms on the CPCB datasets A1, A10, A20, A30, and A37 of the Delhi region. The figure shows that our proposed algorithm achieves the highest accuracy across all these areas, again outperforming the other classification algorithms, including the traditional SVM. The results thus confirm that our proposed algorithm deals with class imbalance problems while enhancing performance.

Fig. 8 Accuracy-based results of the classification algorithms II

Effect on healthcare

Bad air quality can impact individuals' health and quality of life, causing problems ranging from minor to severe. It may affect an individual's cardiovascular (circulatory) system, respiratory system, excretory system (kidneys and urinary tract), nervous system, endocrine system, digestive system, lymphatic system, integumentary system (skin), and ophthalmic system.

Table 9 shows the AQI ranges with their associated labels and the possible impact of each level on health [56]. The AQI scale is divided into six ranges, starting at 0–50 and ending at greater than 400.

Table 9 Air quality index range with possible health impact

The consequences of high AQI levels for an individual's health are described in Table 10, divided into three parts: short-term impact, long-term impact, and severe impact. High AQI levels may cause severe problems for people who already suffer from respiratory diseases; such people require intensive care, and precautions must be taken to minimize the impact on their health [74,75,76,77].

Table 10 The effect of high air quality index level on person’s health

Conclusion

In numerous classification problems, we face the class imbalance issue. This research focuses on dealing with imbalanced class distributions so that the classification algorithm does not compromise its performance. The proposed algorithm is based on the concept of the adjusting kernel scaling (AKS) method and deals with multi-class imbalanced datasets. A scalable kernel-based SVM classification algorithm has been proposed and presented in this study, in which the kernel function is selected based on the weighting criteria and the chi-square test. Using this kernel transformation function, the uneven class boundaries are expanded and the skewness of the data is compensated. For the experimental evaluation, we compared the accuracy-based classification results of the various classification algorithms on the CPCB dataset of Delhi to evaluate the performance of our proposed algorithm against the others. Our proposed algorithm achieves the highest accuracy, 99.66%, among all the classification algorithms, performing better than both the traditional SVM algorithm and the existing literature methods. These results show that our proposed algorithm deals efficiently with class imbalance while enhancing performance. In this study, we have also discussed the effect of air pollution on human health, which can only be assessed properly if the data are correctly classified. Accurate air quality classification by our proposed algorithm would therefore be useful for improving the existing preventive policies and would also help enhance the capabilities of effective emergency response in the event of the worst pollution.

In the future, this algorithm will be compared with recent variants of SVM. The proposed algorithm will also be tested on other datasets, and we will try to improve its computational methods as well.