1 Introduction

After the first case of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) was reported in Wuhan, China in December 2019, the virus spread exponentially, reaching approximately 215 countries worldwide by 28th June 2021 [1]. According to the WHO's report, it had infected over 180,654,652 people and caused 3,920,463 confirmed deaths globally by 28th June 2021. According to the Ministry of Health and Family Welfare, Government of India, there were 5,72,994 active cases, 2,93,09,607 cured and discharged cases, and 3,96,730 deaths in India by 28th June 2021 [2]. Governments made every effort to control the spread of COVID-19, including lockdowns, social distancing measures, personal hygiene, testing, tracking, isolation, and trials of drugs already used for other diseases such as malaria, HIV, and tuberculosis. Finally, vaccination became the main tool for controlling the spread of COVID-19. In India, a total of 32,36,63,297 vaccine doses had been administered up to 28th June 2021, with 4.3% of the population fully vaccinated and 20% partially vaccinated [2].

Despite all available precautions, the second surge in India was unexpected and affected a large percentage of the population. It started in the first week of April 2021 and declined after the first week of June 2021. In these nearly two months, the country struggled with shortages of hospital beds, oxygen cylinders, essential medicines, and vaccines. On 30 April 2021, India became the first country to report over 4,00,000 new infections in a single day (24 h). This unexpected speed of infection created a huge demand for basic essentials.

It has been observed that both surges occurred under particular climate conditions in India. It is therefore necessary to study the impact of weather and atmospheric factors on the spread of COVID-19. Along with weather parameters, the impact of air pollutants on COVID-19 is equally important to analyze.

Initial research discusses the transmission of COVID-19 from bats to humans, originating from the seafood market in Wuhan, China [1, 3,4,5]. However, its route of transmission still requires scientific exploration. Close contact between humans rapidly increases the transmission rate, through surfaces and air [6]. Some recent studies have reported the presence of coronavirus in the air, fecal swabs, and blood of active cases [7, 8]. Changes in climate conditions provide a favorable environment for viruses to grow, resulting in common flu. Particular climate conditions also affect the transmission rate of a pandemic by presenting emergent or hostile conditions for humans. This has been confirmed for past infectious diseases as well as for the transmission of COVID-19 in some countries. For example, the transmission rate of influenza was high at low temperature and humidity. It was also confirmed for severe acute respiratory syndrome (SARS) in July 2003, which was affected by climate change [9].

Since COVID-19 has been reported to have a genetic sequence similar to that of SARS, its transmission rate is highly likely to be affected by changes in weather parameters [10]. The effect of climate factors on the spread of COVID-19 in different countries has been established in some recent studies [11,12,13,14,15,16,17,18,19]. Besides atmospheric factors, the air pollution level may also affect the transmission of COVID-19, as reported for the sharp rise of COVID-19 cases in Italy [20]. The effect of nitrogen oxide concentration on COVID-19 fatality has also been reported in Italy [7], as has the effect of lockdown and air pollution level on the spread rate of COVID-19 in Wuhan, China [21]. Although atmospheric factors and air pollution levels are highly correlated, no study of their combined effect on the transmission rate of COVID-19 in India has been reported yet. The present work covers the combined effect of atmospheric factors and measures of air pollution on the spread of COVID-19 in India during the period from 28th March 2020 to 20th May 2021, which includes the two major surges.

In the past few years, machine learning has become a very significant tool in the analysis and design of prediction models [22,23,24]. Many machine-learning models have been designed and applied efficiently in the analysis of COVID-19 cases [25,26,27,28,29]. Apart from the well-known deep learning methods, tree-based learning (extreme gradient boosting machine) has been successfully applied to find associations between microRNAs (miRNAs) and human diseases. This motivates us to design the twined gradient boosting machine (GBM) model to analyze the correlation of atmospheric factors (temperature, humidity, pressure, and wind speed) and air pollution (maximum and minimum of PM2.5 and PM10) with the daily infection, recovery, and death cases of COVID-19 in different states or places of India.

This paper makes the following contributions:

  i. Data for the period from 25th March 2020 to 20th June 2021 have been collected and analyzed to confirm the suitability of the dataset.

  ii. Analysis of the impact of atmospheric and air pollutant parameters on the spread of the disease.

  iii. Analysis of the impact of atmospheric and air pollutant parameters on the recovery rate of patients.

  iv. Analysis of the impact of atmospheric and air pollutant parameters on the mortality rate of the disease.

  v. The worst-affected states were analyzed and tested separately for the spread, recovery, and mortality rates of COVID-19.

The rest of the paper is organized as follows. Section 2 describes the process of data collection and its analysis. Section 3 presents the proposed gradient boosting machine (GBM) approach. The experimental setup and results are presented in Sect. 4. Section 5 discusses the results, and Sect. 6 summarizes the critical findings and future research directions in this domain.

2 Data Collection and Analysis

Data on eight atmospheric factors (maximum and minimum temperature, maximum and minimum air pressure, maximum and minimum air humidity, and maximum and minimum wind speed) and four measures of air pollution (maximum and minimum of PM2.5 and PM10) for 21 significant states or places of India were collected from the India Meteorological Department (IMD) and the Central Pollution Control Board (CPCB) of India on a daily basis for the period from 14th March 2020 to 20th May 2021 (433 days) [30, 31]. The COVID-19 cases (number of infected, recovered, and deaths) for the same states were collected from an open-access source and from information published by the Ministry of Health and Family Welfare, Government of India [32, 33]. The data of some states and union territories were not significant for COVID-19 and were therefore not considered. The atmospheric factors, measures of air pollution, and cases of COVID-19 were used in combination for further analysis. Missing or doubtful values of the atmospheric factors and air pollution measures for some states on some days were replaced using the previous-value imputation technique. The variations of minimum and maximum temperature and humidity after imputation are shown in Fig. 1. The minimum and maximum of PM10 and PM2.5 are shown in Fig. 2. The statistics of the dataset are presented in Table 1.
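The imputation step can be illustrated with a minimal Python sketch. It assumes that "previous imputation" means carrying the last observed value forward; the function name `fill_with_previous` and the sample temperatures are hypothetical and are not taken from the study's data or code.

```python
from typing import List, Optional


def fill_with_previous(values: List[Optional[float]]) -> List[Optional[float]]:
    """Replace each missing entry (None) with the most recent observed value.

    Leading gaps, which have no earlier observation, are left as None.
    """
    filled: List[Optional[float]] = []
    last_seen: Optional[float] = None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


# Hypothetical daily maximum temperatures (in °C) for one state, with gaps.
max_temp = [31.2, None, 32.0, None, None, 30.5]
print(fill_with_previous(max_temp))
```

Under this assumption each gap inherits the reading of the preceding day, so the series keeps its length of 433 daily values per state.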

Fig. 1 The variation of the temperature (in °C) and humidity (in %)

Fig. 2 The variation of PM2.5 and PM10 on a daily basis

Table 1 Basic statistics of atmospheric and air pollution dataset

The variation in cases of COVID-19 after imputation is shown in Fig. 3. Variations of pressure and wind speed are presented in Table 1.

Fig. 3 The variation in COVID-19 cases in India on a daily basis

Eight atmospheric parameters and four measures of air pollution were used as inputs to the proposed twined GBM to analyze the correlation and to forecast the infected, recovered, and death cases of COVID-19 independently. A total of 9,093 instances (21 states/places × 433 days), collected between 14th March 2020 and 20th May 2021, were taken for preprocessing. Of these, 5974 instances with 17 attributes were used for training and the remaining 3119 instances with 17 attributes were used for testing. The performance of the proposed GBM was also evaluated by predicting the COVID-19 cases state-wise. The atmospheric factors and air pollution measures were used as inputs to the GBM simultaneously to check their mutual influence on the cases of COVID-19. Both the minimum and the maximum values of the atmospheric factors (temperature, pressure, humidity, and wind speed) were used as inputs, which helps in better understanding their impact on the distribution of COVID-19 cases. Likewise, to evaluate the impact of air pollution, four measures (maximum and minimum PM10 and PM2.5) were included.
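As a quick arithmetic check of the dataset dimensions, the following Python sketch recomputes the number of days in the collection period, the number of state-day instances, and the resulting train/test shares. The variable names are illustrative only; the figures themselves are those quoted in this section.

```python
from datetime import date

N_STATES = 21    # states/places, as stated in the text
N_TRAIN = 5974   # training instances, as stated in the text
N_TEST = 3119    # testing instances, as stated in the text

# Both end dates are counted, hence the +1.
n_days = (date(2021, 5, 20) - date(2020, 3, 14)).days + 1

total = N_STATES * n_days
# The product of states and days should equal the train and test counts combined.
assert total == N_TRAIN + N_TEST

train_share = N_TRAIN / total
test_share = N_TEST / total
print(f"days={n_days}, total={total}, train={train_share:.1%}, test={test_share:.1%}")
```

The split therefore corresponds to roughly two thirds of the instances for training and one third for testing.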

3 Gradient Boosting Machine (GBM) Approach

The gradient boosting machine (GBM) is an efficient method for regression analysis since it adapts to the characteristics of the dataset during the analysis. The optimal values of the predicted variables are obtained over several iterations by using the values of the dependent variable from the previous iteration and average weights. The GBM approach is implemented using the H2O package in R [33]. The basic steps of the GBM approach are as follows [34]:

Step-I: For k = 1, 2, …, K, initialize \(f_{k0} \left( x \right) = 0\).

Step-II: For m = 1, 2, 3, …, M, compute the class probabilities

$$ p_{k} \left( x \right) = \frac{{e^{{f_{k} \left( x \right)}} }}{{\mathop \sum \nolimits_{l = 1}^{K} e^{{f_{l} \left( x \right)}} }},\;k = 1,\;2,\; \ldots ,\;K $$

Step-III: For k = 1, 2, …, K (within each iteration m), compute the residuals

$$ r_{ikm} = y_{ik} - p_{k} \left( {x_{i} } \right),\;i = 1,\;2, \ldots ,N $$

Fit a regression tree to the targets \(r_{ikm}\), i = 1, 2, …, N, to obtain the terminal regions \(R_{jkm} , j = 1, 2, \ldots ,J_{m}\), then compute the weight of each terminal region and update the model:

$$ \gamma_{jkm} = \frac{K - 1}{K}\left( {\frac{{\mathop \sum \nolimits_{{x_{i} \in R_{jkm} }} r_{ikm} }}{{\mathop \sum \nolimits_{{x_{i} \in R_{jkm} }} \left| {r_{ikm} } \right|\left( {1 - \left| {r_{ikm} } \right|} \right)}}} \right),\;j = 1, 2, \ldots ,J_{m} $$
$$ f_{km} \left( x \right) = f_{k, m - 1} \left( x \right) + \mathop \sum \limits_{j = 1}^{{J_{m} }} \gamma_{jkm} I\left( {x \in R_{jkm} } \right) $$

Output: \(f_{k} \left( x \right) = f_{kM} \left( x \right)\), where k = 1, 2, …, K.
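The quantities used in each boosting iteration (the class probabilities \(p_{k}\), the residuals \(r_{ikm}\), and the weight \(\gamma_{jkm}\) of one terminal region) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the H2O implementation used in the study: the function names are hypothetical, tree fitting is omitted, and returning zero for a degenerate region is an assumed convention.

```python
import math
from typing import List


def class_probabilities(f: List[float]) -> List[float]:
    """Softmax p_k = exp(f_k) / sum_l exp(f_l), shifted by max(f) for stability."""
    shift = max(f)
    exps = [math.exp(v - shift) for v in f]
    total = sum(exps)
    return [e / total for e in exps]


def residuals(y_onehot: List[float], p: List[float]) -> List[float]:
    """Residuals r_k = y_k - p_k for a single observation."""
    return [y - q for y, q in zip(y_onehot, p)]


def region_weight(r_in_region: List[float], n_classes: int) -> float:
    """gamma = (K - 1) / K * sum(r) / sum(|r| * (1 - |r|)) over one terminal region."""
    numerator = sum(r_in_region)
    denominator = sum(abs(r) * (1.0 - abs(r)) for r in r_in_region)
    if denominator == 0.0:
        # Assumed convention for a degenerate region: leave the score unchanged.
        return 0.0
    return (n_classes - 1) / n_classes * numerator / denominator


# First iteration with K = 3 classes: all scores start at zero,
# so every class receives the same probability.
K = 3
p = class_probabilities([0.0] * K)
r = residuals([1.0, 0.0, 0.0], p)  # the observation belongs to the first class
gamma = region_weight([2 / 3, 2 / 3, -1 / 3], K)  # hypothetical region of three residuals
print(p, r, gamma)
```

The update in the last step then adds `gamma` to the score of every observation that falls in the corresponding terminal region.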

An additional classifier can further enhance the performance metrics of the GBM without affecting its overall speed. Such a combination reduces the effort of parameter tuning by providing a parallelizable and distributable feature. Furthermore, it can yield optimal accuracy in big data analysis.

4 Analysis of Experimental Results

4.1 Statistical Analysis of the COVID-19 Dataset

Table 2 summarizes the statistical analysis, using the ANOVA method, of the complete dataset (atmospheric factors, measures of air pollution, and cases of COVID-19). The results indicate that the eight atmospheric factors, four pollution measures, and three COVID-19 parameters are significant for further prediction modeling. Specifically, a P-value of less than 0.05 provides evidence against the null hypothesis for each of the dependent and independent variables. The F value represents the ratio of the variation between sample means to the variation within the samples. Hence, a large value of F indicates greater variation between sample means than within the samples, which also supports rejecting the null hypothesis (Table 2).

Table 2 Statistical analysis of the complete dataset using ANOVA methods
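The F ratio described in this subsection can be illustrated with a small Python sketch of a one-way ANOVA. Treating the analysis as one-way with states as groups is an assumption made here for illustration, and the temperature readings are hypothetical; obtaining the P-value additionally requires the F distribution, which is left out.

```python
from typing import List


def one_way_f(groups: List[List[float]]) -> float:
    """F = (mean square between groups) / (mean square within groups)."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical daily maximum temperatures (in °C) for three states.
groups = [[30.0, 31.0, 32.0], [24.0, 25.0, 26.0], [35.0, 36.0, 37.0]]
f_value = one_way_f(groups)
print(f_value)
```

In this toy example the group means differ far more than the readings within each group, so the F value is large, which is the situation the text associates with rejecting the null hypothesis.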

4.2 Experimental Setup

The GBM models were trained with learning rate = 0.01, sample rate = 0.8, number of trees = 10,000, and folds = 10 on an Intel(R) Core(TM) i7-8565U CPU @ 1.80 GHz 1.99 GHz with 8 GB RAM to obtain the optimal performance.
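For reference, the settings above can be written down as a configuration fragment. The Python sketch below keys them by the H2O GBM parameter names they are assumed to correspond to (`learn_rate`, `sample_rate`, `ntrees`, `nfolds`, `distribution`); the target column names are hypothetical, and the four distributions are those compared in Sect. 4.3. It only enumerates the runs and does not train any model.

```python
from typing import Dict, List, Union

# Hyperparameters quoted in the text, keyed by the H2O GBM parameter names
# they are assumed to map to.
BASE_PARAMS: Dict[str, Union[int, float]] = {
    "learn_rate": 0.01,
    "sample_rate": 0.8,
    "ntrees": 10_000,
    "nfolds": 10,
}

DISTRIBUTIONS: List[str] = ["poisson", "gaussian", "tweedie", "gamma"]
TARGETS: List[str] = ["infected", "recovered", "deaths"]  # hypothetical column names


def build_runs() -> List[Dict[str, Union[int, float, str]]]:
    """Enumerate one configuration per (target, distribution) pair."""
    return [
        dict(BASE_PARAMS, distribution=dist, target=target)
        for target in TARGETS
        for dist in DISTRIBUTIONS
    ]


runs = build_runs()
print(len(runs), runs[0])
```

Each entry could then be passed to the GBM trainer of choice, one model per target and distribution.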

4.3 Gradient Boosting Machine Model Analysis Results

The optimal GBM model was obtained after tuning the parameters, including the distribution function, the learning rate, the number of trees, and the number of folds. Of the seven distributions compared, four result-oriented distribution functions (Poisson, Gaussian, Tweedie, and Gamma) were retained in the GBM; the remaining three (Huber, Laplace, and Quantile) were discarded. The performance of the twinned GBM model with the four retained distribution functions is summarized in Table 3. The performance measures used to evaluate the efficiency of the GBM were the goodness-of-fit measure (R2), root mean square error (RMSE), mean residual deviance (MRD), and mean absolute error (MAE). In training, the best prediction performance of the GBM was achieved with the Poisson distribution (R2 = 0.99) for all three COVID-19 metrics (infected, recovered, and mortality cases), as shown in Table 3. The performance of the GBM model in forecasting the COVID-19 cases of the test dataset is shown in Figs. 4, 5, and 6, which give a detailed performance analysis of the different distribution functions of the GBM for the infected, recovered, and mortality cases, respectively, for the combined dataset of different states/places of India.
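Three of the performance measures named above (R2, RMSE, and MAE) have standard definitions, which the following Python sketch spells out. The daily case counts are hypothetical, and the mean residual deviance is omitted because its form depends on the chosen distribution.

```python
import math
from typing import List


def rmse(actual: List[float], predicted: List[float]) -> float:
    """Root mean square error: sqrt(mean((y - y_hat)^2))."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual: List[float], predicted: List[float]) -> float:
    """Mean absolute error: mean(|y - y_hat|)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual: List[float], predicted: List[float]) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical daily infected counts and model predictions.
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]
print(rmse(actual, predicted), mae(actual, predicted), r_squared(actual, predicted))
```

Because RMSE and MAE are expressed in the units of the target, their values for infected and recovered counts are not directly comparable with those for mortality counts, whereas R2 is unit-free.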

Table 3 Performance metrics of twined GBM in training with combined dataset of India
Fig. 4 Correlative capability of twinned GBM in the training for infected cases of COVID-19 in terms of the combined dataset

Fig. 5 Correlative capability of twinned GBM in the training for recovery cases of COVID-19 in terms of the combined dataset

Fig. 6 Correlative capability of twinned GBM in the training for mortality cases of COVID-19 in terms of the combined dataset

Data from all seven worst-affected states (Maharashtra, Delhi, Karnataka, Kerala, Madhya Pradesh, Uttar Pradesh, and West Bengal) were tested with the twinned GBM using the four most result-oriented distributions (Poisson, Gaussian, Tweedie, and Gamma). Five performance measures (R2, MSE, RMSE, MAE, and MRD) were used to assess the correlation and efficiency of each model. The test performance for the seven states is summarized in Tables 4, 5, 6, 7, 8, 9 and 10 and Figs. 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 and 21, respectively:

Table 4 Performance metrics of twinned GBM in the forecast of infected, recovered and mortality cases of COVID-19 in Maharashtra
Table 5 Performance metrics of twinned GBM in the prediction of COVID-19 in Delhi
Table 6 Performance metrics of twinned GBM in the prediction of infected, recovered and mortality cases of COVID-19 in Karnataka
Table 7 Performance metrics of twinned GBM in the prediction of infected, recovered and mortality cases of COVID-19 in Kerala
Table 8 Performance metrics of twinned GBM in prediction of infected, recovered and mortality cases of COVID-19 in Madhya Pradesh
Table 9 Performance metrics of twinned GBM in the prediction of infection, recovery, and mortality cases of COVID-19 in Uttar Pradesh
Table 10 Performance of twinned GBM in the prediction of infection, recovery, and mortality cases of COVID-19 in West Bengal
Fig. 7 Correlative capability of twined GBM to forecast the infection rate of COVID-19 in Maharashtra

Fig. 8 Correlative capability of twined GBM to forecast the recovery rate of COVID-19 in Maharashtra

Fig. 9 Correlative capability of twined GBM to forecast the mortality rate of COVID-19 in Maharashtra

Fig. 10 Correlative capability of twined GBM to forecast the infection rate of COVID-19 in Delhi

Fig. 11 Correlative capability of twined GBM to forecast the recovery rate of COVID-19 in Delhi

Fig. 12 Correlative capability of twined GBM to forecast the mortality rate of COVID-19 in Delhi

Fig. 13 Correlative capability of twined GBM to forecast the infection rate of COVID-19 in Karnataka

Fig. 14 Correlative capability of twined GBM to forecast the recovery rate of COVID-19 in Karnataka

Fig. 15 Correlative capability of twined GBM to forecast the mortality rate of COVID-19 in Karnataka

Fig. 16 Correlative capability of twined GBM to forecast infection rate of COVID-19 in Kerala

Fig. 17 Correlative capability of twined GBM to forecast recovery rate of COVID-19 in Kerala

Fig. 18 Correlative capability of twined GBM to forecast mortality rate of COVID-19 in Kerala

Fig. 19 Correlative capability of twined GBM to forecast infection rate of COVID-19 in Madhya Pradesh

Fig. 20 Correlative capability of twined GBM to forecast recovery rate of COVID-19 in Madhya Pradesh

Fig. 21 Correlative capability of twined GBM to forecast mortality rate of COVID-19 in Madhya Pradesh

5 Discussion of Results

Tree-based machine learning approaches have shown high accuracy in the analysis of small and big datasets in previous research studies [35, 36]. In the analysis of disease data, the GBM has been used to predict the association of miRNAs [35]. Besides, the improved performance of the GBM in the predictive modeling of the pandemic has been discussed [35]. This is the reason for selecting the GBM model to predict the COVID-19 cases in India using the atmospheric factors and pollution levels. Owing to the large geographical area, there is a huge variation in atmospheric factors (Fig. 1 and Table 1) across the states of India. The pollution levels also vary across states, as is evident from the variation of minimum and maximum PM10 and PM2.5 (Fig. 2 and Table 1). The basic statistics in Table 1 and Fig. 3 demonstrate the variation in the cases of COVID-19 in different states of India. Together, the basic statistics on the atmospheric factors, pollution measures, and cases of COVID-19 suggest that they are unequally distributed.

For infected cases, training the twinned GBM on the combined dataset of significant states of India gives R2 = 0.99 and RMSE = 834.90 with the Poisson distribution, R2 = 0.97 and RMSE = 1527.28 with the Gaussian distribution, R2 = 0.96 and RMSE = 1214.40 with the Tweedie distribution, and R2 = 0.85 and RMSE = 1239.84 with the Gamma distribution. For recovered cases, it gives R2 = 0.99 and RMSE = 712.99 with the Poisson distribution, R2 = 0.98 and RMSE = 1244.66 with the Gaussian distribution, R2 = 0.97 and RMSE = 1052.15 with the Tweedie distribution, and R2 = 0.81 and RMSE = 3272.82 with the Gamma distribution. For mortality cases, it gives R2 = 0.99 and RMSE = 8.49 with the Poisson distribution, R2 = 0.97 and RMSE = 14.64 with the Gaussian distribution, R2 = 0.98 and RMSE = 11.55 with the Tweedie distribution, and R2 = 0.85 and RMSE = 38.20 with the Gamma distribution. The complete performance results for infected, recovered, and mortality cases are presented in Table 3 and Figs. 4, 5, and 6, respectively. The performance of the twined GBM with all four selected distributions (Poisson, Gaussian, Tweedie, and Gamma) is good, which indicates a close correlation among the atmospheric factors, air pollutants, and COVID-19 parameters and justifies further analysis.

The trained model was then applied to the datasets of the seven most affected states of India for testing, to explore the correlation in more depth. First, Maharashtra, one of the worst-affected states, was tested. For infected cases, the results show a convincing correlation, with R2 = 0.90 and RMSE = 5161.50 with the Poisson distribution, R2 = 0.90 and RMSE = 5235.36 with the Gaussian distribution, R2 = 0.88 and RMSE = 5692.20 with the Tweedie distribution, and R2 = 0.78 and RMSE = 7840.69 with the Gamma distribution. For recovery, the results also support the hypothesis, with R2 = 0.87 and RMSE = 5935.77 with the Poisson distribution, R2 = 0.89 and RMSE = 5432.84 with the Gaussian distribution, R2 = 0.85 and RMSE = 6362.20 with the Tweedie distribution, and R2 = 0.71 and RMSE = 8767.26 with the Gamma distribution. For mortality, the results are in line with the same hypothesis, with R2 = 0.84 and RMSE = 86.49 with the Poisson distribution, R2 = 0.88 and RMSE = 75.96 with the Gaussian distribution, R2 = 0.83 and RMSE = 90.38 with the Tweedie distribution, and R2 = 0.65 and RMSE = 130.92 with the Gamma distribution. The complete performance results for Maharashtra are shown in Table 4 and Figs. 7, 8, and 9, respectively.

Second, the model was tested on Delhi, another heavily affected place. For infected cases, the results are R2 = 0.75 and RMSE = 2664.40 with the Poisson distribution, R2 = 0.78 and RMSE = 2724.86 with the Gaussian distribution, R2 = 0.74 and RMSE = 2501.72 with the Tweedie distribution, and R2 = 0.69 and RMSE = 2957.83 with the Gamma distribution. For recovery, the results also support the hypothesis, with R2 = 0.88 and RMSE = 5935.77 with the Poisson distribution, R2 = 0.81 and RMSE = 5432.84 with the Gaussian distribution, R2 = 0.85 and RMSE = 6362.20 with the Tweedie distribution, and R2 = 0.67 and RMSE = 8767.26 with the Gamma distribution. For mortality, the results are in line with the same hypothesis, with R2 = 0.73 and RMSE = 36.07 with the Poisson distribution, R2 = 0.72 and RMSE = 37.08 with the Gaussian distribution, R2 = 0.69 and RMSE = 38.72 with the Tweedie distribution, and R2 = 0.51 and RMSE = 49.22 with the Gamma distribution. The complete performance results for Delhi are shown in Table 5 and Figs. 10, 11, and 12, respectively.

Third, the trained model was applied to the testing dataset of Karnataka. For infected cases, the results are R2 = 0.79 and RMSE = 4456.73 with the Poisson distribution, R2 = 0.84 and RMSE = 3786.50 with the Gaussian distribution, R2 = 0.74 and RMSE = 4945.09 with the Tweedie distribution, and R2 = 0.54 and RMSE = 6606.13 with the Gamma distribution. For recovery, the results are R2 = 0.55 and RMSE = 4969.39 with the Poisson distribution, R2 = 0.63 and RMSE = 4463.39 with the Gaussian distribution, R2 = 0.50 and RMSE = 5250 with the Tweedie distribution, and R2 = 0.31 and RMSE = 6143.52 with the Gamma distribution. For mortality, the results are R2 = 0.64 and RMSE = 56.93 with the Poisson distribution, R2 = 0.71 and RMSE = 51.71 with the Gaussian distribution, R2 = 0.60 and RMSE = 60.03 with the Tweedie distribution, and R2 = 0.39 and RMSE = 74.71 with the Gamma distribution. The complete performance results for Karnataka are shown in Table 6 and Figs. 13, 14, and 15, respectively.

Fourth, the trained model was applied to the testing dataset of Kerala. For infected cases, the results are R2 = 0.76 and RMSE = 3982.07 with the Poisson distribution, R2 = 0.76 and RMSE = 4027.15 with the Gaussian distribution, R2 = 0.74 and RMSE = 4159.93 with the Tweedie distribution, and R2 = 0.59 and RMSE = 5251.54 with the Gamma distribution. For recovery, the results are R2 = 0.47 and RMSE = 5895.52 with the Poisson distribution, R2 = 0.56 and RMSE = 5319.29 with the Gaussian distribution, R2 = 0.43 and RMSE = 6212.97 with the Tweedie distribution, and R2 = 0.19 and RMSE = 6835.58 with the Gamma distribution. For mortality, the results are R2 = 0.59 and RMSE = 11.37 with the Poisson distribution, R2 = 0.37 and RMSE = 14.09 with the Gaussian distribution, R2 = 0.58 and RMSE = 11.42 with the Tweedie distribution, and R2 = 0.46 and RMSE = 13.08 with the Gamma distribution. The complete performance results for Kerala are shown in Table 7 and Figs. 16, 17, and 18, respectively.

Fifth, the trained model was applied to the testing dataset of Madhya Pradesh. For infected cases, the results are R2 = 0.87 and RMSE = 1048.39 with the Poisson distribution, R2 = 0.80 and RMSE = 1317.66 with the Gaussian distribution, R2 = 0.86 and RMSE = 1109.07 with the Tweedie distribution, and R2 = 0.59 and RMSE = 1481.84 with the Gamma distribution. For recovery, the results also support the hypothesis, with R2 = 0.88 and RMSE = 5895.52 with the Poisson distribution, R2 = 0.81 and RMSE = 5319.29 with the Gaussian distribution, R2 = 0.85 and RMSE = 6212.97 with the Tweedie distribution, and R2 = 0.67 and RMSE = 6835.58 with the Gamma distribution. For mortality, the results are in line with the same hypothesis, with R2 = 0.84 and RMSE = 8.80 with the Poisson distribution, R2 = 0.74 and RMSE = 11.29 with the Gaussian distribution, R2 = 0.84 and RMSE = 8.70 with the Tweedie distribution, and R2 = 0.65 and RMSE = 13.10 with the Gamma distribution. The complete performance results for Madhya Pradesh are shown in Table 8 and Figs. 19, 20, and 21, respectively.

Sixth, the trained model was applied to the testing dataset of Uttar Pradesh. For infected cases, the results are R2 = 0.79 and RMSE = 3552.24 with the Poisson distribution, R2 = 0.80 and RMSE = 3225.86 with the Gaussian distribution, R2 = 0.75 and RMSE = 3616.14 with the Tweedie distribution, and R2 = 0.67 and RMSE = 4131.75 with the Gamma distribution. For recovery, the results also support the hypothesis, with R2 = 0.88 and RMSE = 967.78 with the Poisson distribution, R2 = 0.81 and RMSE = 1208.69 with the Gaussian distribution, R2 = 0.85 and RMSE = 1068.95 with the Tweedie distribution, and R2 = 0.67 and RMSE = 1588.11 with the Gamma distribution. For mortality, the results are in line with the same hypothesis, with R2 = 0.73 and RMSE = 36.07 with the Poisson distribution, R2 = 0.72 and RMSE = 37.08 with the Gaussian distribution, R2 = 0.69 and RMSE = 38.72 with the Tweedie distribution, and R2 = 0.51 and RMSE = 49.22 with the Gamma distribution. The complete performance results for Uttar Pradesh are shown in Table 9 and Figs. 22, 23, and 24, respectively.

Fig. 22 Correlative capability of twined GBM to forecast infection rate of COVID-19 in Uttar Pradesh

Fig. 23 Correlative capability of twined GBM to forecast recovery rate of COVID-19 in Uttar Pradesh

Fig. 24 Correlative capability of twined GBM to forecast mortality rate of COVID-19 in Uttar Pradesh

Seventh, the trained model was applied to the testing dataset of West Bengal. For infected cases, the results are R2 = 0.79 and RMSE = 2012.91 with the Poisson distribution, R2 = 0.78 and RMSE = 2066.74 with the Gaussian distribution, R2 = 0.64 and RMSE = 2640.02 with the Tweedie distribution, and R2 = 0.68 and RMSE = 2471.27 with the Gamma distribution. For recovery, the results also support the hypothesis, with R2 = 0.80 and RMSE = 1723.12 with the Poisson distribution, R2 = 0.72 and RMSE = 2053.87 with the Gaussian distribution, R2 = 0.78 and RMSE = 1825.22 with the Tweedie distribution, and R2 = 0.60 and RMSE = 2475.82 with the Gamma distribution. For mortality, the results are in line with the same hypothesis, with R2 = 0.78 and RMSE = 14.81 with the Poisson distribution, R2 = 0.63 and RMSE = 19.17 with the Gaussian distribution, R2 = 0.75 and RMSE = 15.64 with the Tweedie distribution, and R2 = 0.57 and RMSE = 20.60 with the Gamma distribution. The complete performance results for West Bengal are shown in Table 10 and Figs. 25, 26, and 27, respectively.

Fig. 25 Correlative capability of twined GBM to forecast infection rate of COVID-19 in West Bengal

Fig. 26 Correlative capability of twined GBM to forecast recovery rate of COVID-19 in West Bengal

Fig. 27 Correlative capability of twined GBM to forecast mortality rate of COVID-19 in West Bengal

The performance measures discussed above, together with the remaining measures shown in Tables 4, 5, 6, 7, 8, 9 and 10 and Figs. 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26 and 27, suggest that Maharashtra had an ideal atmosphere for infection, recovery, and mortality, with R2 = 0.99 in all three with the Poisson distribution. For Delhi, the model does not perform as well on the infection and recovery rates, but it supports the mortality rate; the maximum performance was given by the Gaussian distribution, with R2 = 0.78 for the infection rate, R2 = 0.78 for the recovery rate, and R2 = 0.83 for the mortality rate. For Karnataka, the Gaussian distribution gives R2 = 0.84 for the infection rate, R2 = 0.63 for recovery, and R2 = 0.71 for mortality. For Kerala, the infection rate (R2 = 0.71) and recovery rate (R2 = 0.56) given by the Gaussian distribution and the mortality rate (R2 = 0.59) given by the Poisson distribution do not support the hypothesis, possibly because the correct atmospheric or pollution data were unavailable or missing. For Madhya Pradesh, the maximum values for the infection, recovery, and mortality rates are R2 = 0.87, R2 = 0.88, and R2 = 0.84, respectively, with the Poisson distribution. For Uttar Pradesh, the maximum values for the infection, recovery, and mortality rates are R2 = 0.80, R2 = 0.88, and R2 = 0.73, respectively, with the Poisson distribution. For West Bengal, the maximum values for the infection, recovery, and mortality rates are R2 = 0.79, R2 = 0.80, and R2 = 0.78, respectively, with the Poisson distribution.

According to the testing performance, the states rank as follows for each COVID-19 parameter:

Infection Rate: Maharashtra > Madhya Pradesh > Uttar Pradesh > West Bengal > Karnataka > Delhi > Kerala.

Recovery Rate: Maharashtra > Madhya Pradesh > Uttar Pradesh > West Bengal > Karnataka > Kerala.

Mortality Rate: Maharashtra > Madhya Pradesh > Delhi > West Bengal > Uttar Pradesh > Delhi > Karnataka > Kerala.

The adverse effect of weather parameters such as temperature and humidity on the cases of COVID-19 has been reported in some recently published research, for example a high spread rate at low temperature and humidity in Iran [11], a low spread rate at high humidity and temperature in China [16], and a low spread rate at high average humidity and temperature [15]. The impact of additional atmospheric factors such as air pressure and wind speed has not been properly examined in recent studies. A positive correlation between air pollution and the cases of COVID-19 has been established in some studies, for example between air pollution and the spread rate in Italy and China [7, 20, 21]. Moreover, atmospheric factors and air pollution levels are also related; therefore, the present study explored their combined effect on the rate of spread of COVID-19 in major states/places of India using the twinned GBM model. It was noticed that states with lower mean temperature, humidity, and air pollution, such as Uttarakhand, Arunachal Pradesh, Himachal Pradesh, Sikkim, and Mizoram, have fewer infected and mortality cases and more recovered cases than states/places with high mean temperature, humidity, and air pollution, such as Maharashtra, Delhi, Karnataka, Kerala, and Madhya Pradesh. However, in some states, it is still difficult to understand the correlation between the spread rate of COVID-19, atmospheric factors, and air pollution measures. The collected data and the results obtained with the different distributions of the GBM suggest a significant correlation between the spread rate of COVID-19, atmospheric factors, and air pollution measures in most of the states of India. Besides, the high population density of some states, people's behavior towards government regulations, the movement of migrant workers, and social gatherings during the lockdown period are also factors responsible for the spread of COVID-19.

Maharashtra, Delhi, Kerala, Karnataka, Madhya Pradesh, Uttar Pradesh, and West Bengal were more severely affected than other states of India. The numbers of infected cases in Maharashtra, Madhya Pradesh, and Uttar Pradesh predicted by the different distributions of the GBM closely match their actual values on most days (Figs. 19, 20, 21, 22, 23 and 24). Therefore, Maharashtra was the ideal place for spread and mortality. Missing information on the atmospheric factors, air pollution measures, and cases of COVID-19 during the data collection period may be one reason for the average and poor forecast metrics of the different distributions of the GBM for some states.

6 Conclusions and Future Research Scope

This paper presents the correlation between atmospheric factors, air pollution measures, and the infection, recovery, and mortality rates of COVID-19 in the significant states/places of India. The paper proposed a twin GBM model to capture the deep and intrinsic nature of the different datasets. The experimental results confirm that the improved GBM model is able to determine the correlation among atmospheric parameters, air pollution measures, and COVID-19 impact (infection, recovery, and mortality rates) in the aggregate dataset of different states/places of India. The enhanced performance metrics (R2 and the different error measures) of the improved GBM establish a convincing association of the transmission rates of COVID-19 with air pollution measures and atmospheric factors. Particularly in states such as Maharashtra, Delhi, Karnataka, Kerala, Madhya Pradesh, Uttar Pradesh, and West Bengal, where the maximum number of COVID-19 cases has been reported, the air pollution measures and atmospheric factors play a significant role in the spread of the pandemic. Future research will focus on improving the state-wise prediction efficiency of COVID-19 cases by considering more weather parameters and atmospheric pollutants.