1 Introduction

In December 2019, the SARS-CoV-2-induced Coronavirus Disease (COVID-19) began spreading across the globe, infecting countless individuals and causing severe health problems. On March 11, 2020, the World Health Organization (WHO) declared the disease a pandemic (Organization WH 2019), as the virus is highly contagious and pathogenic. The infection severely affects an individual's respiratory system, and because it is readily transmissible among humans it escalated into a global pandemic. It spreads through liquid particles expelled from an infected person's mouth or nose while coughing, sneezing, speaking, singing, or breathing; these particles range from larger respiratory droplets to fine aerosols. The Coronavirus spread rapidly across all regions of the world. Several vaccines were evaluated by drug regulatory authorities worldwide to curb the epidemic and reduce the risk of contracting the disease. As of November 15, 2021, WHO had approved the following vaccines that fulfilled the requirements for safety and effectiveness: AstraZeneca/Oxford Vaccine/Covishield, Johnson & Johnson, Moderna, Pfizer, Sinopharm, Sinovac, and COVAXIN (WHO 2020). Although vaccines act as a protective barrier against symptoms, they have a few adverse effects as well; the most common are arm soreness, mild fever, tiredness, headaches, and muscle or joint aches.

Estimating future values is a vital part of data science and automation technologies that use historical data to develop a model; future values can then be extrapolated from the model. Epidemiological data are collected periodically (e.g., daily, weekly, or monthly), so they constitute time-series data: a series of time-ordered data points associated with one or more time-dependent variables. By analyzing a time series, it is possible to study the nature of time-dependent data and predict future values from the past variability of the data. Machine learning (ML) approaches are considered effective in forecasting time-series data. These approaches can assist in rapidly identifying potential cases and fatalities. Further, they can help estimate recorded incidences with a high risk of pathogenic virus transmission and monitor their outbreak. These algorithms can process data on Coronavirus patients, giving clinicians more time and assurance while treating a critical illness (Tuli et al. 2020). ML has been considered one of the most promising computing approaches, with significant potential for epidemic forecasting. Several recent studies have highlighted the tremendous potential of ML algorithms to combat pathogenic viruses (Alimadadi et al. 2020; Ardabili et al. 2020; Miralles-Pechuán et al. 2020). ML algorithms have been used efficiently for mitigation and prevention, including the identification of new pathogens, classification of novel pathogens, diagnosis, survival prediction, and intensive care unit (ICU) demand prediction (Randhawa et al. 2020; Rao and Vazquez 2020; Yan et al. 2020; Grasselli et al. 2020). Past research utilized statistical and deep learning models such as vector autoregression (VAR) and long short-term memory (LSTM) to evaluate and forecast the dynamic trajectory of the epidemic. Modeling frameworks employing machine learning or deep learning methods can exploit the adaptability of these analytical methodologies to forecast temporal dynamics.

A predictive model can help estimate the disease's probable trajectory based on features (feedback data). These data can be used to forecast factors such as the number of new cases and fatalities and to analyze the severity of the outbreak. One widely used approach to examining COVID-19 dissemination is the susceptible-exposed-infected-recovered (SEIR) model. As an ordinary differential equation-based model, it struggles to account for geographical processes and spatial heterogeneity. At the same time, human movement patterns today are more observable and better understood than ever, so failing to incorporate modern movement patterns may lead to improbable estimations. This study utilizes state-of-the-art deep learning and statistical models, including VAR and LSTM, together with an enhanced version of the SEIR model, to improve forecasting accuracy. The sequential SEIR model has been extended into the susceptible-exposed-infected-recovered-hospitalized-death-quarantined-vaccination (SEIR-HDQV) model for effectively estimating fatalities and incidences. These models are employed to anticipate the recorded incidences and casualties caused by the hazardous virus over the next 24 h.

Viruses and climate variability affect marginalized communities disproportionately. A variety of factors can influence COVID-19 virus dissemination, including climate conditions. The epidemiological dynamics of this kind of infectious disease may be altered by environmental exposures, such as short- and long-term climatic variations. The pace of Coronavirus dissemination varied from country to country during the epidemic, and one possible factor is the weather. A few studies found that the incubation period of this disease, along with its spatial distribution, was significantly affected by climate and weather conditions (Mao et al. 2022; Karim and Akter 2022). Temperature and humidity vary widely across nations, so it is inappropriate to treat climate conditions as similar everywhere. Thus, the study focuses on the meteorological conditions in the selected country over the span of the Coronavirus outbreak. Accordingly, meteorological data have been evaluated to assess the impact of climate on recorded incidences in India during the epidemic. The main contributions of this paper are as follows:

  • It presents a framework utilizing time series forecasting models to forecast the new incidences and fatalities due to the contagious virus for the next 24 h.

  • This paper examines the impact of temperature and humidity on the dissemination of COVID-19 in India under various climate conditions.

  • It extends the compartmental SEIR model, incorporating an individual's vulnerability to the hazardous pathogen, for effective prediction of incidences and fatalities.

  • This paper includes data on the eight countries most severely affected by the COVID-19 pandemic.

  • This paper analyzes the fatalities associated with Coronavirus before and after the vaccine became available for the disease.

The rest of the paper is organized as follows: A literature survey is presented in Sect. 2. Section 3 explains the methodology and implementation of the forecasting models. The model evaluation is presented in Sect. 4. Section 5 provides results and discussion and analyzes the effect of climate during the outbreak of Coronavirus. Section 6 contains the conclusion and future directions.

2 Literature survey

Time-series data have been widely utilized in several application areas, including weather forecasting (Wu et al. 2020), earthquake prediction (Xue et al. 2021), signal processing (Wang et al. 2023), pattern recognition (Wu et al. 2023), and other domains. The COVID-19 outbreak has been studied through several neural network-based, quantitative, and time series methods to anticipate infection incidence, fatalities, and evolution. Many unknown aspects condition the current pandemic's expansion, including the virus's unique physiology, human behavior, and differing national policies. The studies (Wu et al. 2020; Irfan et al. 2022) highlight the effects of temperature and humidity on the spread of the pandemic across lower and higher quantiles. The influence of weather on Coronavirus has been explored in studies (Gupta et al. 2020; Mousavi et al. 2020; Singh et al. 2023; Wang et al. 2020; Auler et al. 2020), which mainly consider temperature, relative humidity, wind speed, rainfall, solar irradiation, transmission rate, daily new confirmed cases, and mortality rate. The number of incidences caused by the novel Coronavirus is found to be correlated with humidity and temperature. In tropical nations, temperature has a minimal impact on Coronavirus case-to-mortality ratios. Mohammadi et al. (2020) also examined the association between weather, the spread of Coronavirus, and the number of fatalities across several states of the USA. Rashed et al. (2020) analyzed the spread of the pathogenic virus using multivariate analysis based on ambient temperature, relative humidity, and population density.

The related vaccine is one of the most effective weapons in the fight against the pathogenic virus. Vaccination induces antibodies in humans that are strong enough to stop the disease from spreading. Ong et al. (2020) developed promising vaccine candidates utilizing reverse vaccinology and machine learning techniques. Reverse vaccinology (RV) attempts to find potential vaccine candidates via genetic analysis and has revolutionized vaccine development. Vaccines that comprise the complete virus can elicit immunity and defend against infection. Cotfas et al. (2021) studied public sentiment on vaccination during the period between the first vaccine announcement and the first vaccination in the United Kingdom, throughout which civil society showed an enormous focus on the vaccination drive. Liu et al. (2021) concentrated on numerous vaccine hesitancy analyses and news reports. They presented a comparative study of three classifiers: the Naive Bayes classifier, the support vector machine (SVM), and logistic regression (LR). SVM with term frequency-inverse document frequency (TF-IDF) and the Synthetic Minority Oversampling Technique (SMOTE) performed best among all. The accuracy of SVM and LR across the 12 classes is fairly stable, whereas the accuracy of Naive Bayes fluctuates substantially.

Sadik et al. (2020) analyzed different methods for forecasting the viral outbreak in Bangladesh. They used the Susceptible, Infected, and Recovered (SIR) model to predict the pandemic; the model's outcome is inadequate for long-term prediction due to the inconsistency of the affecting factors. Furthermore, they utilized three machine learning models, Polynomial Regression (PR), LSTM, and Multilayer Perceptron, to predict the number of infections, deaths, and recoveries. Rauf et al. (2021) discussed an optimized LSTM model to forecast confirmed cases of the pathogenic virus based on mean absolute error (MAE). They compared the recurrent neural network (RNN), non-optimized LSTM, gated recurrent unit (GRU), and recent state-of-the-art algorithms; the LSTM models outperformed the other algorithms in terms of accuracy. Shastri et al. (2021) studied optimized deep learning ensemble models to analyze confirmed and death cases in India; the mean absolute percentage error (MAPE) values for reported incidences and fatalities are 2.40 and 1.11, respectively. Mishra et al. (2022) analyzed the number of fatalities against the regular growth of Coronavirus-infected individuals during the epidemic, including the days when a vaccine was available, by employing a deep learning approach. Using another machine learning approach, Agarwal and Dutta (2022) analyzed vaccination, predicting a mortality rate of 15.53% and a reduction in confirmed cases of 24.67%.

With the help of extreme learning machines (ELMs) and the Chimp optimization algorithm, Hu et al. (2021) and Cai et al. (2023) developed real-time COVID-19 diagnosis based on chest X-ray images. They categorize the chest X-ray images in two steps: a deep CNN first extracts features, and ELMs then determine the diagnosis. Saffari et al. (2022) utilized artificial intelligence to detect COVID-19 from X-ray images. They employed the whale optimization algorithm (WOA) within a fuzzy system for training a deep convolutional neural network (DCNN); DCNN with particle swarm optimization, DCNN with a genetic algorithm, and the LeNet-5 benchmark model were employed for comparison. Ustebay et al. (2023) described prognostic and diagnostic paradigms of COVID-19 to support clinical decision-making. They used eight ML algorithms and reported that the extra trees and CatBoost classifiers outperformed the other studied models. Subudhi et al. (2021) assessed the effectiveness of eighteen ML algorithms for forecasting ICU admission and mortality among COVID-19-infected individuals; ensemble-based models were found to predict COVID-19 mortality more accurately than the other models. Dietterich (1998) examined statistical tests for comparing supervised learning algorithms. The study by Xing et al. (2022a) proposes a robust semi-supervised time-series classification (TSC) approach with self-distillation, a hybridization of supervised, unsupervised, and self-distillation (SD) techniques. An efficient federated distillation learning system (EFDLS) for multitask TSC is presented by Xing et al. (2022b), which uses a central server to enable numerous mobile users to carry out various TSC tasks. Xiao et al. (2021) suggested a robust temporal feature network (RTFN) with an LSTM-based attention network (LSTMaN); the RTFN-based frameworks perform better for both supervised and unsupervised learning.

Existing statistical epidemiological forecasting models, such as SIR and SEIR, rely on a limited set of features to predict new incidences and mortality. For better prediction of a pandemic, additional factors such as hospitalization, quarantine, and symptomatic and asymptomatic incidences may be included as features. According to the literature survey, deep learning models are the most effective at forecasting epidemiology, and an updated feature set can further refine them for more accurate forecasting. Moreover, because countries cover large geographical areas, temperature and humidity vary with location, and no single index represents the entire country. Studies examining the climate impact on recorded COVID-19 incidences typically used only one region, so they may not account for variations in temperature and humidity between locations.

Fig. 1: Proposed methodology

3 Methodology

The studies were carried out in several steps involving short-term forecasts of recorded incidences (cases) and fatalities, the impact of vaccination on mortality, and climate effects on the spread of the pathogenic virus. The OWID-COVID dataset contains information on the COVID-19 epidemic, including incidences, fatalities, hospitalizations, vaccinations, and so forth. The data are recorded at regular (daily) intervals, making them a time series. This time series dataset is used for forecasting new cases, fatalities, and vaccination impacts.

The proposed methodology workflow is depicted in Fig. 1. A time series must first be transformed into a supervised ML problem before it can be estimated. In the ML approach, raw data must be pre-processed before being used for model training. During the pre-processing step, missing values are eliminated and regular observations are arranged into a single vector. Correlation matrix analysis and feature selection are then performed as part of the pre-processing step to prepare the data for training. Principal component analysis (PCA) is applied to the dataset to map the features onto a lower dimension, while normalization and smoothing are employed to transform the data. The time series data is shifted using the appropriate lag values to make the dataset suitable for forecasting with supervised learning algorithms. The dataset is split into 75% for model training and 25% for testing. In the statistical and deep learning categories, VAR and LSTM are regarded as state-of-the-art time series forecasting models and are utilized for effective time series forecasting. The compartmental SEIR model has been enhanced and transformed into the SEIR-HDQV model for efficient Coronavirus outbreak prediction. Multiple evaluation parameters are employed to evaluate the forecasting models against the testing data, including root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). Further, the OWID-COVID dataset is combined with a climate dataset to analyze the effects of temperature and humidity on the spread of the Coronavirus. The subsequent sections provide a detailed explanation of the adopted methodology.
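To make the lag-shift conversion and the 75%/25% split concrete, a minimal Python sketch is given below; the file name, the choice of India, the `new_cases` column, and the thirty-day lag window are illustrative assumptions rather than the exact pipeline.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def make_supervised(series: pd.Series, n_lags: int = 30) -> pd.DataFrame:
    """Shift a univariate series so each row holds the previous n_lags
    observations as inputs and the current observation as the target."""
    frame = pd.concat(
        [series.shift(i) for i in range(n_lags, 0, -1)] + [series], axis=1)
    frame.columns = [f"lag_{i}" for i in range(n_lags, 0, -1)] + ["target"]
    return frame.dropna()            # drop rows left incomplete by the shift

# Assumed file/column names; OWID data has one row per country per day.
df = pd.read_csv("owid-covid-data.csv", parse_dates=["date"])
cases = (df[df["location"] == "India"]
         .set_index("date")["new_cases"]
         .interpolate())             # simple missing-value handling

scaled = pd.Series(MinMaxScaler().fit_transform(
    cases.values.reshape(-1, 1)).ravel(), index=cases.index)

supervised = make_supervised(scaled, n_lags=30)
split = int(len(supervised) * 0.75)  # 75% training, 25% testing
train, test = supervised.iloc[:split], supervised.iloc[split:]
```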

3.1 Pre-processing

Trend/Seasonality Removal: The trend is the part of a time series that depicts low-frequency fluctuations after high- and medium-frequency variations are removed (Maurya and Singh 2020). Seasonality represents the property of a time series during the epidemic in which the data show predictable, recurring variations throughout the outbreak. A differencing technique can be applied to remove trend and seasonality from a time series. By introducing the lag-\({\mathcal {H}}\) difference operator \(\nabla _ {\mathcal {H}}\), seasonality of period \({\mathcal {H}}\) can also be handled for non-seasonal models; the operator is defined as:

$$\begin{aligned} \nabla _{\mathcal {H}} {\mathcal {Y}}_t = {\mathcal {Y}}_t - {\mathcal {Y}}_{t-{\mathcal {H}}} \end{aligned}$$

Applying the operator \(\nabla _ {\mathcal {H}}\) to the model

$$\begin{aligned} {\mathcal {Y}}_t = {\mathcal {N}}_t + {\mathcal {R}}_t + {\mathcal {X}}_t \end{aligned}$$

where \({\mathcal {R}}_t\) is the seasonal component with period \({\mathcal {H}}\), we get the equation:

$$\begin{aligned} \nabla _ {\mathcal {H}} {\mathcal {Y}}_t = {\mathcal {N}}_t - {\mathcal {N}}_{t- {\mathcal {H}}} + {\mathcal {X}}_t - {\mathcal {X}}_{t-{\mathcal {H}}} \end{aligned}$$

which decomposes the difference \(\nabla _ {\mathcal {H}} {\mathcal {Y}}_t\) into a trend component \(({\mathcal {N}}_t - {\mathcal {N}}_{t-{\mathcal {H}}})\) and a noise term \(({\mathcal {X}}_t - {\mathcal {X}}_{t- {\mathcal {H}}})\), thereby removing the seasonality.
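As a minimal illustration of the lag-\({\mathcal {H}}\) differencing above, the sketch below applies it to a synthetic daily series with a weekly cycle; the period and data are assumptions used only for demonstration.

```python
import numpy as np
import pandas as pd

def seasonal_difference(y: pd.Series, period: int) -> pd.Series:
    """Lag-H differencing: returns y_t - y_{t-H}, cancelling a seasonal
    component of period H (and differencing the trend term)."""
    return (y - y.shift(period)).dropna()

# Toy series with an upward trend and a weekly (period-7) cycle.
t = np.arange(120)
y = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 7)
              + np.random.normal(0, 1, t.size))

weekly_removed = seasonal_difference(y, period=7)          # removes the weekly cycle
detrended = seasonal_difference(weekly_removed, period=1)  # removes the residual trend
```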

3.2 Feature selection

Irrelevant features of a dataset should be removed to minimize the computational complexity of modeling, which may also enhance the model's performance. Domain knowledge can be used to eliminate irrelevant features, and the feature set can then be further reduced using dimensionality reduction techniques. PCA (Wold et al. 1987) is an efficient method for dimensionality reduction. This algorithm converts dataset features into principal components with linearly uncorrelated characteristics. It uses eigenvalues to compress the dimension of the instances while preserving crucial information (Johnstone 2001). The orthogonal linear transformation maps the features to a new coordinate system such that the greatest variance falls on the first coordinate (first principal component), the second-greatest variance on the second coordinate (second principal component), and so on. Because the scale of the variables affects the output of PCA, the data are standardized first. The algorithm proceeds in multiple steps:

Step 1 - Standardize the range of the continuous initial variables so that each one contributes equally to the analysis.

$$\begin{aligned} {\mathcal {S}}& = \frac{Feature \, Value \mathcal {(F)} - Mean \mathcal {(M)} }{Standard\, Deviation (\sigma )} \end{aligned}$$

Step 2 - Determine the correlations by calculating the covariance matrix.

$$\begin{aligned}&\begin{pmatrix} Var(Y_1) & \cdots & Cov(Y_1, Y_p)\\ \vdots & \ddots & \vdots \\ Cov(Y_p,Y_1) & \cdots & Var(Y_p) \end{pmatrix} \\&Var(Y_i) = \sum _{k=1}^{p} \sum _{l=1}^{p} e_{ik} e_{il} \sigma _{kl}\\&Cov(Y_{i-1}, Y_i) = \frac{1}{{\mathcal {N}}} \sum _{i=1}^{p} (Y_{i-1} - {\mathcal {M}}) (Y_i - {\mathcal {M}}) \end{aligned}$$

Here, \(Y_i\) is the random function (a predicted form of \((X_1, X_2,\ldots,X_p)\)) of the OWID-COVID instances, and \(e_{ip}\) is the regression coefficient, whereas \({\mathcal {M}}\) is the mean of the instances. \({\mathcal {N}}\) represents the total number of instances of the dataset.

Step 3 - Compute the eigenvalues and eigenvectors of the given covariance matrix to identify the principal components of the Coronavirus instances.

$$\begin{aligned} Cov(Y_i, Y_j) = \sum _{k=1}^{p} \sum _{l=1}^{p} e_{ik} e_{jl} \sigma _{kl} \end{aligned}$$

Step 4 - Formulate a feature vector to decide which principal components to keep.

Fig. 2: Feature analysis of OWID-COVID dataset using PCA

$$\begin{aligned} Cov(Y_{i-1},Y_i) = \sum _{k=1}^{p} \sum _{l=1}^{p} e_{(i-1)k} e_{il} \sigma _{kl} \end{aligned}$$

Figure 2 presents the PCA analysis of the OWID-COVID dataset. Based on domain knowledge, fourteen features are considered for further experiments. Additionally, PCA is applied to the selected features, and rigorous experimentation shows that five principal components are sufficient to make the classes distinct and separable. In the first principal component, the variance of new vaccinations, total vaccinations, people vaccinated, and people fully vaccinated is highest. It shows a positive correlation between total deaths, ICU patients, and hospitalized patients, while the positive rate, excess mortality, and cardiovascular death rate have low variances. The second principal component has higher variance for the aged-65-older and aged-70-older features. It shows that total vaccinations, people vaccinated, and people fully vaccinated are positively correlated, while excess mortality and cardiovascular death rate are strongly negatively correlated.
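A minimal sketch of this step is shown below, assuming a fourteen-column subset of the OWID-COVID data (the listed feature names are illustrative of, not necessarily identical to, the set chosen by domain knowledge); the columns are standardized before projecting onto five principal components because PCA is scale-sensitive.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumed fourteen-feature subset of the OWID-COVID columns.
features = [
    "new_cases", "new_deaths", "total_deaths", "icu_patients", "hosp_patients",
    "new_vaccinations", "total_vaccinations", "people_vaccinated",
    "people_fully_vaccinated", "positive_rate", "excess_mortality",
    "cardiovasc_death_rate", "aged_65_older", "aged_70_older",
]

df = pd.read_csv("owid-covid-data.csv")
X = df[features].dropna()

X_std = StandardScaler().fit_transform(X)      # PCA is sensitive to scale
pca = PCA(n_components=5)
components = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)           # variance captured per component
loadings = pd.DataFrame(pca.components_, columns=features)  # feature loadings
```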

A correlation heatmap illustrates the relationships between variables by visualizing a correlation matrix. The correlation between the variables on each axis is shown in each square and varies from -1 to 1. The entire dataset is examined to see whether the variables are related. By building a heatmap portraying the distribution of various factors (such as diabetes, new cases, new fatalities, and so on) around the world, the correlation coefficients have been ascertained. In Fig. 3, a strong positive association is observed: new cases per week are strongly associated with new deaths and positively correlated with new cases. The number of newly reported cases per million, however, is negatively associated with overall fatalities and inversely related to new deaths and the number of deaths per week.
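A heatmap of this kind can be generated with a few lines, as sketched below; the column subset is an illustrative assumption, not the exact set plotted in Fig. 3.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("owid-covid-data.csv")
cols = ["new_cases", "new_deaths", "new_cases_per_million",
        "total_deaths", "diabetes_prevalence"]            # assumed subset

corr = df[cols].corr()                                    # Pearson, in [-1, 1]
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix among selected features")
plt.tight_layout()
plt.show()
```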

Fig. 3: Correlation matrix among features

3.3 Forecasting models

A time series \(\{Y_t \vert t \in T\}\) can be used to examine the outbreak for the OWID-COVID instances collected over a time interval T, i.e., the set of Coronavirus instances ordered through time. T denotes the index set of the dataset, which is discrete and evenly spaced in time. The random variable \(Y_t\) represents the dataset features at any time t. Let \(i\in {\mathbb {N}}\), \(T \subseteq {\mathbb {R}}\); a function \(y: T \rightarrow {\mathbb {R}}^{i}\), \(t \xrightarrow {y_t} {\mathbb {R}}^{i}\), or, equivalently, a set of indexed elements of \({\mathbb {R}}^{i}\),

$$\begin{aligned} \{y_{t} \vert y_{t} \in {\mathbb {R}}^i, t \in T\} \end{aligned}$$

is an observed time series. The mean function of the features is defined as:

$$\begin{aligned} \mu _{t} = E[Y_t], \forall t \in T. \end{aligned}$$

For a time series process \((Y_t)\), the variance function of the features is defined, \(\forall t \in T\), as:

$$\begin{aligned} \sigma _{t}^2 = Var [Y_t] = E[Y_{t}^2] - [E[Y_t]]^2, \forall t \in T \end{aligned}$$

For OWID-COVID instances, we assume that the mean and variance are constant. Therefore, the estimates are:

$$\begin{aligned} {\hat{\mu }}& = \frac{1}{n} \sum _{t=1}^{n} Y_t, \\ {{\hat{\sigma }}}^2& = \frac{1}{n-1} \sum _{t=1}^{n} (Y_t - {\hat{\mu }})^2 \end{aligned}$$

The covariance and correlation functions define the level of dependency between the two features (random variables) \(X_p\) and \(X_q\) to the dataset. Let \(\gamma _{p,q}\), and \(\rho _{p,q}\) be the auto-covariance function (ACVF) and auto-correlated function (ACF) of the dataset, then the time series of the features \(\{X_p, X_q \vert p, q \in T\}\) is defined as:

$$\begin{aligned} \gamma _{p,q}&= Cov[X_p,X_q] = E[(X_p- E[X_p])(X_q - E[X_q])]\\&= E[X_p X_q] - E[X_p]E[X_q] \\ \rho _{p,q}&=Corr[X_p,X_q] = \frac{Cov[X_p,X_q]}{\sqrt{Var[X_p]Var[X_q]}} \end{aligned}$$

For any two sets of features \((r_1, r_2,\ldots, r_n)\) and \((s_1,s_2,\ldots,s_n)\) of a dataset, where n represents the number of instances, the sample covariance and correlation functions are given as:

$$\begin{aligned} {{\hat{\gamma }}}_{r,s}&= \frac{1}{n-1} \sum _{t=1}^{n} (r_t - \bar{r})(s_t - \bar{s}) \end{aligned}$$
(1)
$$\begin{aligned} {{\hat{\rho }}}_{r,s}&= \frac{\sum _{t=1}^{n} (r_t - \bar{r})(s_t - \bar{s})}{\sqrt{ \sum _{t=1}^{n}{(r_t -\bar{r})}^2 \sum _{t=1}^{n}{(s_t - \bar{s})}^2}} \end{aligned}$$
(2)

where \({{\hat{\rho }}}_{r,s}\) is the ACF of the stochastic process of the instances. For time series instances, the ACVF and ACF measure the covariance/correlation between the single time series \((r_1, r_2,\ldots,r_n)\) and itself at different lags. Using Eqs. 1 and 2 at lag 0, \({{\hat{\gamma }}}_{0}\) is the covariance of \((r_1, r_2,\ldots, r_n)\) with itself, and the ACVF of the Coronavirus instances is:

$$\begin{aligned} {{\hat{\gamma }}}_{0}& = \frac{1}{n-1} \sum _{t=1}^{n}(r_t - \bar{r})(r_t - \bar{r}) \\ {{\hat{\gamma }}}_{0}& = \frac{1}{n-1} \sum _{t=1}^{n}(r_t - \bar{r})^2 \end{aligned}$$

Likewise, letting \({{\hat{\rho }}}_{0}\) be the ACF of the OWID-COVID instances at lag 0, the correlation is then

$$\begin{aligned} {{\hat{\rho }}}_{0}&= \frac{\sum _{t=1}^{n} (r_t - \bar{r})(r_t - \bar{r})}{\sqrt{ \sum _{t=1}^{n}{(r_t -\bar{r})}^2 \sum _{t=1}^{n}{(r_t - \bar{r})}^2}}\\ {{\hat{\rho }}}_{0}&= \frac{\sum _{t=1}^{n} (r_t - \bar{r})(r_t - \bar{r})}{ \sum _{t=1}^{n}{(r_t -\bar{r})} \sum _{t=1}^{n}{(r_t - \bar{r})}} = 1 \end{aligned}$$

For OWID-COVID instances, an autocorrelation function (ACF) or partial autocorrelation function (PACF) can be employed to determine the lag values between any two features. PACF and ACF are essential characteristics of the stochastic process \(\{Y_t \vert t \in T\}\) (Analysis 2020). Let \(Y_t\) be the stationary time series and \(Y_{t-h}\) its lagged value at lag h during the Coronavirus outbreak. PACF estimates the degree of correlation between any two instances of the dataset \(Y_t\) and \(Y_{t-h}\) while ignoring the other time lags. The partial correlation between x and \(y_3\), conditioned on the variables \(y_1\) and \(y_2\), can be calculated as follows:

$$\begin{aligned} \frac{Cov(x,y_3\vert y_1, y_2)}{\sqrt{Var(x \vert y_1,y_2) Var(y_3 \vert y_1, y_2)}} \end{aligned}$$

where \(y_1, y_2,\) and \(y_3\) are the regressor variables and x is the response variable. The partial correlation between x and \(y_3\) describes their association after controlling for \(y_1\) and \(y_2\) and indicates how dependent on one another they are. The first-order partial auto-correlation is defined to be equal to the first-order auto-correlation. For lag 2, the PACF between two features is defined as follows:

$$\begin{aligned} \frac{Cov(y_t,y_{t-2} \vert y_{t-1})}{\sqrt{Var(y_t \vert y_{t-1})Var(y_{t-2} \vert y_{t-1})}} \end{aligned}$$
Fig. 4: Correlated values of (a) new cases and (b) new deaths

Figure 4 shows the autocorrelation between the correlated and lag values of the daily reported incidences and fatalities. Both Fig. 4a and b display a strong correlation between the current instance and the previous thirty instances, so thirty can be used as the lag value for further analysis.
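The lag analysis of Fig. 4 can be reproduced with the statsmodels plotting helpers, as in the sketch below; the country and column choice are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

df = pd.read_csv("owid-covid-data.csv", parse_dates=["date"])
series = (df[df["location"] == "India"]
          .set_index("date")["new_cases"]
          .interpolate())

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(series, lags=40, ax=axes[0])    # correlation stays strong out to ~30 lags
plot_pacf(series, lags=40, ax=axes[1])   # partial correlation at each lag
plt.tight_layout()
plt.show()
```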

3.3.1 Stationary test

A time series is referred to as stationary if it has no trend or seasonal effect; summary statistics such as the mean and variance then remain constant over time, making the series easier to predict. A time series \(\{X_t \vert t \in T\}\) is said to be strictly (or strongly) stationary if the distributions of its instances \((X_{t_1},\ldots, X_{t_n})\) and \((X_{t_1 + s},\ldots, X_{t_n +s})\) are the same \(\forall n\) and \(t_1, t_2,\ldots,t_n, s \in T\). A time series \(\{X_t \vert t \in T\}\) is said to be weakly stationary (covariance stationary, or second-order stationary) if: the mean function of the time series is constant and finite, \(\mu _X (t) = \mu < \infty\), \(\forall t\in T\); the variance function is constant and finite, \(Var(X_t) < \infty\), \(\forall t \in T\); and the ACVF and ACF depend only on the lag value. The ACVF and ACF are then given as follows:

$$\begin{aligned} \gamma _{t, t+ \alpha }& = Cov [X_t, X_{t+\alpha }]= \gamma _{\alpha }, \forall t,t+\alpha , \alpha \in T \\ \rho _{t, t+ \alpha }& = Corr [X_t, X_{t+\alpha }] = \rho _{\alpha }, \forall t, t+\alpha , \alpha \in T \end{aligned}$$

An Augmented Dickey-Fuller (ADF) test is performed under the null hypothesis that a unit root exists in a sampled time series (Cheung and Lai 1995). It is employed to determine whether or not a time series sample is a random walk.

$$\begin{aligned} \Delta y_{t } = y_{t } - y_{t-1} = \alpha +\beta t+\gamma y_{t-1} + \epsilon _{t} \end{aligned}$$

where \(\alpha\) is a constant and \(\beta\) is the time trend coefficient. \(y_{t-1}\) represents the value of the time series at lag 1, \(\epsilon _t\) is the error term, and \(\gamma = 0\) indicates a random walk (non-stationary series). An ADF test incorporates higher-order autoregressive terms of the form \(\Delta {\mathcal {Y}}_{t-p}\), where \(p \ge 1\).

$$\begin{aligned} \begin{aligned} \Delta y_{t }&=\alpha +\beta t+\gamma y_{t-1} +\delta _{1}\Delta {\mathcal {Y}}_{t-1} + \delta _{2}\Delta {\mathcal {Y}}_{t-2} + \cdots\\&\quad + \delta _{p}\Delta {\mathcal {Y}}_{t-p} + \epsilon _{t} \end{aligned} \end{aligned}$$

At time \((t-1)\), \(\Delta {\mathcal {Y}}_{t-1}\) is the first-order difference of the series, and \((\delta _1, \delta _2,\ldots, \delta _p)\) are the coefficients of \((\Delta {\mathcal {Y}}_{1}, \Delta {\mathcal {Y}}_2,\ldots, \Delta {\mathcal {Y}}_p)\). An ADF test involves testing a hypothesis (the null and alternate hypotheses) by computing the test statistic and reporting the p-value. The p-value, or probability, measures how likely the null hypothesis is to hold. If the p-value of the ADF test is less than or equal to 0.05, the null hypothesis is rejected and the series is deemed stationary. When the ADF test is applied to the OWID-COVID dataset, the p-value is greater than 0.05, indicating a non-stationary time series. First-order differencing is employed to make the series stationary, which results in a p-value below 0.05.
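The ADF test and the first-order differencing step can be applied as in the following sketch, assuming the same daily case series used elsewhere in this section.

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

def is_stationary(y: pd.Series, alpha: float = 0.05) -> bool:
    """ADF test: reject the unit-root null hypothesis when p-value <= alpha."""
    stat, p_value, *_ = adfuller(y.dropna())
    print(f"ADF statistic = {stat:.3f}, p-value = {p_value:.4f}")
    return p_value <= alpha

df = pd.read_csv("owid-covid-data.csv", parse_dates=["date"])
cases = (df[df["location"] == "India"]
         .set_index("date")["new_cases"]
         .interpolate())

if not is_stationary(cases):              # raw series: typically p > 0.05
    cases_diff = cases.diff().dropna()    # first-order differencing
    is_stationary(cases_diff)             # differenced series: p <= 0.05
```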

3.3.2 SEIR model

The time-dependent susceptible-exposed-infected-recovered (SEIR) model (Ghostine et al. 2021) is an epidemiological model that separates the overall population into four categories to predict epidemic outbreaks. This simple model can be used to forecast the recorded incidences and fatalities due to the spread of the virus. Susceptible \(({\mathcal {S}})\), exposed (E), infected \(({\mathcal {I}})\), and recovered \(({\mathcal {R}})\) are the four compartments of the time-dependent mathematical model. An individual in the infected class can infect others.

Let \({\mathcal {S}}(t)\), E(t), \({\mathcal {I}}(t)\), and \({\mathcal {R}}(t)\) be the fraction of the population for four groups at a time t (Kanpur 2020).

$$\begin{aligned} {\mathcal {S}}(t) + E(t) + {\mathcal {I}}(t) + {\mathcal {R}}(t) = 1 \end{aligned}$$
(3)

On differentiating the above equation with respect to time t, we get,

$$\begin{aligned} \frac{d{\mathcal {S}}}{dt} + \frac{dE}{dt} + \frac{d{\mathcal {I}}}{dt} + \frac{d{\mathcal {R}}}{dt} = 0 \end{aligned}$$
(4)

The fraction of infected individuals in a single day is:

$$\begin{aligned} \frac{d{\mathcal {S}}}{dt} = -\Psi {\mathcal {S}}{\mathcal {I}} \end{aligned}$$
(5)

The interaction between infected and susceptible individuals is represented by \(\Psi\). The rate of recovery is directly proportional to the number of infected individuals:

$$\begin{aligned} \frac{d{\mathcal {R}}}{dt} = \varkappa {\mathcal {I}} \end{aligned}$$
(6)

Here \(\varkappa\) is the proportional constant. From Eq. 4, we get,

$$\begin{aligned} \frac{d{\mathcal {S}}}{dt} + \frac{dE}{dt} + \frac{d{\mathcal {I}}}{dt} + \frac{d{\mathcal {R}}}{dt} = 0 \\ \frac{d{\mathcal {I}}}{dt} = 0 \end{aligned}$$

Since there is no spreading to others at time t, \({\mathcal {I}}(t)\) becomes zero. Putting the above values in Eq. 4, we get,

$$\begin{aligned}{} & {} -\Psi {\mathcal {S}}{\mathcal {I}} + \frac{dE}{dt} + \varphi E + 0 = 0\nonumber \\{} & {} \frac{dE}{dt} = \Psi {\mathcal {S}}{\mathcal {I}} - \varphi E \end{aligned}$$
(7)

where \(\varphi\) emphasizes the association between exposed and infected individuals. Putting all these values in Eq. 4, we get,

$$\begin{aligned} -\Psi {\mathcal {S}}{\mathcal {I}} + \Psi {\mathcal {S}}{\mathcal {I}} - \varphi E + \frac{d{\mathcal {I}}}{dt} + \varkappa {\mathcal {I}} = 0 \nonumber \\ \frac{d{\mathcal {I}}}{dt} = \varphi E - \varkappa {\mathcal {I}} \end{aligned}$$
(8)

The Eqs. 5, 6, 7, and 8 depict the rate of change of susceptible individuals, recovered individuals, exposed individuals, and infected individuals in the overall population.
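A minimal numerical integration of Eqs. 5-8 is sketched below; the rate values and initial fractions are illustrative assumptions rather than fitted estimates.

```python
import numpy as np
from scipy.integrate import odeint

def seir(state, t, psi, phi, kappa):
    """Right-hand side of Eqs. 5-8: psi is the transmission coefficient,
    phi the exposed-to-infected rate, kappa the recovery constant."""
    S, E, I, R = state
    dS = -psi * S * I
    dE = psi * S * I - phi * E
    dI = phi * E - kappa * I
    dR = kappa * I
    return [dS, dE, dI, dR]

# Assumed per-day rates and initial fractions (sum to 1), for illustration only.
psi, phi, kappa = 0.9, 1 / 5.2, 1 / 10
state0 = [0.999, 0.001, 0.0, 0.0]
t = np.linspace(0, 180, 181)                     # simulate 180 days

S, E, I, R = odeint(seir, state0, t, args=(psi, phi, kappa)).T
```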

Fig. 5: Transmission flow of the SEIR-HDQV model

3.3.3 SEIR-HDQV model

We extend the SEIR epidemiological model to nine compartments to simulate the outbreak (Vrabac et al. 2021). Figure 5 illustrates the SEIR-HDQV model's transmission flow of individuals through the pathogenic virus. The stages in Fig. 5 capture an infected case's entire life cycle: prior to infection, throughout infection, and after discharge, i.e., either recovered or deceased. Consequently, every stage in this model describes the behavior of a specific sub-population on a given day at a given time. Let \({\mathcal {S}}(t)\), \({\mathcal {V}}(t)\), E(t), \({\mathcal {I}}^{sym}(t)\), \({\mathcal {I}}^{asym}(t)\), \({\mathcal {Q}}(t)\), \({\mathcal {H}}(t)\), \({\mathcal {R}}(t)\), and \({\mathcal {D}}(t)\) denote the number of susceptible (unvaccinated), susceptible vaccinated, exposed, symptomatic, asymptomatic, quarantined, hospitalized, recovered, and deceased individuals at a time t, respectively. We describe the overall population, represented by \({\mathcal {N}}\), as \({\mathcal {N}}\) = \({\mathcal {S}}(t)\) + E(t) + \({\mathcal {I}}^{sym}(t) + {\mathcal {I}}^{asym}(t)\) + \({\mathcal {Q}}(t) + {\mathcal {H}}(t) + {\mathcal {R}}(t) + {\mathcal {V}}(t) + {\mathcal {D}}(t)\), based on the state definitions above, at a time t. The nonlinear differential equations below follow the transmission model from Fig. 5:

$$\begin{aligned} \frac{d{\mathcal {S}}}{dt}&= -\frac{\Theta }{{\mathcal {N}}} [{\mathcal {I}}^{sym}(t) + {\mathcal {I}}^{asym} (t)]{\mathcal {S}}(t) - \vartheta {\mathcal {S}}(t) \\ \frac{dE}{dt}&=\frac{\Theta }{{\mathcal {N}}} [{\mathcal {I}}^{sym}(t) + {\mathcal {I}}^{asym}(t)]{\mathcal {S}}(t) + \varsigma \Theta [{\mathcal {I}}^{sym}(t) \\&\quad + {\mathcal {I}}^{asym}(t)]{\mathcal {V}}(t) - \delta \xi E(t) - (1-\delta )\lambda E(t) \\ \frac{d{\mathcal {I}}^{sym}}{dt}&= \delta \xi E(t) - \eta {\mathcal {I}}^{sym}(t) - \kappa {\mathcal {I}}^{sym}(t)\\ \frac{d{\mathcal {I}}^{asym}}{dt}&= (1- \delta ) \lambda E(t) - \phi {\mathcal {I}}^{asym}(t)\\ \frac{d{\mathcal {Q}}}{dt}&= \phi {\mathcal {I}}^{asym}(t) + \kappa {\mathcal {I}}^{sym}(t) - \omega {\mathcal {Q}}(t) - \Omega {\mathcal {Q}}(t)\\ \frac{d{\mathcal {H}}}{dt}&= \eta {\mathcal {I}}^{sym}(t) - \tau \rho {\mathcal {H}}(t) - (1-\tau ) \Pi {\mathcal {H}}(t)\\ \frac{d{\mathcal {R}}}{dt}&= (1- \tau )\Pi {\mathcal {H}}(t) - \Omega {\mathcal {R}}(t) + \omega {\mathcal {Q}}(t)\\ \frac{d{\mathcal {D}}}{dt}&= \tau \rho {\mathcal {H}}(t)\\ \frac{d{\mathcal {V}}}{dt}&= \vartheta {\mathcal {S}}(t) - \varsigma \Theta [{\mathcal {I}}^{sym} + {\mathcal {I}}^{asym}]{\mathcal {V}}(t) - \Omega {\mathcal {V}}(t) \end{aligned}$$

with non-negative initial conditions \({\mathcal {S}}(0)\ge 0\), \(E(0)\ge 0\), \({\mathcal {I}}^{sym}(0)\ge 0\), \({\mathcal {I}}^{asym}(0)\ge 0\), \({\mathcal {Q}}(0)\ge 0\), \({\mathcal {H}}(0)\ge 0\), \({\mathcal {R}}(0)\ge 0\), \({\mathcal {D}}(0)\ge 0\), and \({\mathcal {V}}(0)\ge 0\). The coefficients are as follows: \(\vartheta\) is the rate at which susceptible individuals become vaccinated; \(\Theta\) is the contact rate at which susceptible individuals become exposed; \(\varsigma\) is the rate at which vaccinated individuals become exposed; \(\delta\) is the fraction of exposed individuals who become symptomatic at rate \(\xi\), while \((1-\delta )\) is the fraction who become asymptomatic at rate \(\lambda\); \(\eta\) is the rate at which symptomatic infected individuals are hospitalized; \(\phi\) is the rate at which asymptomatic infected individuals enter quarantine; \(\kappa\) is the rate at which symptomatic individuals enter quarantine; \(\omega\) is the rate at which quarantined individuals recover; \(\Omega\) is the natural death rate; \(\tau\) \((0\le \tau \le 1)\) is the fraction of hospitalized individuals who die at rate \(\rho\); and \((1-\tau )\) is the fraction of hospitalized individuals who recover at rate \(\Pi\).
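The nine-compartment system can be simulated numerically as in the sketch below; the coefficient values and initial counts are illustrative assumptions, not calibrated parameters, and the population scale of roughly 1000 susceptible individuals mirrors the setting discussed in Sect. 5.

```python
import numpy as np
from scipy.integrate import solve_ivp

def seir_hdqv(t, y, p):
    """Right-hand side of the SEIR-HDQV equations above; p maps each
    coefficient name to its (assumed) per-day rate."""
    S, E, Isym, Iasym, Q, H, R, D, V = y
    N = y.sum()                                # total population
    infectious = Isym + Iasym
    dS = -p["Theta"] / N * infectious * S - p["vartheta"] * S
    dE = (p["Theta"] / N * infectious * S
          + p["varsigma"] * p["Theta"] * infectious * V
          - p["delta"] * p["xi"] * E - (1 - p["delta"]) * p["lambda"] * E)
    dIsym = p["delta"] * p["xi"] * E - p["eta"] * Isym - p["kappa"] * Isym
    dIasym = (1 - p["delta"]) * p["lambda"] * E - p["phi"] * Iasym
    dQ = p["phi"] * Iasym + p["kappa"] * Isym - p["omega"] * Q - p["Omega"] * Q
    dH = p["eta"] * Isym - p["tau"] * p["rho"] * H - (1 - p["tau"]) * p["Pi"] * H
    dR = (1 - p["tau"]) * p["Pi"] * H - p["Omega"] * R + p["omega"] * Q
    dD = p["tau"] * p["rho"] * H
    dV = (p["vartheta"] * S
          - p["varsigma"] * p["Theta"] * infectious * V - p["Omega"] * V)
    return [dS, dE, dIsym, dIasym, dQ, dH, dR, dD, dV]

# Assumed coefficients for illustration only.
params = {
    "Theta": 0.60, "vartheta": 0.01, "varsigma": 0.10, "delta": 0.60,
    "xi": 0.20, "lambda": 0.20, "eta": 0.05, "kappa": 0.10, "phi": 0.15,
    "omega": 0.07, "Omega": 2e-5, "tau": 0.10, "rho": 0.05, "Pi": 0.08,
}

y0 = [990, 5, 3, 2, 0, 0, 0, 0, 0]               # S, E, Isym, Iasym, Q, H, R, D, V
sol = solve_ivp(seir_hdqv, (0, 120), y0, args=(params,), t_eval=np.arange(121))
S, E, Isym, Iasym, Q, H, R, D, V = sol.y
```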

3.3.4 Vector auto regression (VAR)

A VAR model is a technique for modeling dynamics among a set of k variables (Brandt and Williams 2007), also called endogenous variables, over time. The variables are organized in a vector \(Y_t\) of length k. This method focuses on the dynamics of multiple time series and often employs multivariate and multiple regression techniques. When two or more time series are interdependent and the relationships among them are bi-directional, the VAR approach can serve as a prediction framework. The cumulative incidences and fatalities in the selected countries can be estimated using this model. COVID-19 is characterized by an increase in new incidences that is positively related to fatalities: the mortality rate rises in tandem with the number of new cases. A more effective forecasting paradigm can therefore be produced using the VAR process, which integrates both the number of newly diagnosed incidences and fatalities into a single framework. A \(p^{th}\)-order VAR contains lags over the most recent p periods; VAR(p) is an abbreviation for a \(p^{th}\)-order VAR, also expressed as a VAR with p lags. Let \(Y_t = \begin{bmatrix} Y_{t,1} \\ Y_{t,2} \\ \vdots \\ Y_{t,k} \end{bmatrix}\) represent the vector-valued time series consisting of k individual time series. We assume that \(Y_t\) is stationary, which means that the cross-covariance function \(Cov(Y_{t, i}, Y_{r,j})\) depends only on \((r-t)\). The \(p^{th}\)-order VAR model can be stated as follows:

$$\begin{aligned} Y_t= \beta _1 Y_{t-1} + \beta _2 Y_{t-2} +\cdots+\beta _p Y_{t-p} +\epsilon _t \end{aligned}$$

where the terms \((\beta _1, \beta _2,\ldots, \beta _p)\) are the coefficient matrices of the lags of Y up to order p, and \(\epsilon _t\) is the error term of dimension k. For each i of \((i = 0, 1,\ldots, k)\), \(\beta _i\) is a time-invariant matrix of dimension \((k \times k)\). The error terms \(\epsilon _t\) must satisfy three conditions:

\(E(\epsilon _t) = 0\): the mean of each error term is zero. \(E(\epsilon _t \epsilon _t') = {\mathscr {K}}\): the covariance matrix of the error terms is a positive-semi-definite \(k \times k\) matrix denoted by \({\mathscr {K}}\). \(E(\epsilon _t \epsilon '_{t-k}) = 0\): the error terms have no cross-temporal (serial) correlation for any non-zero k.

A time series vector can be defined using the VAR(p) technique for short-term forecasting:

$$\begin{aligned} \begin{bmatrix} Y_t \\ G_t \\ \end{bmatrix} = \beta _1 \begin{bmatrix} Y_{t-1} \\ G_{t-1} \\ \end{bmatrix} + \beta _2 \begin{bmatrix} Y_{t-2}\\ G_{t-2}\\ \end{bmatrix} +\hdots + \beta _{p} \begin{bmatrix} Y_{t-p}\\ G_{t-p}\\ \end{bmatrix} + \begin{bmatrix} \epsilon _{t,1}\\ \epsilon _{t,2}\\ \end{bmatrix} \end{aligned}$$

In this scenario, the numbers of new incidences and fatalities are listed as \(Y_t\) and \(G_t\), respectively. Maximum likelihood estimation is employed to estimate the coefficient matrices \(\beta _{j} = \begin{bmatrix} \beta _{11} & \beta _{12} \\ \beta _{21} & \beta _{22} \end{bmatrix}\).
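A bivariate VAR over new cases and new deaths, in the spirit of the formulation above, can be fitted with statsmodels as sketched below; the country, the lag bound, and the differencing choice are illustrative assumptions.

```python
import pandas as pd
from statsmodels.tsa.api import VAR

df = pd.read_csv("owid-covid-data.csv", parse_dates=["date"])
brazil = (df[df["location"] == "Brazil"]
          .set_index("date")[["new_cases", "new_deaths"]]
          .interpolate())

data = brazil.diff().dropna()                  # work on the stationary differences
split = int(len(data) * 0.75)
train, test = data.iloc[:split], data.iloc[split:]

model = VAR(train)
fitted = model.fit(maxlags=30, ic="aic")       # lag order p chosen by AIC
next_day = fitted.forecast(train.values[-fitted.k_ar:], steps=1)
print(next_day)   # one-step-ahead (next 24 h) differenced cases and deaths
```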

3.3.5 Long short term memory (LSTM)

The most valuable property of the LSTM model is that it maintains an internal memory cell state over the entire input sequence in order to capture temporal correlations. LSTM is a form of artificial neural network (ANN) that is particularly good at solving regression and classification problems. It is a variant of the recurrent neural network (RNN) that can handle long-term dependence, as represented in Fig. 6. The LSTM network is an enhanced version of the RNN (a sequential network) that allows information to persist. As seen in Fig. 6, LSTM cells consist of three sections called gates: the first is known as the forget gate, the second as the input gate, and the third as the output gate (Graves et al. 2005). These three gates pass information into and out of the memory cell, and the memory cell stores values across arbitrarily long time intervals. In a time-series domain such as estimating Coronavirus dissemination, for time \(t = 1\) to N and a given input series \(y = (y_1, y_2,\ldots, y_N)\), the network produces an output series \(h = (h_1, h_2,\ldots, h_N)\), expressed mathematically as in Hochreiter and Schmidhuber (1997).

Fig. 6: Architecture of LSTM

The LSTM cell has three gates of the same shape, which are determined as follows:

$$\begin{aligned} f_t&= \sigma _g (W_{fy} * x_t + V_f* h_{t-1}+ k_f)\\ i_t&= \sigma _g (W_{iy} * x_t + V_i * h_{t-1} + k_i)\\ o_t&= \sigma _g (W_{oy} * x_t + V_o * h_{t-1} + k_o) \end{aligned}$$

The three gates above use a sigmoid activation function, which produces smooth values in the interval between 0 and 1; tanh is the other activation function and has range [-1, 1]. The next step is to transmit new information to the cell state via the input feature x at time t and the hidden state at time \((t-1)\):

$$\begin{aligned} c'_t&= \tanh (W_{cy} * x_t + V_c * h_{t-1} + k_c)\\ c_t&= f_t * c_{t-1} + i_t * c'_t \end{aligned}$$

The current cell output \(h_t\) of the LSTM cell is defined by:

$$\begin{aligned} h_t = o_t * \tanh (c_t) \end{aligned}$$

where \(f_t\), \(i_t\), \(o_t\), \(c_t\), \(h_t\), \(\sigma _g\), \(x_t\), and \(h_{t-1}\) are the forget gate, input gate, output gate, memory cell, hidden state, sigmoid function, input at the current timestamp, and hidden state of the previous timestamp, respectively. The \(c'_t\) is the candidate cell state internal to the LSTM and is used to generate \(h_t\) and \(c_t\). The weights \(W_{fy}, W_{iy}, W_{oy}, W_{cy}\) are associated with the inputs, and \(V_f, V_i, V_o, V_c\) are the weight matrices for the hidden state. The \(k_f, k_i, k_o, k_c\) are the bias terms of the model. The weight matrices and biases are not time-dependent. In this case, LSTM is implemented to detect the dissemination of a pathogenic virus while accounting for uncertainties. The parameter tuning process of the LSTM architecture is discussed in detail in Sect. 5.
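A minimal Keras sketch of such an LSTM forecaster is shown below, using the activation, optimizer, loss, and early-stopping settings reported in Sect. 5; the layer width and the placeholder training arrays are assumptions for illustration, and the real inputs would be the lag windows built in the pre-processing step.

```python
import numpy as np
import tensorflow as tf

def build_lstm(n_lags: int, n_features: int = 1) -> tf.keras.Model:
    """Single-layer LSTM regressor for one-step-ahead forecasting."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_lags, n_features)),
        tf.keras.layers.LSTM(64, activation="relu"),   # width is an assumption
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss="mse", metrics=["mae"])
    return model

n_lags = 30
X_train = np.random.rand(500, n_lags, 1)       # placeholder lag windows
y_train = np.random.rand(500, 1)               # placeholder targets

model = build_lstm(n_lags)
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1,
          callbacks=[tf.keras.callbacks.EarlyStopping(
              patience=5, restore_best_weights=True)])
```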

Table 1 Reported incidences for the next 24 h based on performance metrics for different countries

4 Evaluation metrics

The root mean square error (RMSE) and mean absolute error (MAE) values are employed to evaluate the effectiveness of the models. Performance is optimal for estimated models with the lowest RMSE and MAE. The mathematical formulation of this performance evaluator is as follows:

$$\begin{aligned} RMSE&= \sqrt{ \frac{1}{n}{\sum _{i=1}^n (Y_{pre} - Y_{act})^2}}\\ MAE&= \frac{1}{n}{\sum _{i=1}^n abs(Y_{pre} - Y_{act})} \end{aligned}$$

where n is the number of observations, \(Y_{pre}\) represents the predicted values, while \(Y_{act}\) represents the actual values.

The mean absolute percentage error (MAPE) has been computed by adding percentage errors without regard to sign. It expresses the error as a percentage. Furthermore, the problem of positive and negative inaccuracies canceling out has been avoided because absolute percentage errors are employed.

$$\begin{aligned} MAPE = \frac{100\%}{n}\sum _{t= 1}^{n} \left\vert {\frac{O_t - F_t}{O_t} } \right\vert \end{aligned}$$

where n is the number of observations. The actual and forecasted values of the models are \(O_t\) and \(F_t\), respectively.
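The three metrics can be computed directly, as in the short sketch below; the numbers in the usage example are arbitrary.

```python
import numpy as np

def rmse(actual: np.ndarray, predicted: np.ndarray) -> float:
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

def mae(actual: np.ndarray, predicted: np.ndarray) -> float:
    return float(np.mean(np.abs(predicted - actual)))

def mape(actual: np.ndarray, predicted: np.ndarray) -> float:
    """Mean absolute percentage error; requires non-zero actual values."""
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

actual = np.array([100.0, 120.0, 90.0])
predicted = np.array([110.0, 115.0, 95.0])
print(rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```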

5 Results and discussion

The experiments for this paper were performed using Google Colaboratory, which uses Python 3.7 and offers a single GPU cluster with an NVIDIA K80 GPU, 12 GB of RAM, and a clock speed of 0.82 GHz. The estimation techniques for the outbreak are applied using the methodology discussed in Sect. 3. Estimating daily reported cases and fatalities could assist in real-time strategic planning. The OWID-COVID dataset, which contains information on 163 nations, is utilized to carry out the experiments (Cameron Appel and Beltekian 2019). For this study, the top eight nations have been selected based on their Worldometer rankings for maximum fatalities (Worldometer 2020). It is an open-source repository containing the most recent information, such as newly reported cases, mortality, vaccination, hospitalization, and other attributes of COVID-19 vaccination progress worldwide. It has approximately 0.3 million instances with 67 different features. Data processing must be performed to satisfy data integrity requirements and transform nominal data into numeric data. Further, the impact of climatic factors on the dissemination of pathogenic viruses has also been studied through a case study in India. The study utilizes meteorological data on temperature and relative humidity gathered from the Central Pollution Control Board (CPCB) (Room 2022).

Fig. 7: The tensorboard log file output for hyperparameter tuning through grid search

This section addresses the individual impacts of COVID-19 along with the consequences of vaccination once it began. All forecasting models (LSTM, VAR, and SEIR-HDQV) learn from historical data to predict recorded infected incidences and fatalities for the next 24 h. The dataset has been split in a ratio of 75%:25% for training and testing purposes. Among all the models, LSTM requires an optimal set of hyperparameters for effective forecasting. Extensive experiments were run to find an optimal combination of hyperparameters using grid search; the outcome of this experiment can be seen in Fig. 7. The combination of parameters that led to the lowest MAE on the training dataset is highlighted by the green line. The final model was trained on the training data using these hyperparameters (epochs: 50 with early stopping; activation function: relu; optimizer: adam with learning rate 0.01; loss function: MSE).

Fig. 8: Forecasting of new cases with (a) VAR, (b) LSTM, (c) SEIR-HDQV model

Table 2 Recorded fatalities for the next 24 h based on performance metrics for different countries

The outcomes of the experiments in terms of error metrics are presented in Tables 1 and 2 for recorded confirmed and mortality cases, respectively. The forecasts achieved by the models for new incidences in Brazil are illustrated in Fig. 8; the visualization indicates that the LSTM-forecasted values are relatively close to the actual values. Figure 8c shows that, for the enhanced version of SEIR (i.e., SEIR-HDQV), infectious, cured, and fatal individuals increase considerably over 30 days of exposure within a population \({\mathcal {N}}\) starting from 1000 susceptible individuals, while susceptible individuals diminish.

Fig. 9: Different models' error metrics on the OWID-COVID dataset: (a) RMSE for new cases, (b) RMSE for new deaths, (c) MAE for new cases, (d) MAE for new deaths

Table 1 analyzes the effectiveness of the models on the reported and recorded incidences in relation to the performance metrics, and Table 2 presents the mortality outcomes for the selected nations. According to the analysis, the RMSE and MAE values for confirmed new incidences and fatalities in Russia are the smallest, whereas the values for the United States are comparatively higher across all models. The LSTM model achieved the lowest RMSE and MAE values for cumulative reported and deceased cases.

Fig. 10: Models' error metrics for the OWID-COVID dataset using MAPE: (a) new cases and (b) new deaths

Figures 9 and 10 depict box plots for the visual interpretation of the values in Tables 1 and 2. For the new instances in Fig. 9a, the RMSE values range from 0 to 0.3 million; the corresponding box plot median values for the VAR, LSTM, and SEIR-HDQV models are 18,305, 9707, and 28,909, respectively. The RMSE values for the fatalities range from 0 to 1600, with median values of 250.635, 92.888, and 139.964, respectively, as seen in Fig. 9b. In both scenarios, the LSTM medians are the lowest for recorded incidences and fatalities over the next 24 h. The MAE values for reported incidences and fatalities display a similar pattern in Fig. 9c and d. Using MAPE, the performances of the different models can be compared on the same scale to identify the most efficient one. The median MAPE values for reported incidences and fatalities range from 0.4 to 3.0 and 0.2 to 2.4, respectively, as visualized in Fig. 10a and b. The median MAPE values for LSTM are the lowest among all models, i.e., 1.026 for reported incidences and 1.022 for fatalities. These outcomes demonstrate that LSTM outperforms the other models for forecasting purposes.

Fig. 11: (a) COVID-19 deaths worldwide on a daily basis and (b) COVID-19 cases increase since inception

Impact of Vaccination: Fig. 11 illustrates a progressive improvement in epidemic management and a leveling of the curves of daily reported and fatal cases. Due to multiple waves occurring at intervals, the number of incidences in various regions of the globe rose gradually. Large-scale vaccination programs were launched when the COVID-19 vaccine was released publicly at the beginning of 2021. Even so, vaccine availability issues, limited vaccine reach among large populations, and the time required for vaccines to confer immunity all slowed the early reduction in fatalities. Figure 11a and b show that there have been fewer fatalities despite an increase in the incidence of Coronavirus variants such as Omicron, owing to the development of antiviral vaccination. The peak of incidences, which took place between days 700 and 800, can be seen in Fig. 11b. However, as shown in Fig. 11a, the number of fatalities did not increase during this period, which underscores the vaccine's role in curbing the severity of the disease.

Different studies have been carried out to predict confirmed incidences and mortality. Xu et al. (2022) applied LSTM, CNN, and CNN-LSTM deep learning techniques to predict daily confirmed incidences and mortality in India, Russia, and Brazil for the span from July 14 to July 31, 2021. For these three countries, the MAE of the LSTM predictions is 8949, 1198, and 15275, respectively, while the CNN-LSTM model's MAE for these nations is 3214, 572, and 4321, respectively. Additionally, based on the RMSE and MAPE of various models from June 21 to July 10, 2021, Verma et al. (2022) predicted the Coronavirus incidences for the next (7, 14, 21) days; the models' RMSE and MAPE values were (4067.74, 4385.09, 4431.91) and (7.95, 8.1, 8.75), respectively. Based on extensive experiments, the proposed model shows minimal errors with respect to MAE, RMSE, and MAPE.

The computational complexity analysis measures how long each studied model takes to process its input. Among the studied models, the training time complexity of VAR is \(O(m*n^3)\), where m is the number of features and n is the number of training instances; it therefore works better for small datasets, and the training time grows rapidly as the data size increases. For LSTM, the complexity is O(w) per time step, where w represents the number of weights. Further, the time complexity of SEIR-HDQV is \(O(m*n)\), where m and n represent the number of features and the number of training instances, respectively.

The proposed model includes features such as hospitalization, death, quarantine, and vaccination, which have been ignored by other models in this domain. Each feature has been assigned a weight, and this information has been incorporated into the model. Further, meteorological data have also been analyzed to visualize the effect of weather on virus dissemination.

Fig. 12: Temperature and humidity analysis for reported incidences in the states of (a) Kerala and (b) Delhi

The models' shortcomings become apparent when a pathogenic virus mutates, since the extent of the epidemic then becomes highly dynamic and vulnerable to unanticipated occurrences such as lockdowns, vaccination circulation, and mutations. As a result of these drastic changes, model analysis and projected future outcomes may become less reliable. The compartmental SEIR-HDQV model does not consider other elements that might affect the transmission of Coronavirus, such as environment, transportation limitations, and underlying medical problems in communities.

Fig. 13: Temperature and humidity analysis for reported incidences in the states of (a) Gujarat and (b) Bihar

Temperature and Humidity Analysis:

The pathogenic virus's morbidity may be influenced by environmental factors such as temperature, relative humidity, and air pollution (Ma et al. 2020; Tosepu et al. 2020; Xie and Zhu 2020; Shrivastav and Jha 2021). The climates of various countries differ widely, and the infectious virus mutates differently under various environmental conditions. Since different geographical locations within a country may experience different temperatures and humidity levels, there is no single global index for a nation's temperature and humidity. This study therefore analyzes Coronavirus dissemination at diverse temperatures and humidity levels during the Indian epidemic.

Geographic locations in India, namely Kerala, Delhi, Gujarat, and Bihar, are selected to cover a wide range of temperatures and humidity levels; these states lie in distinct regions of the country. The hazardous virus's transmission has been documented for each day from March 14, 2020, to July 12, 2020, in the selected locations. Figures 12 and 13 show the prevalence of the epidemic in the four states as ambient temperature and relative humidity vary. The confirmed cases are stable at 28.5 °C and relative humidity above 50%, as shown in Fig. 12a. Kerala's reported incidences increased with a drop in temperature and a reduction in humidity, whereas the confirmed incidences peaked when the temperature remained consistent while the humidity varied. From March 14 to April 3, 2020, Delhi's recorded cases remained consistent at 20-25 °C and high humidity; however, the reported incidences increased as humidity and temperature rose, as shown in Fig. 12b. Figure 13a and b show a similar analysis for Gujarat and Bihar (states in western and eastern India, respectively).

6 Conclusion

This paper analyzes the performance of mathematical learning models, including SEIR-HDQV, VAR, and LSTM, for forecasting incidences and fatalities over the ensuing days. The error metrics RMSE and MAE show a higher variance across countries, owing to differences in population. A comparison of all models shows that LSTM has the lowest MAPE score, demonstrating that it outperforms the others. The global impact of pathogenic virus vaccinations indicates that, as vaccine doses are delivered to individuals throughout the epidemic, the mortality rate drops even when reported incidences increase considerably. Furthermore, the effect of ambient temperature and relative humidity on reported incidences in India demonstrates that climatic factors influence pathogenic virus transmission. Similar models may also be employed to forecast the number of hospitalized patients, quarantined patients, and so forth.

Future Directions: Future studies can examine the drop in the death rate during Coronavirus waves in nations where most of the population is vaccinated versus those where people were not adequately vaccinated. Further, researchers may focus on the vaccine’s lasting efficacy, the creation of booster shots, and its effect on recently developed variants. Additionally, researchers may study Coronavirus fatality and morbidity rates affected by levels of air pollution in distinct geographical locations. Furthermore, it might also be possible to create effective interventions in the future based on the enduring psychological consequences of the pandemic on individuals and communities.