1 Introduction

In December 2019, the SARS-CoV-2-induced Coronavirus Disease (COVID-19) began spreading across the globe, infecting countless individuals and causing severe health problems. On March 11, 2020, the World Health Organization (WHO) declared the disease a pandemic (Organization WH 2019), as the virus is highly contagious and pathogenic. The infection severely affects an individual's respiratory system, and because it is readily transmissible among humans it escalated into a global pandemic. It spreads through liquid particles expelled from an infected person's mouth or nose while coughing, sneezing, speaking, singing, or breathing; these particles range from larger respiratory droplets to fine aerosols. The Coronavirus spread rapidly across all regions of the world. Several vaccines were evaluated by drug regulatory authorities worldwide to curb the epidemic and reduce the risk of contracting the disease. As of November 15, 2021, WHO had approved the following vaccines that fulfilled the requirements for safety and effectiveness: AstraZeneca/Oxford Vaccine/Covishield, Johnson & Johnson, Moderna, Pfizer, Sinopharm, Sinovac, and COVAXIN (WHO 2020). Although vaccines act as a protective barrier against symptoms, they have a few adverse effects as well; the most common are arm soreness, mild fever, tiredness, headaches, and muscle or joint aches.

Estimating future values is a vital part of data science and automation technologies that use historical data to develop a model; future values can then be extrapolated from the model. Epidemiological data are collected periodically (e.g., daily, weekly, or monthly), so they constitute time-series data: a series of time-ordered data points associated with one or more time-dependent variables. By analyzing a time series, it is possible to study the nature of time-dependent data and predict future values from the past variability of the data. Machine learning (ML) approaches are considered effective in forecasting time-series data. These approaches can assist in rapidly identifying potential cases and fatalities. Further, they can help estimate recorded incidences with a high risk of pathogenic virus transmission and monitor their outbreak. These algorithms can process data on Coronavirus patients, giving clinicians more time and assurance while treating a critical illness (Tuli et al. 2020). ML has been considered one of the most promising computing approaches, with significant potential for epidemic forecasting. Several recent studies have highlighted the tremendous potential of ML algorithms to combat pathogenic viruses (Alimadadi et al. 2020; Ardabili et al. 2020; Miralles-Pechuán et al. 2020). ML algorithms have been used efficiently for mitigation and prevention, including the identification of new pathogens, classification of novel pathogens, diagnosis, survival prediction, and intensive care unit (ICU) demand prediction (Randhawa et al. 2020; Rao and Vazquez 2020; Yan et al. 2020; Grasselli et al. 2020). Past research utilized statistical and deep learning models such as vector autoregression (VAR) and long short-term memory (LSTM) to evaluate and forecast the dynamic trajectory of the epidemic. Modeling frameworks employing machine learning or deep learning methods can exploit the adaptability of these analytical methodologies to forecast temporal dynamics.

A predictive model can help estimate the disease's probable trajectory based on features (feedback data). These data can be used to forecast factors such as the number of new cases and fatalities and to analyze the severity of the outbreak. One widely used approach to examining COVID-19 dissemination is the susceptible-exposed-infected-recovered (SEIR) model. As an ordinary differential equation-based model, it struggles to account for geographical processes and spatial heterogeneity. At the same time, human movement patterns today are more observable and better understood than ever, so failing to incorporate modern movement patterns may lead to improbable estimations. This study utilizes state-of-the-art deep learning and statistical models, including VAR and LSTM, together with an enhanced version of the SEIR model, to improve forecasting accuracy. The sequential SEIR model has been extended into the susceptible-exposed-infected-recovered-hospitalized-death-quarantined-vaccination (SEIR-HDQV) model for effectively estimating fatalities and incidences. These models are employed to anticipate the recorded incidences and casualties caused by the hazardous virus over the next 24 h.

Viruses and climate variability affect marginalized communities disproportionately. A variety of factors can influence COVID-19 virus dissemination, including climate conditions. The epidemiological dynamics of this kind of infectious disease may be altered by environmental exposures, such as short- and long-term climatic variations. The pace of Coronavirus dissemination varied from country to country during the epidemic, and one possible factor is the weather. A few studies found that the incubation period of this disease, along with its spatial distribution, was significantly affected by climate and weather conditions (Mao et al. 2022; Karim and Akter 2022). Temperature and humidity vary widely across nations, so it is inappropriate to treat climate conditions as similar everywhere. Thus, the study focuses on the meteorological conditions in the selected country over the span of the Coronavirus outbreak. Accordingly, meteorological data have been evaluated to assess the impact of climate on recorded incidences in India during the epidemic. The main contributions of this paper are as follows:

  • It presents a framework utilizing time series forecasting models to forecast the new incidences and fatalities due to the contagious virus for the next 24 h.

  • This paper examines the impact of temperature and humidity on the dissemination of COVID-19 in India under various climate conditions.

  • It extends the compartmental SEIR model, incorporating an individual's vulnerability to the hazardous pathogen, for effective prediction of incidences and fatalities.

  • This paper includes data on the eight countries most severely affected by the COVID-19 pandemic.

  • This paper analyzes the fatalities associated with Coronavirus before and after the vaccine became available for the disease.

The rest of the paper is organized as follows: A literature survey is presented in Sect. 2. Section 3 explains the methodology and implementation of the forecasting models. The model evaluation is presented in Sect. 4. Section 5 provides results and discussion and analyzes the effect of climate during the outbreak of Coronavirus. Section 6 contains the conclusion and future directions.

2 Literature survey

Time-series data have been widely utilized in several application areas, including weather forecasting (Wu et al. 2020), earthquake prediction (Xue et al. 2021), signal processing (Wang et al. 2023), pattern recognition (Wu et al. 2023), and other domains. The COVID-19 outbreak has been studied through several neural network-based, quantitative, and time series methods to anticipate infection incidence, fatalities, and evolution. Many unknown aspects condition the current pandemic's expansion, including the virus's unique physiology, human behavior, and differing national policies. The studies (Wu et al. 2020; Irfan et al. 2022) highlight the effects of temperature and humidity on the spread of the pandemic across lower and higher quantiles. The influence of weather on Coronavirus has been explored in studies (Gupta et al. 2020; Mousavi et al. 2020; Singh et al. 2023; Wang et al. 2020; Auler et al. 2020), which mainly consider temperature, relative humidity, wind speed, rainfall, solar irradiation, transmission rate, daily new confirmed cases, and mortality rate. The number of incidences caused by the novel Coronavirus is found to be correlated with humidity and temperature. In tropical nations, temperature has a minimal impact on Coronavirus case-to-mortality ratios. Mohammadi et al. (2020) also examined the association between weather, the spread of Coronavirus, and the number of fatalities across several states of the USA. Rashed et al. (2020) analyzed the spread of the pathogenic virus using multivariate analysis based on ambient temperature, relative humidity, and population density.

The related vaccine is one of the most effective weapons in the fight against the pathogenic virus. Vaccination induces antibodies in humans that are strong enough to stop the disease from spreading. Ong et al. (2020) developed promising vaccine candidates utilizing reverse vaccinology and machine learning techniques. Reverse vaccinology (RV) attempts to find potential vaccine candidates via genetic analysis and has revolutionized vaccine development. Vaccines that comprise the complete virus can elicit immunity and defend against infection. Cotfas et al. (2021) studied public sentiment on vaccination during the period between the first vaccine announcement and the first vaccination in the United Kingdom, throughout which civil society showed an enormous focus on the vaccination drive. Liu et al. (2021) concentrated on numerous vaccine hesitancy analyses and news reports. They presented a comparative study of three classifiers: the Naive Bayes classifier, the support vector machine (SVM), and logistic regression (LR). SVM with term frequency-inverse document frequency (TF-IDF) and the Synthetic Minority Oversampling Technique (SMOTE) performed best among all. The accuracy of SVM and LR across the 12 classes is fairly stable, whereas the accuracy of Naive Bayes fluctuates substantially.

Sadik et al. (2020) analyzed different methods for forecasting the viral outbreak in Bangladesh. They used the Susceptible, Infected, and Recovered (SIR) model to predict the pandemic; the model's outcome is inadequate for long-term prediction due to the inconsistency of the affecting factors. Furthermore, they utilized three machine learning models, Polynomial Regression (PR), LSTM, and Multilayer Perceptron, to predict the number of infections, deaths, and recoveries. Rauf et al. (2021) discussed an optimized LSTM model to forecast confirmed cases of the pathogenic virus based on mean absolute error (MAE). They compared the recurrent neural network (RNN), non-optimized LSTM, gated recurrent unit (GRU), and recent state-of-the-art algorithms; the LSTM models outperformed the other algorithms in terms of accuracy. Shastri et al. (2021) studied optimized deep learning ensemble models to analyze confirmed and death cases in India; the mean absolute percentage error (MAPE) values for reported incidences and fatalities are 2.40 and 1.11, respectively. Mishra et al. (2022) analyzed the number of fatalities against the regular growth of Coronavirus-infected individuals during the epidemic, including the days when a vaccine was available, by employing a deep learning approach. Using another machine learning approach, Agarwal and Dutta (2022) analyzed vaccination, predicting a mortality rate of 15.53% and a reduction in confirmed cases of 24.67%.

With the help of extreme learning machines (ELMs) and the Chimp optimization algorithm, Hu et al. (2021) and Cai et al. (2023) developed real-time COVID-19 diagnosis based on chest X-ray images. They categorize the chest X-ray images in two steps: a deep CNN first extracts features, and ELMs then determine the diagnosis. Saffari et al. (2022) utilized artificial intelligence to detect COVID-19 from X-ray images. They employed the whale optimization algorithm (WOA) within a fuzzy system for training a deep convolutional neural network (DCNN); DCNN with particle swarm optimization, DCNN with a genetic algorithm, and the LeNet-5 benchmark model were employed for comparison. Ustebay et al. (2023) described prognostic and diagnostic paradigms of COVID-19 to support clinical decision-making. They used eight ML algorithms and reported that the extra trees and CatBoost classifiers outperformed the other studied models. Subudhi et al. (2021) assessed the effectiveness of eighteen ML algorithms for forecasting ICU admission and mortality among COVID-19-infected individuals; ensemble-based models were found to predict COVID-19 mortality more accurately than the other models. Dietterich (1998) examined statistical tests for comparing supervised learning algorithms. The study by Xing et al. (2022a) proposes a robust semi-supervised time-series classification (TSC) approach with self-distillation, a hybridization of supervised, unsupervised, and self-distillation (SD) techniques. An efficient federated distillation learning system (EFDLS) for multitask TSC is presented by Xing et al. (2022b), which uses a central server to enable numerous mobile users to carry out various TSC tasks. Xiao et al. (2021) suggested a robust temporal feature network (RTFN) with an LSTM-based attention network (LSTMaN); the RTFN-based frameworks perform better for both supervised and unsupervised learning.

Existing statistical epidemiological forecasting models, such as SIR and SEIR, rely on a limited set of features to predict new incidences and mortality. For better prediction of a pandemic, additional factors such as hospitalization, quarantine, and symptomatic and asymptomatic incidences may be included as features. According to the literature survey, deep learning models are the most effective at forecasting epidemiology, and an updated feature set can further refine them for more accurate forecasting. Moreover, because countries cover large geographical areas, temperature and humidity vary with location, and no single index represents the entire country. Studies examining the climate impact on recorded COVID-19 incidences typically used only one region, so they may not account for variations in temperature and humidity between locations.

Fig. 1: Proposed methodology

3 Methodology

The studies were carried out in several steps involving short-term forecasts of recorded incidences (cases) and fatalities, the impact of vaccination on mortality, and climate effects on the spread of the pathogenic virus. The OWID-COVID dataset contains information on the COVID-19 epidemic, including incidences, fatalities, hospitalizations, vaccinations, and so forth. The data are recorded at regular (daily) intervals, making them a time series. This time series dataset is used for forecasting new cases, fatalities, and vaccination impacts.

The proposed methodology workflow is depicted in Fig. 1. A time series must first be transformed into a supervised ML problem before it can be estimated. In the ML approach, raw data must be pre-processed before being used for model training. During the pre-processing step, missing values are eliminated and regular observations are arranged into a single vector. Correlation matrix analysis and feature selection are then performed as part of the pre-processing step to prepare the data for training. Principal component analysis (PCA) is applied to the dataset to map the features onto a lower dimension, while normalization and smoothing are employed to transform the data. The time series data is shifted using the appropriate lag values to make the dataset suitable for forecasting with supervised learning algorithms. The dataset is split into 75% for model training and 25% for testing. In the statistical and deep learning categories, VAR and LSTM are regarded as state-of-the-art time series forecasting models and are utilized for effective time series forecasting. The compartmental SEIR model has been enhanced and transformed into the SEIR-HDQV model for efficient Coronavirus outbreak prediction. Multiple evaluation parameters are employed to evaluate the forecasting models against the testing data, including root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). Further, the OWID-COVID dataset is combined with a climate dataset to analyze the effects of temperature and humidity on the spread of the Coronavirus. The subsequent sections provide a detailed explanation of the adopted methodology.
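To make the lag-shift conversion and the 75%/25% split concrete, a minimal Python sketch is given below; the file name, the choice of India, the `new_cases` column, and the thirty-day lag window are illustrative assumptions rather than the exact pipeline.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def make_supervised(series: pd.Series, n_lags: int = 30) -> pd.DataFrame:
    """Shift a univariate series so each row holds the previous n_lags
    observations as inputs and the current observation as the target."""
    frame = pd.concat(
        [series.shift(i) for i in range(n_lags, 0, -1)] + [series], axis=1)
    frame.columns = [f"lag_{i}" for i in range(n_lags, 0, -1)] + ["target"]
    return frame.dropna()            # drop rows left incomplete by the shift

# Assumed file/column names; OWID data has one row per country per day.
df = pd.read_csv("owid-covid-data.csv", parse_dates=["date"])
cases = (df[df["location"] == "India"]
         .set_index("date")["new_cases"]
         .interpolate())             # simple missing-value handling

scaled = pd.Series(MinMaxScaler().fit_transform(
    cases.values.reshape(-1, 1)).ravel(), index=cases.index)

supervised = make_supervised(scaled, n_lags=30)
split = int(len(supervised) * 0.75)  # 75% training, 25% testing
train, test = supervised.iloc[:split], supervised.iloc[split:]
```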

3.1 Pre-processing

Trend/Seasonality Removal: The trend is the part of a time series that depicts low-frequency fluctuations after high- and medium-frequency variations are removed (Maurya and Singh 2020). Seasonality represents the property of a time series during the epidemic in which the data show predictable, recurring variations throughout the outbreak. A differencing technique can be applied to remove trend and seasonality from a time series. By introducing the lag-\({\mathcal {H}}\) difference operator \(\nabla _ {\mathcal {H}}\), seasonality of period \({\mathcal {H}}\) can also be handled for non-seasonal models; the operator is defined as:

$$\begin{aligned} \nabla _{\mathcal {H}} {\mathcal {Y}}_t = {\mathcal {Y}}_t - {\mathcal {Y}}_{t-{\mathcal {H}}} \end{aligned}$$

Applying the operator \(\nabla _ {\mathcal {H}}\) to the model

$$\begin{aligned} {\mathcal {Y}}_t = {\mathcal {N}}_t + {\mathcal {R}}_t + {\mathcal {X}}_t \end{aligned}$$

where \({\mathcal {R}}_t\) is the seasonal component with period \({\mathcal {H}}\), we get the equation:

$$\begin{aligned} \nabla _ {\mathcal {H}} {\mathcal {Y}}_t = {\mathcal {N}}_t - {\mathcal {N}}_{t- {\mathcal {H}}} + {\mathcal {X}}_t - {\mathcal {X}}_{t-{\mathcal {H}}} \end{aligned}$$

which decomposes the difference \(\nabla _ {\mathcal {H}} {\mathcal {Y}}_t\) into a trend component \(({\mathcal {N}}_t - {\mathcal {N}}_{t-{\mathcal {H}}})\) and a noise term \(({\mathcal {X}}_t - {\mathcal {X}}_{t- {\mathcal {H}}})\), thereby removing the seasonality.
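As a minimal illustration of the lag-\({\mathcal {H}}\) differencing above, the sketch below applies it to a synthetic daily series with a weekly cycle; the period and data are assumptions used only for demonstration.

```python
import numpy as np
import pandas as pd

def seasonal_difference(y: pd.Series, period: int) -> pd.Series:
    """Lag-H differencing: returns y_t - y_{t-H}, cancelling a seasonal
    component of period H (and differencing the trend term)."""
    return (y - y.shift(period)).dropna()

# Toy series with an upward trend and a weekly (period-7) cycle.
t = np.arange(120)
y = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 7)
              + np.random.normal(0, 1, t.size))

weekly_removed = seasonal_difference(y, period=7)          # removes the weekly cycle
detrended = seasonal_difference(weekly_removed, period=1)  # removes the residual trend
```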

3.2 Feature selection

Irrelevant features of a dataset should be removed to minimize the computational complexity of modeling, which may also enhance the model's performance. Domain knowledge can be used to eliminate irrelevant features, and the feature set can then be further reduced using dimensionality reduction techniques. PCA (Wold et al. 1987) is an efficient method for dimensionality reduction. This algorithm converts dataset features into principal components with linearly uncorrelated characteristics. It uses eigenvalues to compress the dimension of the instances while preserving crucial information (Johnstone 2001). The orthogonal linear transformation maps the features to a new coordinate system such that the greatest variance falls on the first coordinate (first principal component), the second-greatest variance on the second coordinate (second principal component), and so on. Because the scale of the variables affects the output of PCA, the data are standardized first. The algorithm proceeds in multiple steps:

Step 1 - Standardize the range of the continuous initial variables so that each one contributes equally to the analysis.

$$\begin{aligned} {\mathcal {S}}& = \frac{Feature \, Value \mathcal {(F)} - Mean \mathcal {(M)} }{Standard\, Deviation (\sigma )} \end{aligned}$$

Step 2 - Determine the correlations by calculating the covariance matrix.

$$\begin{aligned}&\begin{pmatrix} Var(Y_1) & \cdots & Cov(Y_1, Y_p)\\ \vdots & \ddots & \vdots \\ Cov(Y_p,Y_1) & \cdots & Var(Y_p) \end{pmatrix} \\&Var(Y_i) = \sum _{k=1}^{p} \sum _{l=1}^{p} e_{ik} e_{il} \sigma _{kl}\\&Cov(Y_{i-1}, Y_i) = \frac{1}{{\mathcal {N}}} \sum _{i=1}^{p} (Y_{i-1} - {\mathcal {M}}) (Y_i - {\mathcal {M}}) \end{aligned}$$

Here, \(Y_i\) is the random function (a predicted form of \((X_1, X_2,\ldots,X_p)\)) of the OWID-COVID instances, and \(e_{ip}\) is the regression coefficient, whereas \({\mathcal {M}}\) is the mean of the instances. \({\mathcal {N}}\) represents the total number of instances of the dataset.

Step 3 - Compute the eigenvalues and eigenvectors of the given covariance matrix to identify the principal components of the Coronavirus instances.

$$\begin{aligned} Cov(Y_i, Y_j) = \sum _{k=1}^{p} \sum _{l=1}^{p} e_{ik} e_{jl} \sigma _{kl} \end{aligned}$$

Step 4 - Formulate a feature vector to decide which principal components to keep.

Fig. 2: Feature analysis of OWID-COVID dataset using PCA

$$\begin{aligned} Cov(Y_{i-1},Y_i) = \sum _{k=1}^{p} \sum _{l=1}^{p} e_{(i-1)k} e_{il} \sigma _{kl} \end{aligned}$$

Figure 2 presents the PCA analysis of the OWID-COVID dataset. Based on domain knowledge, fourteen features are considered for further experiments. Additionally, PCA is applied to the selected features, and rigorous experimentation shows that five principal components are sufficient to make the classes distinct and separable. In the first principal component, the variance of new vaccinations, total vaccinations, people vaccinated, and people fully vaccinated is highest. It shows a positive correlation between total deaths, ICU patients, and hospitalized patients, while the positive rate, excess mortality, and cardiovascular death rate have low variances. The second principal component has higher variance for the aged-65-older and aged-70-older features. It shows that total vaccinations, people vaccinated, and people fully vaccinated are positively correlated, while excess mortality and cardiovascular death rate are strongly negatively correlated.
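A minimal sketch of this step is shown below, assuming a fourteen-column subset of the OWID-COVID data (the listed feature names are illustrative of, not necessarily identical to, the set chosen by domain knowledge); the columns are standardized before projecting onto five principal components because PCA is scale-sensitive.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumed fourteen-feature subset of the OWID-COVID columns.
features = [
    "new_cases", "new_deaths", "total_deaths", "icu_patients", "hosp_patients",
    "new_vaccinations", "total_vaccinations", "people_vaccinated",
    "people_fully_vaccinated", "positive_rate", "excess_mortality",
    "cardiovasc_death_rate", "aged_65_older", "aged_70_older",
]

df = pd.read_csv("owid-covid-data.csv")
X = df[features].dropna()

X_std = StandardScaler().fit_transform(X)      # PCA is sensitive to scale
pca = PCA(n_components=5)
components = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)           # variance captured per component
loadings = pd.DataFrame(pca.components_, columns=features)  # feature loadings
```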

A correlation heatmap illustrates the relationships between variables by visualizing a correlation matrix. The correlation between the variables on each axis is shown in each square and varies from -1 to 1. The entire dataset is examined to see whether the variables are related. By building a heatmap portraying the distribution of various factors (such as diabetes, new cases, new fatalities, and so on) around the world, the correlation coefficients have been ascertained. In Fig. 3, a strong positive association is observed: new cases per week are strongly associated with new deaths and positively correlated with new cases. The number of newly reported cases per million, however, is negatively associated with overall fatalities and inversely related to new deaths and the number of deaths per week.
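A heatmap of this kind can be generated with a few lines, as sketched below; the column subset is an illustrative assumption, not the exact set plotted in Fig. 3.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("owid-covid-data.csv")
cols = ["new_cases", "new_deaths", "new_cases_per_million",
        "total_deaths", "diabetes_prevalence"]            # assumed subset

corr = df[cols].corr()                                    # Pearson, in [-1, 1]
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix among selected features")
plt.tight_layout()
plt.show()
```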

Fig. 3: Correlation matrix among features

3.3 Forecasting models

A time series \(\{Y_t \vert t \in T\}\) can be used to examine the outbreak for the OWID-COVID instances collected over a time interval T, i.e., the set of Coronavirus instances ordered through time. T denotes the index set of the dataset, which is discrete and evenly spaced in time. The random variable \(Y_t\) represents the dataset features at any time t. Let \(i\in {\mathbb {N}}\), \(T \subseteq {\mathbb {R}}\); a function \(y: T \rightarrow {\mathbb {R}}^{i}\), \(t \xrightarrow {y_t} {\mathbb {R}}^{i}\), or, equivalently, a set of indexed elements of \({\mathbb {R}}^{i}\),

$$\begin{aligned} \{y_{t} \vert y_{t} \in {\mathbb {R}}^i, t \in T\} \end{aligned}$$

is an observed time series. The mean function of the features is defined as:

$$\begin{aligned} \mu _{t} = E[Y_t], \forall t \in T. \end{aligned}$$

For a time series process \((Y_t)\), the variance function of the features is defined, \(\forall t \in T\), as:

$$\begin{aligned} \sigma _{t}^2 = Var [Y_t] = E[Y_{t}^2] - [E[Y_t]]^2, \forall t \in T \end{aligned}$$

For OWID-COVID instances, we assume that the mean and variance are constant. Therefore, the estimates are:

$$\begin{aligned} {\hat{\mu }}& = \frac{1}{n} \sum _{t=1}^{n} Y_t, \\ {{\hat{\sigma }}}^2& = \frac{1}{n-1} \sum _{t=1}^{n} (Y_t - {\hat{\mu }})^2 \end{aligned}$$

The covariance and correlation functions define the level of dependency between the two features (random variables) \(X_p\) and \(X_q\) to the dataset. Let \(\gamma _{p,q}\), and \(\rho _{p,q}\) be the auto-covariance function (ACVF) and auto-correlated function (ACF) of the dataset, then the time series of the features \(\{X_p, X_q \vert p, q \in T\}\) is defined as:

$$\begin{aligned} \gamma _{p,q}&= Cov[X_p,X_q] = E[(X_p- E[X_p])(X_q - E[X_q])]\\&= E[X_p X_q] - E[X_p]E[X_q] \\ \rho _{p,q}&=Corr[X_p,X_q] = \frac{Cov[X_p,X_q]}{\sqrt{Var[X_p]Var[X_q]}} \end{aligned}$$

For any two sets of features \((r_1, r_2,\ldots, r_n)\) and \((s_1,s_2,\ldots,s_n)\) of a dataset, where n represents the number of instances, the sample covariance and correlation functions are given as:

$$\begin{aligned} {{\hat{\gamma }}}_{r,s}&= \frac{1}{n-1} \sum _{t=1}^{n} (r_t - \bar{r})(s_t - \bar{s}) \end{aligned}$$
(1)
$$\begin{aligned} {{\hat{\rho }}}_{r,s}&= \frac{\sum _{t=1}^{n} (r_t - \bar{r})(s_t - \bar{s})}{\sqrt{ \sum _{t=1}^{n}{(r_t -\bar{r})}^2 \sum _{t=1}^{n}{(s_t - \bar{s})}^2}} \end{aligned}$$
(2)

where \({{\hat{\rho }}}_{r,s}\) is the ACF of the stochastic process of the instances. For time series instances, the ACVF and ACF measure the covariance/correlation between the single time series \((r_1, r_2,\ldots,r_n)\) and itself at different lags. Using Eqs. 1 and 2 at lag 0, \({{\hat{\gamma }}}_{0}\) is the covariance of \((r_1, r_2,\ldots, r_n)\) with itself, and the ACVF of the Coronavirus instances is:

$$\begin{aligned} {{\hat{\gamma }}}_{0}& = \frac{1}{n-1} \sum _{t=1}^{n}(r_t - \bar{r})(r_t - \bar{r}) \\ {{\hat{\gamma }}}_{0}& = \frac{1}{n-1} \sum _{t=1}^{n}(r_t - \bar{r})^2 \end{aligned}$$

Likewise, letting \({{\hat{\rho }}}_{0}\) be the ACF of the OWID-COVID instances at lag 0, the correlation is then

$$\begin{aligned} {{\hat{\rho }}}_{0}&= \frac{\sum _{t=1}^{n} (r_t - \bar{r})(r_t - \bar{r})}{\sqrt{ \sum _{t=1}^{n}{(r_t -\bar{r})}^2 \sum _{t=1}^{n}{(r_t - \bar{r})}^2}}\\ {{\hat{\rho }}}_{0}&= \frac{\sum _{t=1}^{n} (r_t - \bar{r})(r_t - \bar{r})}{ \sum _{t=1}^{n}{(r_t -\bar{r})} \sum _{t=1}^{n}{(r_t - \bar{r})}} = 1 \end{aligned}$$

For OWID-COVID instances, an autocorrelation function (ACF) or partial autocorrelation function (PACF) can be employed to determine the lag values between any two features. PACF and ACF are essential characteristics of the stochastic process \(\{Y_t \vert t \in T\}\) (Analysis 2020). Let \(Y_t\) be the stationary time series and \(Y_{t-h}\) its lagged value at lag h during the Coronavirus outbreak. PACF estimates the degree of correlation between any two instances of the dataset \(Y_t\) and \(Y_{t-h}\) while ignoring the other time lags. The partial correlation between x and \(y_3\), conditioned on the variables \(y_1\) and \(y_2\), can be calculated as follows:

$$\begin{aligned} \frac{Cov(x,y_3\vert y_1, y_2)}{\sqrt{Var(x \vert y_1,y_2) Var(y_3 \vert y_1, y_2)}} \end{aligned}$$

where \(y_1, y_2,\) and \(y_3\) are the regressor variables and x is the response variable. The partial correlation between x and \(y_3\) describes their association after controlling for \(y_1\) and \(y_2\) and indicates how dependent on one another they are. The first-order partial auto-correlation is defined to be equal to the first-order auto-correlation. For lag 2, the PACF between two features is defined as follows:

$$\begin{aligned} \frac{Cov(y_t,y_{t-2} \vert y_{t-1})}{\sqrt{Var(y_t \vert y_{t-1})Var(y_{t-2} \vert y_{t-1})}} \end{aligned}$$
Fig. 4: Correlated values of (a) new cases and (b) new deaths

Figure 4 shows the autocorrelation between the correlated and lag values of the daily reported incidences and fatalities. Both Fig. 4a and b display a strong correlation between the current instance and the previous thirty instances, so thirty can be used as the lag value for further analysis.
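The lag analysis of Fig. 4 can be reproduced with the statsmodels plotting helpers, as in the sketch below; the country and column choice are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

df = pd.read_csv("owid-covid-data.csv", parse_dates=["date"])
series = (df[df["location"] == "India"]
          .set_index("date")["new_cases"]
          .interpolate())

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(series, lags=40, ax=axes[0])    # correlation stays strong out to ~30 lags
plot_pacf(series, lags=40, ax=axes[1])   # partial correlation at each lag
plt.tight_layout()
plt.show()
```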

3.3.1 Stationary test

A time series is referred to as stationary if it has no trend or seasonal effect; summary statistics such as the mean and variance then remain constant over time, making the series easier to predict. A time series \(\{X_t \vert t \in T\}\) is said to be strictly (or strongly) stationary if the distributions of its instances \((X_{t_1},\ldots, X_{t_n})\) and \((X_{t_1 + s},\ldots, X_{t_n +s})\) are the same \(\forall n\) and \(t_1, t_2,\ldots,t_n, s \in T\). A time series \(\{X_t \vert t \in T\}\) is said to be weakly stationary (covariance stationary, or second-order stationary) if: the mean function of the time series is constant and finite, \(\mu _X (t) = \mu < \infty\), \(\forall t\in T\); the variance function is constant and finite, \(Var(X_t) < \infty\), \(\forall t \in T\); and the ACVF and ACF depend only on the lag value. The ACVF and ACF are then given as follows:

$$\begin{aligned} \gamma _{t, t+ \alpha }& = Cov [X_t, X_{t+\alpha }]= \gamma _{\alpha }, \forall t,t+\alpha , \alpha \in T \\ \rho _{t, t+ \alpha }& = Corr [X_t, X_{t+\alpha }] = \rho _{\alpha }, \forall t, t+\alpha , \alpha \in T \end{aligned}$$

An Augmented Dickey-Fuller (ADF) test is performed under the null hypothesis that a unit root exists in a sampled time series (Cheung and Lai 1995). It is employed to determine whether or not a time series sample is a random walk.

$$\begin{aligned} \Delta y_{t } = y_{t } - y_{t-1} = \alpha +\beta t+\gamma y_{t-1} + \epsilon _{t} \end{aligned}$$

where \(\alpha\) is a constant and \(\beta\) is the time trend coefficient. \(y_{t-1}\) represents the value of the time series at lag 1, \(\epsilon _t\) is the error term, and \(\gamma = 0\) indicates a random walk (non-stationary series). An ADF test incorporates higher-order autoregressive terms of the form \(\Delta {\mathcal {Y}}_{t-p}\), where \(p \ge 1\).

$$\begin{aligned} \begin{aligned} \Delta y_{t }&=\alpha +\beta t+\gamma y_{t-1} +\delta _{1}\Delta {\mathcal {Y}}_{t-1} + \delta _{2}\Delta {\mathcal {Y}}_{t-2} + \cdots\\&\quad + \delta _{p}\Delta {\mathcal {Y}}_{t-p} + \epsilon _{t} \end{aligned} \end{aligned}$$

At time \((t-1)\), \(\Delta {\mathcal {Y}}_{t-1}\) is the first-order difference of the series, and \((\delta _1, \delta _2,\ldots, \delta _p)\) are the coefficients of \((\Delta {\mathcal {Y}}_{1}, \Delta {\mathcal {Y}}_2,\ldots, \Delta {\mathcal {Y}}_p)\). An ADF test involves testing a hypothesis (the null and alternate hypotheses) by computing the test statistic and reporting the p-value. The p-value, or probability, measures how likely the null hypothesis is to hold. If the p-value of the ADF test is less than or equal to 0.05, the null hypothesis is rejected and the series is deemed stationary. When the ADF test is applied to the OWID-COVID dataset, the p-value is greater than 0.05, indicating a non-stationary time series. First-order differencing is employed to make the series stationary, which results in a p-value below 0.05.
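The ADF test and the first-order differencing step can be applied as in the following sketch, assuming the same daily case series used elsewhere in this section.

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

def is_stationary(y: pd.Series, alpha: float = 0.05) -> bool:
    """ADF test: reject the unit-root null hypothesis when p-value <= alpha."""
    stat, p_value, *_ = adfuller(y.dropna())
    print(f"ADF statistic = {stat:.3f}, p-value = {p_value:.4f}")
    return p_value <= alpha

df = pd.read_csv("owid-covid-data.csv", parse_dates=["date"])
cases = (df[df["location"] == "India"]
         .set_index("date")["new_cases"]
         .interpolate())

if not is_stationary(cases):              # raw series: typically p > 0.05
    cases_diff = cases.diff().dropna()    # first-order differencing
    is_stationary(cases_diff)             # differenced series: p <= 0.05
```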

3.3.2 SEIR model

The time-dependent susceptible-exposed-infected-recovered (SEIR) model (Ghostine et al. 2021) is an epidemiological model that separates the overall population into four categories to predict epidemic outbreaks. This simple model can be used to forecast the recorded incidences and fatalities due to the spread of the virus. Susceptible \(({\mathcal {S}})\), exposed (E), infected \(({\mathcal {I}})\), and recovered \(({\mathcal {R}})\) are the four compartments of the time-dependent mathematical model. An individual in the infected class can infect others.

Let \({\mathcal {S}}(t)\), E(t), \({\mathcal {I}}(t)\), and \({\mathcal {R}}(t)\) be the fraction of the population for four groups at a time t (Kanpur 2020).

$$\begin{aligned} {\mathcal {S}}(t) + E(t) + {\mathcal {I}}(t) + {\mathcal {R}}(t) = 1 \end{aligned}$$
(3)

On differentiating the above equation with respect to time t, we get,

$$\begin{aligned} \frac{d{\mathcal {S}}}{dt} + \frac{dE}{dt} + \frac{d{\mathcal {I}}}{dt} + \frac{d{\mathcal {R}}}{dt} = 0 \end{aligned}$$
(4)

The fraction of infected individuals in a single day is:

$$\begin{aligned} \frac{d{\mathcal {S}}}{dt} = -\Psi {\mathcal {S}}{\mathcal {I}} \end{aligned}$$
(5)

The interaction between infected and susceptible individuals is represented by \(\Psi\). The rate of recovery is directly proportional to the number of infected individuals:

$$\begin{aligned} \frac{d{\mathcal {R}}}{dt} = \varkappa {\mathcal {I}} \end{aligned}$$
(6)

Here \(\varkappa\) is the proportional constant. From Eq. 4, we get,

$$\begin{aligned} \frac{d{\mathcal {S}}}{dt} + \frac{dE}{dt} + \frac{d{\mathcal {I}}}{dt} + \frac{d{\mathcal {R}}}{dt} = 0 \\ \frac{d{\mathcal {I}}}{dt} = 0 \end{aligned}$$

Since there is no spreading to others at time t, \({\mathcal {I}}(t)\) becomes zero. Putting the above values in Eq. 4, we get,

$$\begin{aligned}{} & {} -\Psi {\mathcal {S}}{\mathcal {I}} + \frac{dE}{dt} + \varphi E + 0 = 0\nonumber \\{} & {} \frac{dE}{dt} = \Psi {\mathcal {S}}{\mathcal {I}} - \varphi E \end{aligned}$$
(7)

where \(\varphi\) emphasizes the association between exposed and infected individuals. Putting all these values in Eq. 4, we get,

$$\begin{aligned} -\Psi {\mathcal {S}}{\mathcal {I}} + \Psi {\mathcal {S}}{\mathcal {I}} - \varphi E + \frac{d{\mathcal {I}}}{dt} + \varkappa {\mathcal {I}} = 0 \nonumber \\ \frac{d{\mathcal {I}}}{dt} = \varphi E - \varkappa {\mathcal {I}} \end{aligned}$$
(8)

The Eqs. 5, 6, 7, and 8 depict the rate of change of susceptible individuals, recovered individuals, exposed individuals, and infected individuals in the overall population.
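A minimal numerical integration of Eqs. 5-8 is sketched below; the rate values and initial fractions are illustrative assumptions rather than fitted estimates.

```python
import numpy as np
from scipy.integrate import odeint

def seir(state, t, psi, phi, kappa):
    """Right-hand side of Eqs. 5-8: psi is the transmission coefficient,
    phi the exposed-to-infected rate, kappa the recovery constant."""
    S, E, I, R = state
    dS = -psi * S * I
    dE = psi * S * I - phi * E
    dI = phi * E - kappa * I
    dR = kappa * I
    return [dS, dE, dI, dR]

# Assumed per-day rates and initial fractions (sum to 1), for illustration only.
psi, phi, kappa = 0.9, 1 / 5.2, 1 / 10
state0 = [0.999, 0.001, 0.0, 0.0]
t = np.linspace(0, 180, 181)                     # simulate 180 days

S, E, I, R = odeint(seir, state0, t, args=(psi, phi, kappa)).T
```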

Fig. 5: Transmission flow of the SEIR-HDQV model

3.3.3 SEIR-HDQV model

We extend the SEIR epidemiological model to nine compartments to simulate the outbreak (Vrabac et al. 2021). Figure 5 illustrates the SEIR-HDQV model's transmission flow of individuals through the pathogenic virus. The stages in Fig. 5 capture an infected case's entire life cycle: prior to infection, throughout infection, and after discharge, i.e., either recovered or deceased. Consequently, every stage in this model describes the behavior of a specific sub-population on a given day at a given time. Let \({\mathcal {S}}(t)\), \({\mathcal {V}}(t)\), E(t), \({\mathcal {I}}^{sym}(t)\), \({\mathcal {I}}^{asym}(t)\), \({\mathcal {Q}}(t)\), \({\mathcal {H}}(t)\), \({\mathcal {R}}(t)\), and \({\mathcal {D}}(t)\) denote the number of susceptible (unvaccinated), susceptible vaccinated, exposed, symptomatic, asymptomatic, quarantined, hospitalized, recovered, and deceased individuals at a time t, respectively. We describe the overall population, represented by \({\mathcal {N}}\), as \({\mathcal {N}}\) = \({\mathcal {S}}(t)\) + E(t) + \({\mathcal {I}}^{sym}(t) + {\mathcal {I}}^{asym}(t)\) + \({\mathcal {Q}}(t) + {\mathcal {H}}(t) + {\mathcal {R}}(t) + {\mathcal {V}}(t) + {\mathcal {D}}(t)\), based on the state definitions above, at a time t. The nonlinear differential equations below follow the transmission model from Fig. 5:

$$\begin{aligned} \frac{d{\mathcal {S}}}{dt}&= -\frac{\Theta }{{\mathcal {N}}} [{\mathcal {I}}^{sym}(t) + {\mathcal {I}}^{asym} (t)]{\mathcal {S}}(t) - \vartheta {\mathcal {S}}(t) \\ \frac{dE}{dt}&=\frac{\Theta }{{\mathcal {N}}} [{\mathcal {I}}^{sym}(t) + {\mathcal {I}}^{asym}(t)]{\mathcal {S}}(t) + \varsigma \Theta [{\mathcal {I}}^{sym}(t) \\&\quad + {\mathcal {I}}^{asym}(t)]{\mathcal {V}}(t) - \delta \xi E(t) - (1-\delta )\lambda E(t) \\ \frac{d{\mathcal {I}}^{sym}}{dt}&= \delta \xi E(t) - \eta {\mathcal {I}}^{sym}(t) - \kappa {\mathcal {I}}^{sym}(t)\\ \frac{d{\mathcal {I}}^{asym}}{dt}&= (1- \delta ) \lambda E(t) - \phi {\mathcal {I}}^{asym}(t)\\ \frac{d{\mathcal {Q}}}{dt}&= \phi {\mathcal {I}}^{asym}(t) + \kappa {\mathcal {I}}^{sym}(t) - \omega {\mathcal {Q}}(t) - \Omega {\mathcal {Q}}(t)\\ \frac{d{\mathcal {H}}}{dt}&= \eta {\mathcal {I}}^{sym}(t) - \tau \rho {\mathcal {H}}(t) - (1-\tau ) \Pi {\mathcal {H}}(t)\\ \frac{d{\mathcal {R}}}{dt}&= (1- \tau )\Pi {\mathcal {H}}(t) - \Omega {\mathcal {R}}(t) + \omega {\mathcal {Q}}(t)\\ \frac{d{\mathcal {D}}}{dt}&= \tau \rho {\mathcal {H}}(t)\\ \frac{d{\mathcal {V}}}{dt}&= \vartheta {\mathcal {S}}(t) - \varsigma \Theta [{\mathcal {I}}^{sym} + {\mathcal {I}}^{asym}]{\mathcal {V}}(t) - \Omega {\mathcal {V}}(t) \end{aligned}$$

with non-negative initial conditions \({\mathcal {S}}(0)\ge 0\), \(E(0)\ge 0\), \({\mathcal {I}}^{sym}(0)\ge 0\), \({\mathcal {I}}^{asym}(0)\ge 0\), \({\mathcal {Q}}(0)\ge 0\), \({\mathcal {H}}(0)\ge 0\), \({\mathcal {R}}(0)\ge 0\), \({\mathcal {D}}(0)\ge 0\), and \({\mathcal {V}}(0)\ge 0\). The coefficients are as follows: \(\vartheta\) is the rate at which susceptible individuals become vaccinated; \(\Theta\) is the contact rate at which susceptible individuals become exposed; \(\varsigma\) is the rate at which vaccinated individuals become exposed; \(\delta\) is the fraction of exposed individuals who become symptomatic at rate \(\xi\), while \((1-\delta )\) is the fraction who become asymptomatic at rate \(\lambda\); \(\eta\) is the rate at which symptomatic infected individuals are hospitalized; \(\phi\) is the rate at which asymptomatic infected individuals enter quarantine; \(\kappa\) is the rate at which symptomatic individuals enter quarantine; \(\omega\) is the rate at which quarantined individuals recover; \(\Omega\) is the natural death rate; \(\tau\) \((0\le \tau \le 1)\) is the fraction of hospitalized individuals who die at rate \(\rho\); and \((1-\tau )\) is the fraction of hospitalized individuals who recover at rate \(\Pi\).
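The nine-compartment system can be simulated numerically as in the sketch below; the coefficient values and initial counts are illustrative assumptions, not calibrated parameters, and the population scale of roughly 1000 susceptible individuals mirrors the setting discussed in Sect. 5.

```python
import numpy as np
from scipy.integrate import solve_ivp

def seir_hdqv(t, y, p):
    """Right-hand side of the SEIR-HDQV equations above; p maps each
    coefficient name to its (assumed) per-day rate."""
    S, E, Isym, Iasym, Q, H, R, D, V = y
    N = y.sum()                                # total population
    infectious = Isym + Iasym
    dS = -p["Theta"] / N * infectious * S - p["vartheta"] * S
    dE = (p["Theta"] / N * infectious * S
          + p["varsigma"] * p["Theta"] * infectious * V
          - p["delta"] * p["xi"] * E - (1 - p["delta"]) * p["lambda"] * E)
    dIsym = p["delta"] * p["xi"] * E - p["eta"] * Isym - p["kappa"] * Isym
    dIasym = (1 - p["delta"]) * p["lambda"] * E - p["phi"] * Iasym
    dQ = p["phi"] * Iasym + p["kappa"] * Isym - p["omega"] * Q - p["Omega"] * Q
    dH = p["eta"] * Isym - p["tau"] * p["rho"] * H - (1 - p["tau"]) * p["Pi"] * H
    dR = (1 - p["tau"]) * p["Pi"] * H - p["Omega"] * R + p["omega"] * Q
    dD = p["tau"] * p["rho"] * H
    dV = (p["vartheta"] * S
          - p["varsigma"] * p["Theta"] * infectious * V - p["Omega"] * V)
    return [dS, dE, dIsym, dIasym, dQ, dH, dR, dD, dV]

# Assumed coefficients for illustration only.
params = {
    "Theta": 0.60, "vartheta": 0.01, "varsigma": 0.10, "delta": 0.60,
    "xi": 0.20, "lambda": 0.20, "eta": 0.05, "kappa": 0.10, "phi": 0.15,
    "omega": 0.07, "Omega": 2e-5, "tau": 0.10, "rho": 0.05, "Pi": 0.08,
}

y0 = [990, 5, 3, 2, 0, 0, 0, 0, 0]               # S, E, Isym, Iasym, Q, H, R, D, V
sol = solve_ivp(seir_hdqv, (0, 120), y0, args=(params,), t_eval=np.arange(121))
S, E, Isym, Iasym, Q, H, R, D, V = sol.y
```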

3.3.4 Vector auto regression (VAR)

A VAR model is a technique for modeling dynamics among a set of k variables (Brandt and Williams 2007), also called endogenous variables, over time. The variables are organized in a vector \(Y_t\) of length k. This method focuses on the dynamics of multiple time series and often employs multivariate and multiple regression techniques. When two or more time series are interdependent and the relationships among them are bi-directional, the VAR approach can serve as a prediction framework. The cumulative incidences and fatalities in the selected countries can be estimated using this model. COVID-19 is characterized by an increase in new incidences that is positively related to fatalities: the mortality rate rises in tandem with the number of new cases. A more effective forecasting paradigm can therefore be produced using the VAR process, which integrates both the number of newly diagnosed incidences and fatalities into a single framework. A \(p^{th}\)-order VAR contains lags over the most recent p periods; VAR(p) is an abbreviation for a \(p^{th}\)-order VAR, also expressed as a VAR with p lags. Let \(Y_t = \begin{bmatrix} Y_{t,1} \\ Y_{t,2} \\ \vdots \\ Y_{t,k} \end{bmatrix}\) represent the vector-valued time series consisting of k individual time series. We assume that \(Y_t\) is stationary, which means that the cross-covariance function \(Cov(Y_{t, i}, Y_{r,j})\) depends only on \((r-t)\). The \(p^{th}\)-order VAR model can be stated as follows:

$$\begin{aligned} Y_t= \beta _1 Y_{t-1} + \beta _2 Y_{t-2} +\cdots+\beta _p Y_{t-p} +\epsilon _t \end{aligned}$$

where the terms \((\beta _1, \beta _2,\ldots, \beta _p)\) are the coefficient matrices of the lags of Y up to order p, and \(\epsilon _t\) is the error term of dimension k. For each i of \((i = 0, 1,\ldots, k)\), \(\beta _i\) is a time-invariant matrix of dimension \((k \times k)\). The error terms \(\epsilon _t\) must satisfy three conditions:

\(E(\epsilon _t) = 0\): the mean of each error term is zero. \(E(\epsilon _t \epsilon _t') = {\mathscr {K}}\): the covariance matrix of the error terms is a positive-semi-definite \(k \times k\) matrix denoted by \({\mathscr {K}}\). \(E(\epsilon _t \epsilon '_{t-k}) = 0\): the error terms have no cross-temporal (serial) correlation for any non-zero k.

A time series vector can be defined using the VAR(p) technique for short-term forecasting:

$$\begin{aligned} \begin{bmatrix} Y_t \\ G_t \\ \end{bmatrix} = \beta _1 \begin{bmatrix} Y_{t-1} \\ G_{t-1} \\ \end{bmatrix} + \beta _2 \begin{bmatrix} Y_{t-2}\\ G_{t-2}\\ \end{bmatrix} +\hdots + \beta _{p} \begin{bmatrix} Y_{t-p}\\ G_{t-p}\\ \end{bmatrix} + \begin{bmatrix} \epsilon _{t,1}\\ \epsilon _{t,2}\\ \end{bmatrix} \end{aligned}$$

In this scenario, the numbers of new incidences and fatalities are listed as \(Y_t\) and \(G_t\), respectively. Maximum likelihood estimation is employed to estimate the coefficient matrices \(\beta _{j} = \begin{bmatrix} \beta _{11} & \beta _{12} \\ \beta _{21} & \beta _{22} \end{bmatrix}\).
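A bivariate VAR over new cases and new deaths, in the spirit of the formulation above, can be fitted with statsmodels as sketched below; the country, the lag bound, and the differencing choice are illustrative assumptions.

```python
import pandas as pd
from statsmodels.tsa.api import VAR

df = pd.read_csv("owid-covid-data.csv", parse_dates=["date"])
brazil = (df[df["location"] == "Brazil"]
          .set_index("date")[["new_cases", "new_deaths"]]
          .interpolate())

data = brazil.diff().dropna()                  # work on the stationary differences
split = int(len(data) * 0.75)
train, test = data.iloc[:split], data.iloc[split:]

model = VAR(train)
fitted = model.fit(maxlags=30, ic="aic")       # lag order p chosen by AIC
next_day = fitted.forecast(train.values[-fitted.k_ar:], steps=1)
print(next_day)   # one-step-ahead (next 24 h) differenced cases and deaths
```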

3.3.5 Long short term memory (LSTM)

The most valuable property of the LSTM model is that it maintains an internal memory cell state over the entire input sequence in order to capture temporal correlations. LSTM is a form of artificial neural network (ANN) that is particularly good at solving regression and classification problems. It is a variant of the recurrent neural network (RNN) that can handle long-term dependence, as represented in Fig. 6. The LSTM network is an enhanced version of the RNN (a sequential network) that allows information to persist. As seen in Fig. 6, LSTM cells consist of three sections called gates: the first is known as the forget gate, the second as the input gate, and the third as the output gate (Graves et al. 2005). These three gates pass information into and out of the memory cell, and the memory cell stores values across arbitrarily long time intervals. In a time-series domain such as estimating Coronavirus dissemination, for time \(t = 1\) to N and a given input series \(y = (y_1, y_2,\ldots, y_N)\), the network produces an output series \(h = (h_1, h_2,\ldots, h_N)\), expressed mathematically as in Hochreiter and Schmidhuber (1997).

Fig. 6: Architecture of LSTM

The LSTM cell has three gates of the same shape, which are determined as follows:

$$\begin{aligned} f_t&= \sigma _g (W_{fy} * x_t + V_f* h_{t-1}+ k_f)\\ i_t&= \sigma _g (W_{iy} * x_t + V_i * h_{t-1} + k_i)\\ o_t&= \sigma _g (W_{oy} * x_t + V_o * h_{t-1} + k_o) \end{aligned}$$

The three gates above use a sigmoid activation function, which produces smooth values in the interval between 0 and 1; tanh is the other activation function and has range [-1, 1]. The next step is to transmit new information to the cell state via the input feature x at time t and the hidden state at time \((t-1)\):

$$\begin{aligned} c'_t&= \tanh (W_{cy} * x_t + V_c * h_{t-1} + k_c)\\ c_t&= f_t * c_{t-1} + i_t * c'_t \end{aligned}$$

The current cell output \(h_t\) of the LSTM cell is defined by:

$$\begin{aligned} h_t = o_t * \tanh (c_t) \end{aligned}$$

where \(f_t\), \(i_t\), \(o_t\), \(c_t\), \(h_t\), \(\sigma _g\), \(x_t\), and \(h_{t-1}\) are the forget gate, input gate, output gate, memory cell, hidden state, sigmoid function, input at the current timestamp, and hidden state of the previous timestamp, respectively. The \(c'_t\) is the candidate cell state internal to the LSTM and is used to generate \(h_t\) and \(c_t\). The weights \(W_{fy}, W_{iy}, W_{oy}, W_{cy}\) are associated with the inputs, and \(V_f, V_i, V_o, V_c\) are the weight matrices for the hidden state. The \(k_f, k_i, k_o, k_c\) are the bias terms of the model. The weight matrices and biases are not time-dependent. In this case, LSTM is implemented to detect the dissemination of a pathogenic virus while accounting for uncertainties. The parameter tuning process of the LSTM architecture is discussed in detail in Sect. 5.
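A minimal Keras sketch of such an LSTM forecaster is shown below, using the activation, optimizer, loss, and early-stopping settings reported in Sect. 5; the layer width and the placeholder training arrays are assumptions for illustration, and the real inputs would be the lag windows built in the pre-processing step.

```python
import numpy as np
import tensorflow as tf

def build_lstm(n_lags: int, n_features: int = 1) -> tf.keras.Model:
    """Single-layer LSTM regressor for one-step-ahead forecasting."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_lags, n_features)),
        tf.keras.layers.LSTM(64, activation="relu"),   # width is an assumption
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss="mse", metrics=["mae"])
    return model

n_lags = 30
X_train = np.random.rand(500, n_lags, 1)       # placeholder lag windows
y_train = np.random.rand(500, 1)               # placeholder targets

model = build_lstm(n_lags)
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1,
          callbacks=[tf.keras.callbacks.EarlyStopping(
              patience=5, restore_best_weights=True)])
```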

Table 1 Reported incidences for the next 24 h based on performance metrics for different countries

4 Evaluation metrics

The root mean square error (RMSE) and mean absolute error (MAE) values are employed to evaluate the effectiveness of the models. Performance is optimal for estimated models with the lowest RMSE and MAE. The mathematical formulation of this performance evaluator is as follows:

$$\begin{aligned} RMSE&= \sqrt{ \frac{1}{n}{\sum _{i=1}^n (Y_{pre} - Y_{act})^2}}\\ MAE&= \frac{1}{n}{\sum _{i=1}^n abs(Y_{pre} - Y_{act})} \end{aligned}$$

where n is the number of observations, \(Y_{pre}\) represents the predicted values, while \(Y_{act}\) represents the actual values.

The mean absolute percentage error (MAPE) has been computed by adding percentage errors without regard to sign. It expresses the error as a percentage. Furthermore, the problem of positive and negative inaccuracies canceling out has been avoided because absolute percentage errors are employed.

$$\begin{aligned} MAPE = \frac{100\%}{n}\sum _{t= 1}^{n} \left\vert {\frac{O_t - F_t}{O_t} } \right\vert \end{aligned}$$

where n is the number of observations. The actual and forecasted values of the models are \(O_t\) and \(F_t\), respectively.
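The three metrics can be computed directly, as in the short sketch below; the numbers in the usage example are arbitrary.

```python
import numpy as np

def rmse(actual: np.ndarray, predicted: np.ndarray) -> float:
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

def mae(actual: np.ndarray, predicted: np.ndarray) -> float:
    return float(np.mean(np.abs(predicted - actual)))

def mape(actual: np.ndarray, predicted: np.ndarray) -> float:
    """Mean absolute percentage error; requires non-zero actual values."""
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

actual = np.array([100.0, 120.0, 90.0])
predicted = np.array([110.0, 115.0, 95.0])
print(rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```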

5 Results and discussion

The experiments for this paper were performed using Google Colaboratory, which uses Python 3.7 and offers a single GPU cluster with an NVIDIA K80 GPU, 12 GB of RAM, and a clock speed of 0.82 GHz. The estimation techniques for the outbreak are applied using the methodology discussed in Sect. 3. Estimating daily reported cases and fatalities could assist in real-time strategic planning. The OWID-COVID dataset, which contains information on 163 nations, is utilized to carry out the experiments (Cameron Appel and Beltekian 2019). For this study, the top eight nations have been selected based on their Worldometer rankings for maximum fatalities (Worldometer 2020). It is an open-source repository containing the most recent information, such as newly reported cases, mortality, vaccination, hospitalization, and other attributes of COVID-19 vaccination progress worldwide. It has approximately 0.3 million instances with 67 different features. Data processing must be performed to satisfy data integrity requirements and transform nominal data into numeric data. Further, the impact of climatic factors on the dissemination of pathogenic viruses has also been studied through a case study in India. The study utilizes meteorological data on temperature and relative humidity gathered from the Central Pollution Control Board (CPCB) (Room 2022).

Fig. 7: The tensorboard log file output for hyperparameter tuning through grid search

This section addresses the individual impacts of COVID-19 along with the consequences of vaccination once it began. All forecasting models (LSTM, VAR, and SEIR-HDQV) learn from historical data to predict recorded infected incidences and fatalities for the next 24 h. The dataset has been split in a ratio of 75%:25% for training and testing purposes. Among all the models, LSTM requires an optimal set of hyperparameters for effective forecasting. Extensive experiments were run to find an optimal combination of hyperparameters using grid search; the outcome of this experiment can be seen in Fig. 7. The combination of parameters that led to the lowest MAE on the training dataset is highlighted by the green line. The final model was trained on the training data using these hyperparameters (epochs: 50 with early stopping; activation function: relu; optimizer: adam with learning rate 0.01; loss function: MSE).

Fig. 8: Forecasting of new cases with (a) VAR, (b) LSTM, (c) SEIR-HDQV model

Table 2 Recorded fatalities for the next 24 h based on performance metrics for different countries

The outcomes of the experiments in terms of error metrics are presented in Tables 1 and 2 for recorded confirmed and mortality cases, respectively. The forecasts achieved by the models for new incidences in Brazil are illustrated in Fig. 8; the visualization indicates that the LSTM-forecasted values are relatively close to the actual values. Figure 8c shows that, for the enhanced version of SEIR (i.e., SEIR-HDQV), infectious, cured, and fatal individuals increase considerably over 30 days of exposure within a population \({\mathcal {N}}\) starting from 1000 susceptible individuals, while susceptible individuals diminish.

Fig. 9: Different models' error metrics on the OWID-COVID dataset: (a) RMSE for new cases, (b) RMSE for new deaths, (c) MAE for new cases, (d) MAE for new deaths

Table 1 analyzes the effectiveness of the models on the reported and recorded incidences in relation to the performance metrics, and Table 2 presents the mortality outcomes for the selected nations. According to the analysis, the RMSE and MAE values for confirmed new incidences and fatalities in Russia are the smallest, whereas the values for the United States are comparatively higher across all models. The LSTM model achieved the lowest RMSE and MAE values for cumulative reported and deceased cases.

Fig. 10: Models' error metrics for the OWID-COVID dataset using MAPE: (a) new cases and (b) new deaths

Figures 9 and 10 depict box plots for the visual interpretation of the values in Tables 1 and 2. For the new instances in Fig. 9a, the RMSE values range from 0 to 0.3 million; the corresponding box plot median values for the VAR, LSTM, and SEIR-HDQV models are 18,305, 9707, and 28,909, respectively. The RMSE values for the fatalities range from 0 to 1600, with median values of 250.635, 92.888, and 139.964, respectively, as seen in Fig. 9b. In both scenarios, the LSTM medians are the lowest for recorded incidences and fatalities over the next 24 h. The MAE values for reported incidences and fatalities display a similar pattern in Fig. 9c and d. Using MAPE, the performances of the different models can be compared on the same scale to identify the most efficient one. The median MAPE values for reported incidences and fatalities range from 0.4 to 3.0 and 0.2 to 2.4, respectively, as visualized in Fig. 10a and b. The median MAPE values for LSTM are the lowest among all models, i.e., 1.026 for reported incidences and 1.022 for fatalities. These outcomes demonstrate that LSTM outperforms the other models for forecasting purposes.

Fig. 11: (a) COVID-19 deaths worldwide on a daily basis and (b) COVID-19 cases increase since inception

Impact of Vaccination: Fig. 11 illustrates a progressive improvement in epidemic management and a leveling of the curves of daily reported and fatal cases. Due to multiple waves occurring at intervals, the number of incidences in various regions of the globe rose gradually. Large-scale vaccination programs were launched when the COVID-19 vaccine was released publicly at the beginning of 2021. Even so, vaccine availability issues, limited vaccine reach among large populations, and the time required for vaccines to confer immunity all slowed the early reduction in fatalities. Figure 11a and b show that there have been fewer fatalities despite an increase in the incidence of Coronavirus variants such as Omicron, owing to the development of antiviral vaccination. The peak of incidences, which took place between days 700 and 800, can be seen in Fig. 11b. However, as shown in Fig. 11a, the number of fatalities did not increase during this period, which underscores the vaccine's role in curbing the severity of the disease.

Different studies have been carried out to predict confirmed incidences and mortality. Xu et al. (2022) applied LSTM, CNN, and CNN-LSTM deep learning techniques to predict daily confirmed incidences and mortality in India, Russia, and Brazil for the span from July 14 to July 31, 2021. For these three countries, the MAE of the LSTM predictions is 8949, 1198, and 15275, respectively, while the CNN-LSTM model's MAE for these nations is 3214, 572, and 4321, respectively. Additionally, based on the RMSE and MAPE of various models from June 21 to July 10, 2021, Verma et al. (2022) predicted the Coronavirus incidences for the next (7, 14, 21) days; the models' RMSE and MAPE values were (4067.74, 4385.09, 4431.91) and (7.95, 8.1, 8.75), respectively. Based on extensive experiments, the proposed model shows minimal errors with respect to MAE, RMSE, and MAPE.

The computational complexity analysis measures how long each studied model takes to process its input. Among the studied models, the training time complexity of VAR is \(O(m*n^3)\), where m is the number of features and n is the number of training instances; it therefore works better for small datasets, and the training time grows rapidly as the data size increases. For LSTM, the complexity is O(w) per time step, where w represents the number of weights. Further, the time complexity of SEIR-HDQV is \(O(m*n)\), where m and n represent the number of features and the number of training instances, respectively.

The proposed model includes features such as hospitalization, death, quarantine, and vaccination, which have been ignored by other models in this domain. Each feature has been assigned a weight, and this information has been incorporated into the model. Further, meteorological data have also been analyzed to visualize the effect of weather on virus dissemination.

Fig. 12: Temperature and humidity analysis for reported incidences in the states of (a) Kerala and (b) Delhi

The models' shortcomings become apparent when a pathogenic virus mutates, since the extent of the epidemic then becomes highly dynamic and vulnerable to unanticipated occurrences such as lockdowns, vaccination circulation, and mutations. As a result of these drastic changes, model analysis and projected future outcomes may become less reliable. The compartmental SEIR-HDQV model does not consider other elements that might affect the transmission of Coronavirus, such as environment, transportation limitations, and underlying medical problems in communities.

Fig. 13: Temperature and humidity analysis for reported incidences in the states of (a) Gujarat and (b) Bihar

Temperature and Humidity Analysis:

The pathogenic virus's morbidity may be influenced by environmental factors such as temperature, relative humidity, and air pollution (Ma et al. 2020; Tosepu et al. 2020; Xie and Zhu 2020; Shrivastav and Jha 2021). The climates of various countries differ widely, and the infectious virus mutates differently under various environmental conditions. Since different geographical locations within a country may experience different temperatures and humidity levels, there is no single global index for a nation's temperature and humidity. This study therefore analyzes Coronavirus dissemination at diverse temperatures and humidity levels during the Indian epidemic.

Geographic locations in India, namely Kerala, Delhi, Gujarat, and Bihar, are selected to cover a wide range of temperatures and humidity levels; these states lie in distinct regions of the country. The hazardous virus's transmission has been documented for each day from March 14, 2020, to July 12, 2020, in the selected locations. Figures 12 and 13 show the prevalence of the epidemic in the four states as ambient temperature and relative humidity vary. The confirmed cases are stable at 28.5 °C and relative humidity above 50%, as shown in Fig. 12a. Kerala's reported incidences increased with a drop in temperature and a reduction in humidity, whereas the confirmed incidences peaked when the temperature remained consistent while the humidity varied. From March 14 to April 3, 2020, Delhi's recorded cases remained consistent at 20-25 °C and high humidity; however, the reported incidences increased as humidity and temperature rose, as shown in Fig. 12b. Figure 13a and b show a similar analysis for Gujarat and Bihar (states in western and eastern India, respectively).

6 Conclusion

This paper analyzes the performance of mathematical learning models, including SEIR-HDQV, VAR, and LSTM, for forecasting incidences and fatalities over the ensuing days. The error metrics RMSE and MAE show a higher variance across countries, owing to differences in population. A comparison of all models shows that LSTM has the lowest MAPE score, demonstrating that it outperforms the others. The global impact of pathogenic virus vaccinations indicates that, as vaccine doses are delivered to individuals throughout the epidemic, the mortality rate drops even when reported incidences increase considerably. Furthermore, the effect of ambient temperature and relative humidity on reported incidences in India demonstrates that climatic factors influence pathogenic virus transmission. Similar models may also be employed to forecast the number of hospitalized patients, quarantined patients, and so forth.

Future Directions: Future studies can examine the drop in the death rate during Coronavirus waves in nations where most of the population is vaccinated versus those where people were not adequately vaccinated. Further, researchers may focus on the vaccine’s lasting efficacy, the creation of booster shots, and its effect on recently developed variants. Additionally, researchers may study Coronavirus fatality and morbidity rates affected by levels of air pollution in distinct geographical locations. Furthermore, it might also be possible to create effective interventions in the future based on the enduring psychological consequences of the pandemic on individuals and communities.