Introduction

Throughout history, contagious diseases have claimed many lives and created crises that took a long time to overcome. Smallpox has killed roughly 500 million people worldwide [1]. In 1918, the Spanish influenza epidemic killed an estimated 17–100 million individuals [2]. Several epidemics have emerged in the last 20 years, including severe acute respiratory syndrome coronavirus (SARS-CoV) in 2002–2003, H1N1 influenza in 2009, and Middle East respiratory syndrome coronavirus (MERS-CoV) in 2015. The outbreak of the novel coronavirus that began in December 2019 in the city of Wuhan in central China killed hundreds and infected thousands of individuals within the first few days of the pandemic. The human coronaviruses that originated from animal reservoirs in the twenty-first century have led to global epidemics with alarming morbidity and mortality. These viruses are named coronaviruses because of the spike-like projections visible on their outer surface under the electron microscope. They are single-stranded RNA viruses belonging to the Coronavirinae subfamily of the Coronaviridae family. The viruses comprise four genera: α, β, γ and δ. Mammals are usually infected by α- and β-CoV, while birds are infected by γ- and δ-CoV. HCoV-229E and HCoV-NL63 (alpha coronaviruses) and HCoV-HKU1 and HCoV-OC43 (beta coronaviruses) have low pathogenicity and cause mild respiratory illness similar to the common cold, whereas SARS-CoV and MERS-CoV (both β-CoVs) cause severe and potentially fatal respiratory infections [3].

In December 2019, local hospitals in the city of Wuhan in central China reported patients diagnosed with an unidentified pneumonia [4]. All of these patients were connected to the Huanan Seafood Market, where a variety of live species are sold. The symptoms of these cases were similar to the clinical characteristics of viral pneumonia. On 7 January 2020, after analyzing samples gathered from throat swabs, experts from the Centers for Disease Control (CDC) declared the disease to be novel coronavirus pneumonia (NCP) [5]. Later, the International Committee on Taxonomy of Viruses (ICTV) named the novel virus SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2) [6, 7]. On 11 February 2020, the World Health Organization (WHO) named the disease COVID-19 [8]. SARS-CoV-2, the virus that causes COVID-19, belongs to the β-CoV genus. The genome of SARS-CoV-2 shows 79.5% similarity to that of SARS-CoV, and it retains eight of the SARS-CoV-binding residues [9]. As the SARS-CoV-2 genome sequence shows 96.2% similarity to bat coronavirus RaTG13, the bat coronavirus and human SARS-CoV-2 are thought to share a common ancestor [10]. On 30 January 2020, the WHO declared the outbreak a Public Health Emergency of International Concern (PHEIC) after COVID-19 had spread to 18 countries through person-to-person contact. In the United States, a major crisis began on 26 February 2020, when the first case not imported from China was identified. On 11 March 2020, after the number of COVID-19 infections outside China had risen 13-fold and the number of affected countries had tripled, the WHO declared COVID-19 a pandemic because of the serious threat it poses to public health worldwide. The number of COVID-19 cases registered across the world has exceeded the records of previous pandemics. It is considered the most dangerous disease to date because of its rapid transmission [11].

The first COVID-19 case in India was registered on January 30, 2020. In March, the total number of COVID-19 infections started escalating. Most of these cases were connected to people with a history of travel to other countries affected by COVID-19 [12]. The Indian government took strict action by suspending all visas to India with effect from 13 March 2020. As of May 14, 2020, the cumulative number of registered cases in India was 81,997. The number of daily registered COVID-19 cases in India up to May 14, 2020, smoothed with a 7-day moving average (MA), is shown in Fig. 1.

Fig. 1 Number of daily cases registered in India up to May 14, 2020
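
The 7-day moving average used to smooth the daily counts in Fig. 1 can be illustrated with a short sketch. The text does not say whether the window is trailing or centered, so a trailing window is assumed here, and the function name `trailing_moving_average` and the sample counts are hypothetical.

```python
def trailing_moving_average(values, window=7):
    """Return the mean of each full trailing window of `window` values.

    Assumption: a trailing (not centered) window; the first window - 1
    points, which lack a full window, are skipped.
    """
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= values[i - window]
        if i >= window - 1:
            averages.append(running_sum / window)
    return averages


# Hypothetical daily counts, for illustration only.
daily_cases = [1, 2, 3, 4, 5, 6, 7, 8]
smoothed = trailing_moving_average(daily_cases, window=7)
```

With eight daily values and a 7-day window, only two smoothed points are produced, one for each complete window.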

Over the past decades, progress in sensor technology, biological understanding, and mathematical techniques has contributed to the growing significance of modeling in health and bioinformatics. A mathematical model can be described as a depiction of a system using mathematical concepts and language, which facilitates interpretation of the system, analysis of the influence of its different elements, and prediction of patterns of behavior [13]. Because mathematical modeling demands transparency and certainty regarding inferences, it enables us to evaluate our understanding of the epidemiology of an infection by comparing model results with recognized patterns. In medicine, mathematical models are suitable for research on epidemiology, planning and assessment of preventive and control programs, clinical investigations, health and cost–benefit analysis, investigation of patients, and maximizing the efficacy of operations directed at attaining stated goals with existing resources [14]. A statistical model incorporates a set of statistical assumptions to approximate reality and to make predictions from these approximations. The advantage of a statistical model is that it summarizes the results of a test and presents them so that patterns within the data are easier to see and understand. Statistical models allow clinical analysts to draw reasonable and accurate inferences from gathered information and to make reliable decisions in the presence of uncertainty.

A mathematical procedure known as decomposition method has been suggested by Adomian [15] to provide solutions to the problems of neuroscience, such as the conduction of nerve impulses, analyzing the behavior of the immune system or observation of medication effects, and so on. Further, the results demonstrate the accuracy and efficacy of the proposed method. A mathematical model to predict whether isolation and quarantine can stop the spread of SARS has been developed by Castillo-Chavez et al. [16]. The amount of data required to predict SARS has been reduced due to the simplicity of questions and assumptions in the proposed model. Further, results indicate that the recommended model can reduce the size of the SARS outbreak by a factor of 1000. To determine the risk of non-immune persons obtaining dengue when traveling, a mathematical model has been represented by Massad et al. [17]. Further, the model is tested using Singapore data and the results depict the robustness of the proposed mathematical model in predicting the risk of getting dengue when traveling to countries having dengue-endemic. To forecast the spread of infectious diseases like dengue, two statistical models, namely ARIMA model and the Knorr–Held two-component (K–H) model, have been suggested by Earnest et al. [18]. The proposed models have been validated on Singapore dengue fever data. Further, the performance of the models has been distinguished with the Mean Absolute Percentage Error (MAPE). The results show that the K–H model results in a lesser MAPE value of 17.21 and takes a longer time to execute when compared to the ARIMA model. To analyze clinical data and more complicated data, the concept of linear and logistic regressions along with a modern statistical model known as Bayesian networks has been described by Yoo et al. [19]. Using the modern statistical model, the interactions among clinical, genomic, and environmental data have been represented. 
Further, it is also concluded that the modern statistical model outperforms in analyzing both clinical and complicated data. To analyze tuberculosis epidemiology, a statistical model named a Bayesian model has been proposed by Getoor et al. [20]. Statistical relation models which are constructed using a data-driven method are used to model distributions over relational domains. The model has been applied to the San Francisco tuberculosis patient data. Further, results indicate the potentiality of the proposed model over other conventional statistical approaches.

From the past few pandemics, the assessment of human loss and the prediction of mortality rate until certain period or closure of the pandemic has been performed successfully using the statistical models. In the present pandemic, researchers and technocrats have been using the same statistical procedures in the assessment of spread rate and mortality rate as these models show better performance in the prediction of earlier epidemics. The statistical model based on multivariate analysis has been proposed by Xu et al. [21] to determine the false-negative results as well as window period for testing positive. This model is used to determine the clinical symptoms that are important for detecting the false-negative results of SARS-CoV-2. Moreover, a prediction model based on the clinical characteristics has been proposed to identify the right time for testing. Further, the findings show that the proposed model provides better accuracy in the clinical diagnosis of the COVID-19 pandemic. To estimate the dynamics of disease transmission over time, a statistical model combined with data of COVID-19 cases in Wuhan has been proposed by Kucharski et al. [22]. The proposed model has been evaluated on publicly available datasets on cases in Wuhan as well as on the International cases exported from Wuhan. Based on the findings, the authors concluded that there will be a decline in the transmission of COVID-19 in Wuhan during late January 2020. An analysis based on Boltzmann’s function to predict the number of deaths in China has been proposed by Gao et al. [23]. From the findings, it can be concluded that the assessment of the severity of the situation can be better predicted using the proposed method. To calculate the real number of contaminated people and to assume the infection fatality ratio (IFR), a novel mechanistic statistical model combined with the SIR (Susceptible, Infected and Recovered) has been proposed by Roques et al. [24]. 
The findings show that the IFR is compatible with the earlier findings in China (0.66%) and lesser than the earlier computed value on the Diamond Princess Cruise ship data (1.3%). A statistical model based on Holt’s second-order exponential smoothing method and ARIMA model has been proposed by Poonia and Azad [25] to forecast COVID-19 infected patients in 28 states and 5 union territories of India. From the results, it can be observed that the cumulative number of cases in India will increase to 36,335.63 and simultaneously the mortality rate may increase to 1099.38 by 1 May 2020. The other analysis done on the applicability of mathematical and statistical models has been depicted in the following Table 1.

Table 1 Applicability of mathematical and statistical models in the prognosis and forecasting of COVID-19 disease

Despite the successful application of statistical models in the prognosis and forecasting of the COVID-19 pandemic, certain limitations exist. The Moving-Average model performs well with stationary data, but it does not consider the trend or seasonality of time series data. In the Auto-regressive model, the assumption of uncorrelated errors is easily violated because the independent variables are time-lagged values of the dependent variable. The ARIMA model gives poor results for long-term forecasting. Although ARIMA is the most widely used model for time series forecasting, it has further limitations: (i) it has no automatic updating feature of the kind found in smoothing models, so the entire modeling process has to be repeated from the beginning whenever new data become available, (ii) it is not always adequate for complex real-world problems because ARIMA models cannot handle non-linear patterns [26], and (iii) it does not support changes in the middle of the prediction phase [27]. Therefore, in this paper, we propose the Holt–Winters model for forecasting time series data with seasonal and trend patterns. The Holt–Winters method is a time-series forecasting method that extracts the level, trend and seasonal structure from past data in order to forecast the future trend more precisely.

Proposed Methodology

In time series analysis, error, trend, seasonality (ETS), ARIMA and Holt–Winters are the main classical models that have been widely used as predictors. Holt–Winters is a statistical model, also called the triple exponential smoothing model, used for short-term forecasting of series with seasonal and trend patterns. In the Holt–Winters model, three components (level, trend, and season) are necessary for forecasting. Each component has an associated smoothing parameter whose value lies between 0 and 1. Based on the pattern of the season, the Holt–Winters model is classified as additive or multiplicative. The additive method is used when the seasonal variations are roughly constant throughout the series, i.e., when the seasonal effect is independent of the prevailing mean level of the time series. The multiplicative method is used when the seasonal variations change in proportion to the level of the series, i.e., when the seasonal variations rise as the mean level of the time series rises [28]. In COVID-19 time series data, trends can be observed due to the repetition of certain patterns at regular intervals of time because of external factors such as a nationwide lockdown, mandatory social distancing, and quarantines. Therefore, in this research, the multiplicative method has been considered because the variations in the COVID data are quite frequent. In this method, the seasonal components are expressed in relative terms, such as percentages, and the series is seasonally adjusted by dividing through by the seasonal component. Algorithm 1 represents the procedure of the Holt–Winters model for COVID forecasting. The algorithm of the Holt–Winters multiplicative model makes use of a state space model to provide exponential smoothing with statistical foundations similar to those used in regression and the Box/Jenkins methodology [28]. The relationship to the Holt–Winters multiplicative smoothing equations is revealed by providing equivalent exponential smoothing equations for the transition equations of level and trend. The observation equation, represented as “\({y}_{t}\)”, discloses the relationship between the time series and the state variables. The parameter “\({p}_{t}\)” represents the level of the time series, “\({q}_{t}\)” represents the growth per period and “\({r}_{t}\)” represents the seasonal factor. The error terms are represented by “\({\varepsilon }_{t}\)”, which are independent of the past values of the time series and state variables. The parameters “\({\widehat{p}}_{t}\)”, “\({\widehat{q}}_{t}\)” and “\({\widehat{r}}_{t}\)” represent the Holt–Winters multiplicative smoothing equations. The parameter “\(m\)” represents the frequency of the seasonality, i.e., the number of seasons in a year. The framework of the proposed work is represented in Fig. 2.
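
As an illustration of how the level \({p}_{t}\), growth \({q}_{t}\) and seasonal factor \({r}_{t}\) interact, the following is a minimal sketch of the textbook Holt–Winters recursions with additive trend and multiplicative seasonality. It is not a reproduction of Algorithm 1: the smoothing parameter names `alpha`, `beta`, `gamma`, the simple initialisation from the first two seasons, and the function name are all assumptions made for this example.

```python
def holt_winters_multiplicative(y, m, alpha, beta, gamma, horizon):
    """Forecast `horizon` steps ahead (additive trend, multiplicative season).

    y      : observed series (strictly positive values)
    m      : number of periods in one season
    alpha, beta, gamma : smoothing parameters in [0, 1] (assumed names)
    """
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons to initialise")
    if min(y) <= 0:
        raise ValueError("the multiplicative form needs strictly positive data")

    # Assumed simple initial states: first-season mean for the level,
    # average season-to-season change for the growth, ratios for the season.
    p = sum(y[:m]) / m
    q = (sum(y[m:2 * m]) - sum(y[:m])) / (m * m)
    r = [y[i] / p for i in range(m)]

    for t in range(m, len(y)):
        r_old = r[t % m]          # seasonal factor from one season ago
        p_prev, q_prev = p, q
        # Level: deseasonalised observation blended with the one-step projection.
        p = alpha * (y[t] / r_old) + (1 - alpha) * (p_prev + q_prev)
        # Growth: latest change in level blended with the previous growth.
        q = beta * (p - p_prev) + (1 - beta) * q_prev
        # Season: observation relative to the projected level.
        r[t % m] = gamma * (y[t] / (p_prev + q_prev)) + (1 - gamma) * r_old

    n = len(y)
    # h-step forecast: projected level times the matching seasonal factor.
    return [(p + h * q) * r[(n - 1 + h) % m] for h in range(1, horizon + 1)]


# Hypothetical series with a flat level of 10 and a two-period season.
forecast = holt_winters_multiplicative([5, 15, 5, 15, 5, 15], m=2,
                                       alpha=0.3, beta=0.1, gamma=0.2,
                                       horizon=3)
```

Because the seasonal factors enter as ratios, this form requires strictly positive observations, so early days with zero reported cases would need to be excluded or handled separately.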

Fig. 2 Framework of the proposed method

Algorithm 1 Holt–Winters model for COVID forecasting

Experimental Setup

The experiments were carried out on a Lenovo T520 with the Windows 10 operating system, an Intel Core i5 processor and 6 GB of RAM. For experimentation, we used Python with packages such as NumPy and pandas. The Seaborn and Matplotlib modules were used for data visualization, and the statsmodels package was used for statistical analysis of the data in the models. In the study, the proposed Holt–Winters model is compared with various statistical models, namely Holt’s Linear, MA, AR and ARIMA. The data set is divided in an 85:15 ratio: 85 percent of the data, i.e., the period from 30 January 2020 to 28 April 2020, is used for training the models, and 15 percent of the data, i.e., the period from 29 April 2020 to 14 May 2020, is used for testing them. The parameter settings of these models are displayed in Table 2.
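
The 85:15 split described above is chronological rather than random. A minimal sketch, assuming one observation per calendar day between the stated start and end dates (the helper name `chronological_split` is hypothetical), shows that cutting at 85 percent of the days reproduces the stated training and testing periods:

```python
from datetime import date, timedelta


def chronological_split(items, train_fraction=0.85):
    """Split an ordered sequence into leading train and trailing test parts."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


# One entry per calendar day from 30 Jan 2020 to 14 May 2020 (assumption:
# the series is daily with no missing days).
start, end = date(2020, 1, 30), date(2020, 5, 14)
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

train_days, test_days = chronological_split(days)
print(train_days[0], train_days[-1], len(train_days))
print(test_days[0], test_days[-1], len(test_days))
```

Keeping the test period strictly after the training period avoids leaking future observations into the fitted model, which matters for any forecasting comparison.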

Table 2 Parameter setting of various models with different states

Data Preprocessing

Feeding data and preparing valid data are the primary steps in building a model. In this study, we considered the data of patients from different states in India published by Covid19india.org [29] from January 30, 2020 to May 13, 2020. The data contain features such as Date, Age Bracket, Gender, Patient_Status, City, District, State, State code, Notes, Nationality, Source_1, Source_2 and Source_3. The data are imported as a data frame using web scraping. The data frame is then segregated by date, state and patient status, and the counts are divided into confirmed, recovered, and deceased. The final data set has the columns Date, Name of State/UT, Latitude, Longitude, Total Confirmed cases, Death, Cured/Discharged/Migrated, New cases, New deaths and New recovered. Data are considered up to 14 May 2020. We selected the records with Name of State/UT equal to Maharashtra, Tamil Nadu, Delhi, Gujarat, and Andhra Pradesh and applied the various models.
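
The aggregation step can be sketched as follows. The study uses pandas; this example uses only the standard library so that it is self-contained. The field names Date, State and Patient_Status come from the feature list above, while the sample records, the status label "Confirmed" and the function name are hypothetical.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical patient-level records, for illustration only.
records = [
    {"Date": "2020-03-09", "State": "Maharashtra", "Patient_Status": "Confirmed"},
    {"Date": "2020-03-09", "State": "Maharashtra", "Patient_Status": "Confirmed"},
    {"Date": "2020-03-10", "State": "Maharashtra", "Patient_Status": "Confirmed"},
    {"Date": "2020-03-10", "State": "Delhi", "Patient_Status": "Confirmed"},
    {"Date": "2020-03-12", "State": "Maharashtra", "Patient_Status": "Recovered"},
]


def daily_and_cumulative(rows, state, status="Confirmed"):
    """Return (date, new count, running total) tuples for one state and status.

    Dates are ISO strings, so sorting them as text is chronological.
    Days without any matching record are omitted, not zero-filled.
    """
    per_day = Counter(
        row["Date"]
        for row in rows
        if row["State"] == state and row["Patient_Status"] == status
    )
    dates = sorted(per_day)
    new_counts = [per_day[d] for d in dates]
    totals = list(accumulate(new_counts))
    return list(zip(dates, new_counts, totals))


maharashtra_confirmed = daily_and_cumulative(records, "Maharashtra")
```

The running total corresponds to the "Total Confirmed cases" column and the per-day count to "New cases".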

Performance Measures

In this section, we discuss the evaluation measures Root Mean Square Error (RMSE), Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE) and Mean Absolute Error (MAE), which are used as the primary measures to evaluate the performance of the models. RMSE, MSE, MAPE and MAE are standard regression metrics that express the efficiency of a model in terms of its error rate. RMSE is calculated as the square root of the mean of the squared differences between predictions and actual outcomes, as shown in Eq. (11), where ‘n’ indicates the total number of predictions.

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{({\mathrm{pred}}_{i}-{\mathrm{actual}}_{i})}^{2}}$$
(11)

MSE is used to determine the average squared difference between the estimated and actual outcomes as shown in Eq. (12). The total number of predictions is indicated by ‘n’ in Eq. (12).

$$\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}{({\mathrm{pred}}_{i}-{\mathrm{actual}}_{i})}^{2}$$
(12)

MAE is the average of the absolute differences between paired predicted and actual observations of the same quantity. In Eq. (13), ‘n’ indicates the total number of predictions.

$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|{\mathrm{pred}}_{i}-{\mathrm{actual}}_{i}\right|$$
(13)

MAPE is a standard loss function used to express the prediction accuracy of a forecast, as displayed in Eq. (14), where ‘n’ represents the total number of predictions. The absolute relative error is summed over every forecasted point in time and divided by the number of fitted points n.

$$\mathrm{MAPE}=\frac{1}{ n}\sum_{i=1}^{n}\left|\frac{({\mathrm{actual}}_{i}-{\mathrm{pred}}_{i})}{{\mathrm{actual}}_{i}}\right|$$
(14)
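
The four measures in Eqs. (11)–(14) can be written directly in Python. This is a minimal sketch with hypothetical function names and sample values; MAE is computed over absolute errors, as its name implies, and MAPE is returned as a fraction rather than a percentage, following Eq. (14).

```python
import math


def mse(actual, pred):
    """Mean of squared errors, Eq. (12)."""
    return sum((p - a) ** 2 for a, p in zip(actual, pred)) / len(actual)


def rmse(actual, pred):
    """Square root of the MSE, Eq. (11)."""
    return math.sqrt(mse(actual, pred))


def mae(actual, pred):
    """Mean of absolute errors, Eq. (13)."""
    return sum(abs(p - a) for a, p in zip(actual, pred)) / len(actual)


def mape(actual, pred):
    """Mean of absolute relative errors, Eq. (14); actual values must be non-zero."""
    return sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)


# Hypothetical actual and predicted case counts, for illustration only.
actual_cases = [100, 200, 400]
predicted_cases = [110, 190, 400]
scores = {
    "RMSE": rmse(actual_cases, predicted_cases),
    "MSE": mse(actual_cases, predicted_cases),
    "MAE": mae(actual_cases, predicted_cases),
    "MAPE": mape(actual_cases, predicted_cases),
}
```

RMSE, MSE and MAE are expressed in the units of the case counts (squared, for MSE), so they are not comparable across states of very different size, whereas MAPE is scale-free.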

Environmental Setup

Here, we discuss the parameter settings of the Holt–Winters, Holt’s Linear, MA, AR and ARIMA statistical models for the states of Andhra Pradesh, Maharashtra, Gujarat, Delhi, and Tamil Nadu. The performance of the models is evaluated using the RMSE. The parameter settings of each model are given in Table 2.

Results Analysis

The first case in India was reported on January 30, 2020. In February, only three cases were reported, and the count remained constant throughout the month. From March 2020, the number of cases started increasing steadily. To predict the dynamics of transmission, different time-series statistical models, namely Holt–Winters, Holt’s Linear, MA, AR and ARIMA, are simulated on the India COVID-19 statistics [29]. Using these models, we forecast the number of confirmed cases for Andhra Pradesh, Maharashtra, Gujarat, Delhi and Tamil Nadu up to June 21, 2020. The RMSE, MSE, MAE and MAPE scores of the Holt–Winters, Holt’s Linear, MA, AR and ARIMA models are shown in Table 3.

Table 3 RMSE, MSE, MAE, and MAPE Scores of various models w.r.t. (a) Andhra Pradesh, (b) Maharashtra, (c) Gujarat, (d) Delhi, and (e) Tamil Nadu

From Table 3a–d, it can be observed that the RMSE, MSE, MAE and MAPE values of the Holt–Winters model are lower than those of the other models, namely Holt’s Linear, AR, MA and ARIMA. In Table 3e, the RMSE, MSE, MAE and MAPE values of the ARIMA model are lower than those of the Holt–Winters model. Thus, the Holt–Winters model performed best in four states (Andhra Pradesh, Maharashtra, Gujarat and Delhi), while the ARIMA model outperformed it in only one state (Tamil Nadu). Table 4 shows some of the predictions of the total number of cases using the Holt–Winters, Holt’s Linear, AR, MA and ARIMA models for Andhra Pradesh, Maharashtra, Gujarat, Delhi and Tamil Nadu. The predictions are computed up to June 21, 2020. The actual number of cases registered is given in Table 5.

Table 4 Forecast of total number of COVID-19 cases in Andhra Pradesh, Maharashtra, Gujarat, Delhi, and Tamil Nadu
Table 5 Actual registered cases in Andhra Pradesh, Maharashtra, Gujarat, Delhi and Tamil Nadu

From Tables 4 and 5, it can be noted that the numbers of COVID-19 cases forecast by the Holt–Winters model are close to the actual numbers of registered cases in Andhra Pradesh, Maharashtra, Gujarat, Delhi and Tamil Nadu. Therefore, it can be concluded that the Holt–Winters model produced better predictions of COVID-19 cases than the other models.

The predicted numbers of cases can also be read from Fig. 3a–e, which shows how closely each model tracks the actual values of COVID-19 cases for Andhra Pradesh, Maharashtra, Gujarat, Delhi, and Tamil Nadu individually. From Fig. 3a–d, it can be observed that the predictions of the Holt–Winters model, represented by red ticks, are closest to the actual validation values, represented by green ticks. In Fig. 3e, the predictions of the ARIMA model, represented by yellow ticks, are closest to the actual validation values.

Fig. 3 Performance of various models: a Andhra Pradesh, b Maharashtra, c Gujarat, d Delhi, and e Tamil Nadu

Figure 4a–e shows the number of confirmed cases predicted by the various time series models for Andhra Pradesh, Maharashtra, Gujarat, Delhi, and Tamil Nadu, respectively, from May 15, 2020 to June 21, 2020. From Fig. 4a–e, it can be inferred that the numbers of confirmed COVID-19 cases predicted by the Holt–Winters model are closer to the actual values than those of the other models.

Fig. 4 The outbreak prediction of a Andhra Pradesh, b Maharashtra, c Gujarat, d Delhi, and e Tamil Nadu

From the above observations, it can be concluded that the Holt–Winters model performed better than the Holt’s Linear, MA, AR and ARIMA models, obtaining lower RMSE, MSE, MAPE and MAE values when trained on 85% of the data. To statistically validate the results, the Friedman test [30] has been performed on the results of all the models over all the considered datasets.

The Friedman test is a non-parametric test that considers the average results of all the models in the form of ranks (assigned in ascending order of performance) [31]. The null hypothesis for this test is that “all the models have similar performance and their differences are merely random”. Table 6 indicates the ranks (in brackets) assigned to all the models w.r.t. the datasets. From these ranks, the Friedman statistic “\({\chi }_{F}^{2}\)” has been evaluated as 16.31. After obtaining \({\chi }_{F}^{2}\), the \({F}_{F}\) statistic is computed and found to be 10.615. Finally, the critical value for the corresponding degrees of freedom at a significance level of \(\alpha\) = 0.05 is obtained as 5.19. The null hypothesis is rejected because the critical value (5.19) is smaller than the \({F}_{F}\) statistic (10.615). Details of the calculation of the Friedman statistic \({\chi }_{F}^{2}\), the \({F}_{F}\) statistic, and the critical value can be found in [32, 33]. Hence, the differences in performance between the proposed model and the other models under study are statistically significant. Combining the results from the performance metrics, tables, and graphs, it is evident that the Holt–Winters method fits the growing trend more efficiently than the other models, namely Holt's Linear, MA, AR and ARIMA, in forecasting the number of confirmed cases.
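
The rank-based statistics can be sketched in code under the usual textbook definitions of the Friedman statistic and the derived \({F}_{F}\) statistic (often attributed to Iman and Davenport). The text does not spell out these formulas, so their use here is an assumption, and the error table below is hypothetical and unrelated to Table 6.

```python
def friedman_statistics(errors):
    """Return (average ranks, Friedman chi-square, F_F statistic).

    errors[i][j] is the error of model j on dataset i; a lower error gets
    a better (smaller) rank. Assumption: no ties within a dataset.
    """
    n, k = len(errors), len(errors[0])  # n datasets, k models
    rank_sums = [0.0] * k
    for row in errors:
        order = sorted(range(k), key=lambda j: row[j])
        for position, j in enumerate(order, start=1):
            rank_sums[j] += position
    avg_ranks = [s / n for s in rank_sums]

    # Friedman statistic from the average ranks.
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    )
    # F_F statistic derived from chi2; infinite when the ranking is
    # identical on every dataset (denominator of zero).
    denom = n * (k - 1) - chi2
    f_stat = float("inf") if denom == 0 else (n - 1) * chi2 / denom
    return avg_ranks, chi2, f_stat


# Hypothetical errors of three models on four datasets.
example_errors = [
    [1.0, 2.0, 3.0],
    [1.5, 2.5, 3.5],
    [0.9, 3.0, 2.0],
    [2.0, 1.0, 3.0],
]
avg_ranks, chi2, f_stat = friedman_statistics(example_errors)
```

In this formulation, the \({F}_{F}\) value is compared with a critical value of the F distribution with k − 1 and (k − 1)(n − 1) degrees of freedom; the null hypothesis is rejected when \({F}_{F}\) exceeds that value.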

Table 6 Assigned Friedman’s rank to the considered models

Conclusion

Since the first case of COVID-19 in India, the number of registered cases has grown steadily and poses a great threat to public health in India. In this paper, we employed the Holt–Winters model for forecasting the number of COVID-19 cases in the Indian states of Maharashtra, Tamil Nadu, Gujarat, Delhi, and Andhra Pradesh up to June 21, 2020. The future number of cases has been predicted by analyzing the data from January 30, 2020 to May 14, 2020. The performance of the model has been evaluated using RMSE, MSE, MAPE and MAE, and the analysis shows that the Holt–Winters method has lower RMSE, MSE, MAPE and MAE values and generates more accurate predictions than the Holt’s Linear, AR, MA and ARIMA models. From the analysis, it can be predicted that the number of cases in other states of India may also increase in the near future. Based on the predictions, the government has to employ strict policies, such as awareness programs and strict lockdowns, to prevent the spread of the disease. Moreover, the government also has to implement necessary measures for enhancing medical facilities throughout India.