Introduction

The coronavirus case was emerged in Wuhan city, China in December 2019 [1]. Initially, anonymous pneumonia case was reported which was studied by respiratory samples and experts declared as pneumonia. It was later known as coronavirus pneumonia due to a novel coronavirus [2]. The World Health Organization officially stated the disease ‘COVID-19.′ It spreads very fast globally. There were 17,106,007 confirmed cases and 668,910 deaths globally as of July 31, 2020 [3]. COVID-19 becomes global threat for public health [4]. In India, the first active case was reported on 31-01-2020, but there was zero death recorded in January and February month. The major outbreak happened in the month of March. In this month, total cases reported was 1251 and total death was recorded as 32 [5]. This results in the public curfew on 22-03-2020 by the Government of India. After two days, the Government of India imposed 21 days of lockdown that was from 25-03-2020 to 14-04-2020. In the month of April, the situation was little serious as the total active cases were 33,050 and total death was 1074. Total new cases reported in this month were 31,801. This results in the second lockdown of 18 days starting from 15-04-2020 to 03-05-2020. India reached an alarming stage in the month of May. The total active cases of coronavirus in India reached 165,799 which was very threatening data. 133,998 new cases were reported till 29-05-2020. Total deaths reached 4706 in this month which is very disturbing for the people. This results in the third lockdown from 04-05-2020 to 17-05-2020 (14 days), followed by fourth lockdown starting from 18-05-2020 to 31-05-2020 (14 days). The Government of India has taken the key decision for imposing the lockdowns in the country. Surely, these lockdowns affect the Indian economy a lot. But life always comes first. If these lockdowns were not imposed then might be possible that these figures will be very threatening and India will be on the top of the world’s list in the active cases. This paper presents data analysis and visualization of India using Python programming. Further sections will explain the used methodology of AR and ARIMA models, case studies and discussion.

Machine Learning Model used for Time Series Analysis

AR and ARIMA time series methods are a univariate (single vector) time series without a trend and seasonal component, while other time series methods are multivariate with trend and seasonal component. Both the methods are the basic and easy to understand by using Python. This typically used allows the model to rapidly adjust for the sudden changes in the trends, resulting in more accurate forecasts other than various time series forecast methods. AR and ARIMA models are used as both give the accurate forecasts as compared to other models available in time series analysis. The dataset is used as a linear function of the observations at prior time steps, hence AR and ARIMA models give more precise, accurate and suitable forecast other than various methods. The following tools have been used for the analysis of AR and ARIMA models:

Machine Learning

In simple words, developers make the machine to learn and improve from ‘experience.’ Here, experience is the previous data based on which predictions will be made. It is an application of artificial intelligence. This paper explains the time series analysis as a machine learning model for the future predictions.

Time Series Analysis

This paper explains the ARIMA model for the time series analysis. This model works for the supervised learning in which previous data are present. There are other algorithms present for the supervised learning like linear regression, logistic regression, while time series analysis is a statistical technique that deals with the time series data or trend analysis. Time series data means that data are in a series of particular time periods or intervals. The data are considered in three types:

Time Series Data

A set of observations on the values that a variable takes at different times.

  1. i.

    Cross-sectional data: Data of one or more variables, collected at the same point in time.

  2. ii.

    Pooled data: A combination of time series data and cross-sectional data.

The above dataset is time series data which were set of observations on the values that a variable taken at different times.

Autoregressive Integrated Moving Average Model (ARIMA model)

An ARIMA model is a class of statistical models for analyzing and forecasting time series data. It explicitly caters to a suite of standard structures in time series data and as such provides a simple yet powerful method for making skillful time series forecasts. It is a generalization of the simpler autoregressive moving average and adds the notion of integration. Each of these components is explicitly specified in the model as a parameter. A standard notation is used of ARIMA(p,d,q) where the parameters are substituted with integer values to quickly indicate the specific ARIMA model being used. The parameters of the ARIMA model are defined as follows:

  • p: The number of lag observations included in the model is also called the lag order.

  • d: The number of times that the raw observations are differenced is also called the degree of differencing.

  • q: The size of the moving average window is also called the order of moving average [6, 7].

    $$ARIMA(p,d,q):X_{t} = \alpha_{1} X_{t - 1} + \alpha_{2} X_{t - 2} + \beta_{1} Z_{t - 1} + \beta_{2} Z_{t - 2} + Z_{t}$$
    (1)

where \(Z_{t} = X_{t} - X_{t - 1}\).

Here, Xt is the predicted number of confirmed COVID-19 new cases at tth day, α1, α2, β1 and β2 are parameters where as Zt is the residual term for tth day.

A linear regression model is constructed including the specified number and type of terms, and the data are prepared by a degree of differencing in order to make it stationary, i.e., to remove trend and seasonal structures that negatively affect the regression model. A value of 0 can be used for a parameter, which indicates to not use that element of the model. This way, the ARIMA model can be configured to perform the function of an ARMA model, and even a simple AR, I or MA model. Adopting an ARIMA model [8] for a time series assumes that the underlying process that generated the observations is an ARIMA process. This may seem obvious but helps to motivate the need to confirm the assumptions of the model in the raw observations and in the residual errors of forecasts from the model.

Case Study

The dataset [5] has collected and downloaded from 1.1.2020 to 31.07.2020. Table 1 shows the data of four major columns which are total cases, new cases, total deaths and new deaths. In this research paper, only new cases have been considered for the reference. The new cases data are dependent variable for future prediction using ARIMA seires analysis.

Table 1 Data of India (Monthly)

Here, for observations, it has taken the four series which are total cases, new cases, total deaths and new deaths in order to understand how the coronavirus [COVID19] has affected the country. It is clear from Table1 that the situation was under control in the month of January and February. Only one series has been taken, which is new cases, for the time series analysis.

Figure 1 shows that the new cases keep on increasing monthly. The average also keeps on increasing. For creating the ARIMA model, firstly, it has to make the average stationary. It can be done by using the shift function. It will shift the data by 1. It makes the 0th entry as NaN and shifted to the 1th location entry. Then, the diff function has been used which makes the difference with the period 1. This difference function is d used in the ARIMA model (p, d, q). After that, it has taken the data from the 1th location as the 0th location is NaN which is “not a number.”

Fig. 1
figure 1

Line plot of new cases in India before using diff function

Figs. 2and 3 show that the autocorrelation can observe that it is showing maximum four lags and it has been taken maximum four lags in ARIMA model. Now, it has been created the train set and test set from the given set of new cases in India. The total set size is 85. Out of which author has divided train set which is 80% of total dataset (which is 68) and 20% of total dataset comes under test dataset (which is 17), which can be done by using following command. In the next step, the AR (autoregression) model has been implemented. This is the first parameter of the ARIMA model which is p. Then, observe the prediction accuracy in the AR model. The line plot has been drawn of observed values and predicted values.

Fig. 2
figure 2

Line plot of new cases in India after using diff function

Fig. 3
figure 3

Line plot of autocorrelation of new cases in India

Fig. 4 shows the line plot of observed values and the predicted values using AR model. It can be observed that it requires finer tuning between observed and predicted values. Finally, ARIMA model has been implemented in order to observe the relation between observed values this is not but the test data and the predicted values. As per the above discussion, ARIMA model requires three parameters which are p, d, q. After that it can be seen the summary of the ARIMA model result and implementing akaike information criteria function using Python. It is a single number score that can be used to determine which one of the multiple models is most likely the best model for the given set of combinations. Fig. 5 shows the ARIMA model summary and the aic value. This is the best value in the multiple combinations of (p, d, q). This aic value is achieved by (2, 2, 4) combination.

Fig. 4
figure 4

Line plot of AR model

Fig. 5
figure 5

ARIMA model results

This is the test data and the predicted data using ARIMA model, the prediction values are 17 as the step = 17. Now we observe the line plot (Fig. 6) of the above result.

Fig. 6
figure 6

Line plot of ARIMA model

As it can be observed that now the relation between observed values and predicted values is finely tuned with the combination of (p = 2, d = 2, q = 4). Now it can be shown that the residual plot of ARIMA model by using resid function. Line plot (Fig. 7) of the residual errors, is shown that there may still be some trend information not captured by the model.

Fig. 7
figure 7

Line plots of residual errors

Next, it is defined by a density plot (Fig. 8) of the residual error values, suggesting the errors are Gaussian but may not be centered on zero.

Fig. 8
figure 8

Gaussian curve

figure a

The distribution of the residual errors is displayed. The results show that indeed there is a bias in the prediction (a nonzero mean in the residuals). Note that, although above it is used the entire dataset for time series analysis, ideally it would perform this analysis on just the training dataset when developing a predictive model. Table 2 shows that data are more precise in ARIMA model as compared to AR model.

Table 2 Comparison between AR and ARIMA model

Conclusion

From the above results and line plots, it is shown that the predictive values from ARIMA model are more fit than AR model. These predictive values are for next coming days. It is clear that the data of new cases will keep on increasing as the predictive values keep on increasing. In the coming future, human being has to fight with coronavirus and follow the Government of India guidelines. Lockdown is not the only solution. Lockdown is not the only solution, but it is also the social and moral responsibility of each and every citizen to follow the guidelines. Today the whole world is at high risk. In India most of the active cases are asymptomatic which mean that person is infected but does not show the symptoms of the coronavirus. The name of the disease is COVID-19, which is dangerous. People have to be well aware of the virus. Presently, India’s new cases data are not so alarming as compared to our population. This is only possible with the series of lockdowns. New cases data will keep on increasing for some time after unlocking the country, but after some time, these data will be under control.