In this section, we describe in detail our proposed scheme, which uses a linear regression model to predict the total number of confirmed cases, active positive cases, and recoveries. We first discuss infection spreading and then the linear regression model used in the proposed work. For the proposed scheme, state-wise data on confirmed cases, active positive cases, and recoveries were collected.
Infection spreading: doubling exponential model
We define infection spreading using the concept of an exponential growth function, in particular the doubling exponential. We first describe the doubling exponential briefly and then show how infection spreading is derived from it. Many mathematical models that characterize early epidemic growth follow an exponential curve, and some of them characterize the exponential growth by the doubling time. The doubling time is the time taken for the number of infections to double from a given day. We also adopt the doubling concept, but with a different interpretation. For our model, we consider the exponential function
$$\begin{aligned} y(t)= 2^t \end{aligned}$$
(1)
Here, instead of finding the number of infections at time t, we find the doubling time from the given y(t), i.e., the number of positive cases. In our experiment, we define the doubling time as the number of days taken for the current count to double. Mathematically, if n is the positive case count at time t, then \((2\times n)\) is the positive case count at time \((t + t_{d})\), where \(t_d\) is the doubling time. The doubling time is inversely related to the infection spread: a higher doubling time indicates that the infection is spreading slowly, a lower doubling time signifies faster spread, and a constant doubling time implies that the infection is growing at a constant exponential rate. The minimum doubling time indicates the fastest growth rate and can be considered the peak point of the pandemic. Several external factors, such as social distancing, lockdown, containment of red zones of infection, and the number of tests per day, affect the doubling time. Because of this inverse relation, for better visualization we plot the infection spreading of positive cases over time rather than doubling time versus number of positive cases, where the infection spreading is calculated as (maximum doubling time - doubling time). For example, for the doubling times in Table 1 (2, 36, 3, 10, 3, 6, 22, 36, 12, and 14), the maximum doubling time is 36, so the infection spreading values are \(36-2 = 34\), \(36-36 = 0\), \(36-3 = 33\), \(36-10 = 26\), and so on.
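The two quantities defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the experiments: the function names are hypothetical, the doubling time is taken as the first day on which the count is at least twice the current count, and the toy case series is made up for demonstration.

```python
def doubling_times(cases):
    """For each day t, return the smallest t_d with cases[t + t_d] >= 2 * cases[t].

    `cases` is a list of cumulative positive-case counts, one entry per day.
    Days whose count has not doubled by the end of the series get None.
    Assumption: "at least double" is used because counts are discrete.
    """
    result = []
    for t, n in enumerate(cases):
        t_d = None
        if n > 0:
            for later in range(t + 1, len(cases)):
                if cases[later] >= 2 * n:
                    t_d = later - t
                    break
        result.append(t_d)
    return result


def infection_spreading(times):
    """Spreading value for each entry: (maximum doubling time - doubling time)."""
    longest = max(times)
    return [longest - t_d for t_d in times]


# Doubling times quoted in the text (Table 1).
table_1 = [2, 36, 3, 10, 3, 6, 22, 36, 12, 14]
spreading = infection_spreading(table_1)
print(spreading)  # [34, 0, 33, 26, 33, 30, 14, 0, 24, 22]

# Toy cumulative counts (illustrative only, not real data).
toy_cases = [1, 2, 4, 5, 6, 8, 9, 10, 12, 16]
print(doubling_times(toy_cases))  # [1, 1, 3, 4, 4, 4, None, None, None, None]
```

The first four spreading values reproduce the worked example in the text (34, 0, 33, 26). Entries of None mark days for which the series is too short to observe a doubling.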
For better visualization and insight, we consider the states Maharashtra, Delhi, Kerala, West Bengal, and Assam, as shown in Fig. 2. Figure 2 suggests that the spreading was very high initially. In practice that was not the case; the graph shows it only because of the low initial values. The positive case count starts at one, then doubles to two \((2 \times 1)\), then to four \((2\times 2)\), and so on; this does not indicate high infection but is a boundary value problem arising from the small starting counts. Next, the figure shows that the infection spreading reaches its highest level, i.e., the peak, then gradually falls and later gradually rises again. This indicates that a second phase of infection spreading is under way in the state of Kerala. The most notable case is Delhi, where the spreading curve is gradually diminishing. For West Bengal, on the other hand, the spreading tends to increase.
Piecewise linear regression model
Our prediction model uses piecewise linear regression, a form of linear regression. Sometimes data do not follow a single linear pattern, as shown in Fig. 3a. If we nevertheless model such data with a single linear regression, the fit is poor, and predictions from such a model have a high error. In that situation one line is simply not enough to fit the data, and piecewise linear regression overcomes this limitation, as depicted in Fig. 3b. When the data set follows different linear trends over different partitions of the data, the regression function should be modeled in several pieces. Each piece is a linear regression over one partition, and whether the pieces are connected depends on the data and the problem. When they are connected, the connecting points are known as breakpoints, i.e., the points where the slope changes.
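The difference between a single line and a piecewise fit can be illustrated with a small Python sketch. The data below are synthetic (slope 1 for the first five points, slope 5 afterwards), the split position is assumed to be known, and the helper names are hypothetical; the sketch only shows why fitting each partition separately lowers the error.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = c + m * x; returns (c, m)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    return c, m


def sum_squared_error(xs, ys, c, m):
    """Sum of squared residuals of the line y = c + m * x on the given points."""
    return sum((y - (c + m * x)) ** 2 for x, y in zip(xs, ys))


# Synthetic data with two linear trends (illustrative only).
xs = list(range(9))
ys = [0, 1, 2, 3, 4, 9, 14, 19, 24]

# One line over all points.
c, m = fit_line(xs, ys)
single_error = sum_squared_error(xs, ys, c, m)

# Two lines, one per partition; `split` is the index where the second
# partition starts and is assumed known here.
split = 5
c1, m1 = fit_line(xs[:split], ys[:split])
c2, m2 = fit_line(xs[split:], ys[split:])
piecewise_error = (sum_squared_error(xs[:split], ys[:split], c1, m1)
                   + sum_squared_error(xs[split:], ys[split:], c2, m2))

print(single_error > piecewise_error)  # True
```

On this toy series each partition is exactly linear, so the piecewise error is essentially zero while the single-line error is not.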
The point \(x = p\) is where two lines join, i.e., a breakpoint. Writing the two lines as \(y = c_1 + m_1 x\) and \(y = c_2 + m_2 x\), we assume that the regression function is continuous at the breakpoint, so the two values of y must be equal there (when \(x = p\)), i.e., we have the relation
$$\begin{aligned} c_1+m_1 p=c_2+m_2 p \,\hbox {or}\, c_2=c_1+(m_1-m_2 )p \end{aligned}$$
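As a quick numerical check of this relation, the following Python sketch evaluates a two-segment function whose second intercept is derived from the continuity condition. The function name and the parameter values are hypothetical and chosen only for illustration.

```python
def piecewise_predict(x, c1, m1, m2, p):
    """Evaluate a two-segment line that is continuous at the breakpoint p.

    The second intercept is not a free parameter: it follows from
    c2 = c1 + (m1 - m2) * p.
    """
    c2 = c1 + (m1 - m2) * p
    if x <= p:
        return c1 + m1 * x
    return c2 + m2 * x


# Hypothetical parameters, not fitted to any data.
c1, m1, m2, p = 2.0, 1.0, 3.0, 4.0

c2 = c1 + (m1 - m2) * p   # intercept of the second segment
left = c1 + m1 * p        # first segment evaluated at the breakpoint
right = c2 + m2 * p       # second segment evaluated at the breakpoint
print(left, right)        # 6.0 6.0
```

Because \(c_2\) is determined by \(c_1\), \(m_1\), \(m_2\), and \(p\), a continuous two-segment model has one free parameter fewer than two independent lines.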
The same concept can be extended to multiple breakpoints, depending on the data. To implement this model for a given data set, the main challenge is to partition the data set for the piecewise regression; in other words, the problem is to find the breakpoints in the data. In our experiment, we do this by computing the slope between each pair of consecutive points, i.e., if there are n points, there are \((n-1)\) such slopes. Wherever these slopes change abruptly, we take that point as a breakpoint. This is a heuristic approach based on observation of the slopes.
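The slope-based search for breakpoints can be sketched as follows. In the text, abrupt changes are identified by observation; the sketch replaces that step with a fixed numeric `threshold`, which is an assumption introduced here for illustration, as are the function names and the synthetic data.

```python
def consecutive_slopes(xs, ys):
    """Slopes between consecutive points: n points give n - 1 slopes."""
    return [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
            for i in range(len(xs) - 1)]


def find_breakpoints(xs, ys, threshold):
    """Return the x-values at which the slope changes by more than `threshold`.

    `threshold` is a hypothetical tuning parameter standing in for the
    visual inspection of slopes described in the text.
    """
    slopes = consecutive_slopes(xs, ys)
    breakpoints = []
    for i in range(1, len(slopes)):
        if abs(slopes[i] - slopes[i - 1]) > threshold:
            # xs[i] is the point shared by the two adjacent segments.
            breakpoints.append(xs[i])
    return breakpoints


# Synthetic data (illustrative only): slope 1 up to x = 4, slope 5 afterwards.
xs = list(range(9))
ys = [0, 1, 2, 3, 4, 9, 14, 19, 24]

slopes = consecutive_slopes(xs, ys)
print(len(slopes))                               # 8
print(find_breakpoints(xs, ys, threshold=2.0))   # [4]
```

On noisy real counts, consecutive slopes fluctuate from day to day, so a fixed threshold would likely need smoothing or manual review before the detected points are accepted as breakpoints.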