1 Introduction

Bangladesh is one of the world’s most densely populated countries. According to the World Population Review 2020 (Footnote 1), it ranks eighth in the world by population and tenth by population density. The novel COVID-19 is currently spreading rapidly across the globe, and in most cases the number of infections is high in densely populated countries. As Bangladesh is one of the densest countries in the world, the virus has been spreading rapidly there. According to the Institute of Epidemiology, Disease Control and Research (IEDCR) (Footnote 2), the first COVID-19 case in Bangladesh was found on March 8, 2020 [25]. Since then, the number of infections has been increasing rapidly. Therefore, it is imperative to analyse the spread of COVID-19 in Bangladesh and to predict future cases.

Most of the existing studies on the spread and forecasting of COVID-19 [5, 10, 12, 20, 27, 36, 38, 40] have been performed solely on the basis of the COVID-19 situation itself. Only a few studies (e.g. [7, 22, 33]) have examined the effect of population density on the spread of coronavirus infection. Thus, in this study, we aim to investigate the effects of population density, literacy rate, and division on the spread and forecasting of COVID-19 infection in Bangladesh. We began by examining Bangladesh’s COVID-19 data together with the demographic variables, using both Fisher’s Exact test [29] and the ANOVA test [6]. For the forecasting analysis, we used Holt’s method [39] because it is very effective for data with a trend and no seasonality [14]. Furthermore, to reduce the forecasting error, we applied the Unreplicated Linear Functional Relationship (ULFR) model [41] to find the best smoothing constant (a parameter used in the forecasting process). The predicted values would help the government to prepare for potentially unprecedented situations in Bangladesh. The key contributions of this paper are threefold:

  • We investigate the association between COVID-19 and three demographic variables (population density, region/division and literacy rate) through Fisher’s Exact test.

  • We further examine whether the mean number of infected COVID-19 cases differs significantly across the various combinations of the three demographic variables through an ANOVA test.

  • We adopt Holt’s method to predict the number of infected cases in the epidemic peak region in Bangladesh. To reduce the forecasting error, we utilize a single level of smoothing constants in Holt’s method.

The rest of the paper is organized as follows. Section 2 presents related work. Our study methodology is presented in Section 3. Sections 4 and 5 describe the COVID-19 spreading analysis and the forecasting of infected cases, respectively, and Section 6 presents our spreading results and forecasting discussion. Finally, Section 7 concludes the paper and discusses future work.

2 Related Work

Over the past few months, a number of studies have analysed the impact of various factors, such as climate and demographic variables, on the spread of COVID-19 around the world.

Different variables have a significant impact on the spread of COVID-19 globally [4]. For instance, Ahmadi et al. [1] used the number of people infected with COVID-19 in Iran, population density, average temperature, average precipitation, humidity, and wind speed as the main parameters and tried to understand their effects on the spread of COVID-19 in Iran. Rader et al. [28] found that accessibility to COVID-19 testing sites in the USA increases with population density. Wang et al. [38] argued that climatic factors can play an important role in the new COVID-19 outbreak. Therefore, it is clear that the factors mentioned above have a significant and direct relationship with the number of infected people. To analyse the spread of COVID-19 and forecast cases, a number of models have been used, such as SIDR [9], DASS-21 [2], Fuzzy Clustering [22], and SEIR [26]. A summary of the existing work is presented in Table 1. Two studies, [10] and [13], used mathematical models to predict the number of infected cases. Others used Genetic Programming and regression models for short-term prediction [19, 30]. However, owing to small amounts of data or parameter selection problems, the outcomes of many of those models differ widely. Therefore, selecting the parameter values is key to model prediction.

Table 1 Key studies on spreading and forecasting COVID-19

While only a few studies have examined the COVID-19 situation in Bangladesh, none of them has used more than one quantitative analysis. Thus, in this paper, we have conducted two quantitative analyses: Fisher’s Exact test and the ANOVA test. Furthermore, we have predicted the number of infected cases in the epidemic peak region of Bangladesh, Dhaka division.

3 Methodology

Figure 1 shows the methodology used in this study. First, we collected the data. We used two types of data: (1) daily region/division-wise COVID-19 infected cases in Bangladesh from 5th April to 6th June 2020, and (2) demographic data of Bangladesh. The dataset is freely available to download from the GitHub repository (Footnote 3). The daily COVID-19 infected cases for Bangladesh were collected from IEDCR (Footnote 4), and the demographic data were collected from the Bangladesh Bureau of Statistics (BBS) (Footnote 5).

Fig. 1 The study methodology

After the data collection, we analyse the association between COVID-19 and demographic variables such as population density, literacy rate, and division, followed by the analysis of forecasting COVID-19 infected cases.

4 COVID-19 Spreading Analysis

A two-phase study was conducted, consisting of an initial Fisher’s Exact test followed by an ANOVA test. In the following subsections, we first present the Fisher’s Exact test, which is used to identify associations between the infected groups of COVID-19 and the demographic variables. We then present the ANOVA test, which is used to examine whether the mean number of infected COVID-19 cases differs significantly across the divisions, population density groups, and literacy rate classes in Bangladesh.

4.1 Fisher’s Exact Test

Fisher’s Exact test is used to investigate whether there is any association between the infected groups of COVID-19 and the demographic variables (divisions, literacy rate classes, population density groups) in Bangladesh. The variables and their corresponding categories used in this phase are presented in Table 2. According to Starnes et al. [35], the assumptions of the chi-square test [24] are: “No more than 20% of the expected counts are less than 5 and all individual expected counts are 1 or greater”. In our study, these assumptions are not satisfied. Therefore, instead of the chi-square test, this research uses Fisher’s Exact test [29].
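The expected-count rule quoted above can be checked mechanically before choosing between the two tests. The following is a minimal sketch in plain Python; the function names and the two example tables are hypothetical illustrations and are not the study's data.

```python
def expected_counts(table):
    """Expected cell counts under independence for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_assumptions_hold(table):
    """Rule quoted in the text: at most 20% of expected counts below 5, none below 1."""
    cells = [e for row in expected_counts(table) for e in row]
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return share_below_5 <= 0.20 and min(cells) >= 1


# Hypothetical counts (rows: density groups, columns: infected groups).
sparse_table = [[3, 1, 0], [2, 2, 1], [1, 3, 4]]
dense_table = [[30, 20], [25, 35]]

for name, table in [("sparse", sparse_table), ("dense", dense_table)]:
    verdict = "chi-square is acceptable" if chi_square_assumptions_hold(table) else "prefer an exact test"
    print(name, "->", verdict)
```

When the rule fails, as it does for the sparse table, an exact test is the usual fallback, which matches the choice described in this subsection.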

Table 2 Categories of demographical variables for Fisher’s Exact test as well as ANOVA test

4.2 ANOVA Test

The purpose of the Analysis of Variance (ANOVA) test [6] is to examine whether there is any significant difference in the mean number of infected COVID-19 cases across the eight divisions, five population density groups and two literacy rate classes in Bangladesh. The following seven statistical hypotheses were considered for the ANOVA:

H01 : There is no difference in the mean infected number of COVID-19 cases among the eight divisions in Bangladesh.

H02 : There is no difference in the mean infected number of COVID-19 cases among the five population density groups in Bangladesh.

H03 : There is no difference in the mean infected number of COVID-19 cases among the two literacy rate classes in Bangladesh.

H04 : There is no interaction effect of divisions and population density on the mean infected number of COVID-19 in Bangladesh.

H05 : There is no interaction effect of divisions and literacy rate on the mean infected number of COVID-19 in Bangladesh.

H06 : There is no interaction effect of population density and literacy rate on the mean infected number of COVID-19 in Bangladesh.

H07 : There is no combined effect of divisions, population density and literacy rate on the mean infected number of COVID-19 in Bangladesh.
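To make the main-effect hypotheses concrete, the sketch below computes the F statistic of a one-way ANOVA in plain Python. It covers a single factor only (for example, the population density groups of H02) and not the full three-factor model with interactions; the function name and the sample values are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples, one list per group."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical case counts for three groups.
f_stat, df1, df2 = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

The null hypothesis is rejected when F exceeds the critical value of the F distribution with the returned degrees of freedom at the chosen significance level.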

4.3 Pairwise Comparison: Tukey Test

From the results of the ANOVA test in Table 4, we found that there is a significant difference in the mean number of infected COVID-19 cases across the five population density groups in Bangladesh. Therefore, we run a post hoc test (Tukey test) to investigate which pairs of population density groups differ in terms of the mean number of infected COVID-19 cases. The ten statistical hypotheses used for the Tukey test are given below:

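Hypothesis for Pair 1

\( H_{0}: \mu_{low} = \mu_{semilow} \) vs. \( H_{a}: \mu_{low} \neq \mu_{semilow} \)

Hypothesis for Pair 2

\( H_{0}: \mu_{low} = \mu_{medium} \) vs. \( H_{a}: \mu_{low} \neq \mu_{medium} \)

Hypothesis for Pair 3

\( H_{0}: \mu_{low} = \mu_{semihigh} \) vs. \( H_{a}: \mu_{low} \neq \mu_{semihigh} \)

Hypothesis for Pair 4

\( H_{0}: \mu_{low} = \mu_{high} \) vs. \( H_{a}: \mu_{low} \neq \mu_{high} \)

Hypothesis for Pair 5

\( H_{0}: \mu_{semilow} = \mu_{medium} \) vs. \( H_{a}: \mu_{semilow} \neq \mu_{medium} \)

Hypothesis for Pair 6

\( H_{0}: \mu_{semilow} = \mu_{semihigh} \) vs. \( H_{a}: \mu_{semilow} \neq \mu_{semihigh} \)

Hypothesis for Pair 7

\( H_{0}: \mu_{semilow} = \mu_{high} \) vs. \( H_{a}: \mu_{semilow} \neq \mu_{high} \)

Hypothesis for Pair 8

\( H_{0}: \mu_{medium} = \mu_{semihigh} \) vs. \( H_{a}: \mu_{medium} \neq \mu_{semihigh} \)

Hypothesis for Pair 9

\( H_{0}: \mu_{medium} = \mu_{high} \) vs. \( H_{a}: \mu_{medium} \neq \mu_{high} \)

Hypothesis for Pair 10

\( H_{0}: \mu_{semihigh} = \mu_{high} \) vs. \( H_{a}: \mu_{semihigh} \neq \mu_{high} \)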
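The ten pairs are simply all unordered pairs of the five population density groups, since C(5, 2) = 10. As a small illustration, they can be enumerated in the same order as above with the standard library; the group labels are written out here as plain strings.

```python
from itertools import combinations

density_groups = ["low", "semi-low", "medium", "semi-high", "high"]

# All unordered pairs of groups, in the order used for Pairs 1 to 10.
pairs = list(combinations(density_groups, 2))

for number, (a, b) in enumerate(pairs, start=1):
    print(f"Pair {number}: H0: mu_{a} = mu_{b}  vs.  Ha: mu_{a} != mu_{b}")
```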

5 Forecasting Infected Cases

In existing studies, various models such as Ace-Mod (Australian Census-based Epidemic Model) [8], a clustering method [22], SEIR models [26] and others ([9], [34] and [3]) have been employed to forecast future infected cases. However, these models are limited in their ability to analyse time series data. Therefore, this study uses Holt’s method to predict the number of infected cases in the epidemic peak region. It utilizes a single level of smoothing constants to compare forecasting performance. Whereas most research chooses the smoothing constant by rule of thumb, this analysis uses the unreplicated linear functional relationship model to find the optimal smoothing constant.

5.1 Unreplicated Linear Functional Relationship (ULFR) Model

The linear regression model is commonly used to measure the functional relationship between a continuous dependent variable and an independent variable. Sometimes, however, this relationship is obscured by random variation in the variables. Fuller [11] pointed out that it is not feasible to treat the independent variable as free of error under all conditions.

$$ \begin{array}{@{}rcl@{}} Y_{i}= \beta_{a} + \beta_{f}X_{i} \end{array} $$
(1)

Equation 1 is the functional model, in which both the dependent and independent variables are subject to errors. Suppose that \(Y_{i}\) and \(X_{i}\) are unobservable dependent and independent variables, respectively, corresponding to the random variables \(y_{i}\) and \(x_{i}\), which are observed with errors \(\epsilon_{i}\) and \(\delta_{i}\), respectively, such that

$$ \begin{array}{@{}rcl@{}} y_{i} & =&Y_{i}+ \epsilon_{i} \\ x_{i} & =&X_{i}+ \delta_{i}, \quad i = 1, 2, \ldots, n. \end{array} $$
(2)

Moreover, the following conditions are assumed

$$ \begin{array}{@{}rcl@{}} E(\delta_{i})= E(\epsilon_{i})=0, \quad Var(\delta_{i})={\sigma^{2}_{d}}, \quad Var(\epsilon_{i})={\sigma^{2}_{e}}, \quad \forall i \\ Cov(\delta_{i}, \delta_{j})= Cov(\epsilon_{i}, \epsilon_{j})=0, \quad i\neq j\\ Cov(\delta_{i}, \epsilon_{j})=0, \quad \forall i, j \end{array} $$
(3)

Chang et al. [20] termed the model in Eq. 1 the unreplicated linear functional relationship (ULFR) model when there are only the two variables X and Y, and \(\delta_{i}\) and \(\epsilon_{i}\) are mutually independent, normally distributed random variables. The log-likelihood function is given by

$$ \begin{array}{@{}rcl@{}} L(\beta_{a}, \beta_{f}, {\sigma^{2}_{d}}, {\sigma^{2}_{e}}, X_{i}) &=& -n\ln(2\pi) -\frac{1}{2}n\left(\ln{\sigma^{2}_{d}}+\ln{\sigma^{2}_{e}}\right) -\sum\limits_{i=1}^{n}\frac{(x_{i}-X_{i})^{2}}{2{\sigma^{2}_{d}}} \\ && -\frac{1}{2{\sigma^{2}_{e}}}\sum\limits_{i=1}^{n}(y_{i}-\beta_{a}-\beta_{f}X_{i})^{2}. \end{array} $$
(4)

When the ratio of the error variances is known, that is \( \frac {{\sigma ^{2}_{e}}}{{\sigma ^{2}_{d}}}=\lambda \), the log-likelihood becomes

$$ \begin{array}{@{}rcl@{}} L(\beta_{a}, \beta_{f}, {\sigma^{2}_{d}}, {\sigma^{2}_{e}}, X_{i}) &=& -n\ln(2\pi)-\frac{1}{2}n\ln{\sigma^{2}_{d}}- \frac{1}{2}n\ln\left(\lambda {\sigma^{2}_{d}}\right) -\sum\limits_{i=1}^{n}\frac{(x_{i}-X_{i})^{2}}{2{\sigma^{2}_{d}}} \\ && -\frac{1}{2 \lambda {\sigma^{2}_{d}}}\sum\limits_{i=1}^{n}(y_{i}-\beta_{a}- \beta_{f}X_{i})^{2} \end{array} $$
(6)

The maximum likelihood estimators are then obtained by differentiating L with respect to the parameters \( \beta _{a}, \beta _{f}, \sigma _{d} \) and \( X_{i} \) and equating the results to zero:

$$ \begin{array}{@{}rcl@{}} \frac{\partial L}{\partial \beta_{a}} &=& - \frac{1}{2 \lambda \hat{\sigma}^{2}_{d} } \sum\limits_{i=1}^{n}2(y_{i}-\beta_{a}-\beta_{f}X_{i})(-1)=0 \end{array} $$
(7)
$$ \begin{array}{@{}rcl@{}} \frac{\partial L}{\partial \beta_{f}} &=& - \frac{1}{2 \lambda \hat{\sigma}^{2}_{d} } \sum\limits_{i=1}^{n}2(y_{i}-\beta_{a}-\beta_{f}X_{i})(-X_{i})=0 \end{array} $$
(8)
$$ \begin{array}{@{}rcl@{}} \frac{\partial L}{\partial X_{i}} &=& - \frac{1}{2 \hat{\sigma}^{2}_{d} }2(x_{i}-X_{i})(-1) -\frac{1}{2 \lambda \hat{\sigma}^{2}_{d}} 2(y_{i}- \beta_{a} -\beta_{f}X_{i})(-\beta_{f}) =0, \quad i = 1, 2, \ldots, n \end{array} $$
(9)
$$ \begin{array}{@{}rcl@{}} \frac{\partial L}{\partial \sigma_{d}} &=& - \frac{n}{\hat{\sigma}_{d}}-\frac{n}{\hat{\sigma}_{d}} + \sum\limits_{i=1}^{n}\frac{(x_{i}-X_{i})^{2}}{\hat{\sigma}^{3}_{d}} +\frac{1}{\lambda \hat{\sigma}^{3}_{d}} \sum\limits_{i=1}^{n}(y_{i}-\beta_{a}-\beta_{f}X_{i})^{2} =0 \end{array} $$
(10)

After simplification of the equations the maximum likelihood estimators of parameters \( \beta _{a}, \beta _{f}, {\sigma ^{2}_{d}}, X_{i} \) are as follows:

$$ \begin{array}{@{}rcl@{}} { \beta_{a}=\overline{y}-\beta_{f}\overline{x} } \end{array} $$
(11)
$$ \begin{array}{@{}rcl@{}} { \beta_{f}= \frac{(S_{yy}-\lambda S_{xx})+\left\{(S_{yy}-\lambda S_{xx})^2+4 \lambda S^2_{xy}\right\}^{\frac{1}{2}}}{2S_{xy}}} \end{array} $$
(12)
$$ \begin{array}{@{}rcl@{}} {{\sigma^{2}_{d}}= \frac{1}{n-2}\left\{\sum(x_{i}-X_{i})^{2}+\frac{1}{\lambda}\sum(y_{i}-\beta_{a}-\beta_{f}X_{i})^{2}\right\} } \end{array} $$
(13)
$$ \begin{array}{@{}rcl@{}} { X_{i}=\frac{\lambda x_{i}+\beta_{f}(y_{i}-\beta_{a})}{\lambda+{\beta^{2}_{f}}} } \end{array} $$
(14)

where \(\overline {y}=\frac {\sum y_{i}}{n}, \overline {x}= \frac {\sum x_{i}}{n}, S_{yy}= \sum (y_{i}-\overline {y})^2, S_{xx}=\sum (x_{i}-\overline {x})^2, S_{xy}=\sum (x_{i}-\overline {x})(y_{i}-\overline {y})\). For \(\lambda = 1\), the coefficient of determination of the ULFR model, \({R^{2}_{f}}\), is

$$ \begin{array}{@{}rcl@{}} {{R^{2}_{f}}=\frac{SS_{r}}{S_{yy}}} \end{array} $$
(15)
$$ \begin{array}{@{}rcl@{}} {SS_{r}= \frac{\beta_{f}(S_{yy}-S_{xx})+2\beta_{f}S_{xy}}{1+{\beta^{2}_{f}}}} \end{array} $$
(16)
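As a numerical illustration of the closed-form estimators, the sketch below evaluates the intercept, the slope, the error variance and the fitted true values for a known variance ratio. It assumes the forms \(\beta_{f} = [(S_{yy}-\lambda S_{xx}) + \{(S_{yy}-\lambda S_{xx})^{2} + 4\lambda S_{xy}^{2}\}^{1/2}]/(2S_{xy})\), \(X_{i} = (\lambda x_{i} + \beta_{f}(y_{i}-\beta_{a}))/(\lambda + \beta_{f}^{2})\) and an \(n-2\) divisor for the variance. The function name `ulfr_fit` and the example data are hypothetical, and the sketch does not reproduce how the study maps the fit to a smoothing constant.

```python
import math


def ulfr_fit(x, y, lam=1.0):
    """Closed-form ULFR estimates for a known error-variance ratio lam.

    Returns (beta_a, beta_f, sigma2_d, x_hat). Requires n > 2 and S_xy != 0.
    """
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    diff = s_yy - lam * s_xx
    beta_f = (diff + math.sqrt(diff ** 2 + 4 * lam * s_xy ** 2)) / (2 * s_xy)
    beta_a = y_bar - beta_f * x_bar

    # Estimated true values X_i, obtained by setting dL/dX_i to zero.
    x_hat = [(lam * xi + beta_f * (yi - beta_a)) / (lam + beta_f ** 2)
             for xi, yi in zip(x, y)]

    residual = (sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
                + sum((yi - beta_a - beta_f * xh) ** 2 for yi, xh in zip(y, x_hat)) / lam)
    sigma2_d = residual / (n - 2)
    return beta_a, beta_f, sigma2_d, x_hat


# Hypothetical error-free data on the line y = 2 + 3x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 + 3.0 * v for v in xs]
beta_a, beta_f, sigma2_d, x_hat = ulfr_fit(xs, ys, lam=1.0)
print(beta_a, beta_f, sigma2_d)
```

On error-free data the fit recovers the generating line and the estimated error variance is zero, which is a convenient sanity check before applying the estimators to noisy observations.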

5.2 Holt’s Method

Exponential smoothing was proposed in the late 1950s and has motivated several successful forecasting approaches [14]. Forecasts produced by exponential smoothing methods are weighted averages of past observations, with the weights decaying exponentially as the observations get older. Holt’s two-parameter model, also known as the linear exponential smoothing model, is a popular smoothing model for forecasting data with a trend and without seasonality. To produce a final forecast, Holt’s model uses three distinct equations that work together. The first, the level equation, is a basic smoothing equation that directly adjusts the last smoothed value for the preceding trend. The second, the trend equation, updates the trend, which is expressed as the difference between the last two smoothed values. The third, the forecast equation, produces the final forecast. Holt’s model uses two smoothing parameters, one for the overall (level) smoothing and one for the trend smoothing. Holt’s method is also called trend-enhanced exponential smoothing or double exponential smoothing [37]. The equations are as follows:

$$ \begin{array}{@{}rcl@{}} \textbf{Forecast equation: } {F_{t,k}= Y_{t} + kZ_{t} } \end{array} $$
(17)
$$ \begin{array}{@{}rcl@{}} \textbf{Level equation: } {Y_{t}= \alpha X_{t} + (1-\alpha)(Y_{t-1}+Z_{t-1})} \end{array} $$
(18)
$$ \begin{array}{@{}rcl@{}} \textbf{Trend equation: } { Z_{t}= \beta (Y_{t}-Y_{t-1}) + (1-\beta)Z_{t-1} } \end{array} $$
(19)

where \(F_{t,k}\) is the forecast made at time t for k periods ahead, \(Y_{t}\) is an estimate of the level at time t, \(Z_{t}\) is an estimate of the trend (slope) at time t, \(\alpha\) (0 ≤ α ≤ 1) is the smoothing parameter (SP) for the level and \(\beta\) (0 ≤ β ≤ 1) is the SP for the trend. The variable \(X_{t}\) is the observed value in period t, which serves as the estimate of the period t base level from the current period. The estimate of the period t base level based on previous data is \(Y_{t-1} + Z_{t-1}\). To compute \(Z_{t}\), a weighted average of the following two quantities is taken:

  (i) An estimate of the trend from the current period, given by the increase in the smoothed level from period (t-1) to period t.

  (ii) \(Z_{t-1}\), the previous estimate of the trend at time (t-1).

To start Holt’s method, an initial estimate of the level (call it \(Y_{0}\)) and an initial estimate of the trend (call it \(Z_{0}\)) are needed. Here, \(Z_{0}\) equals the average increase in the time series during the previous year and \(Y_{0}\) equals the last observation.
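The level, trend and forecast equations translate directly into a short loop. The following is a minimal sketch rather than the study's implementation: the function name `holt_forecast` is hypothetical, and for simplicity it assumes that the level starts at the first observation and the trend at the first difference, which differs from the initialisation described above.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Holt's linear method; returns (one-step-ahead fits, k-step-ahead forecasts).

    Assumption for this sketch: the initial level is the first observation and
    the initial trend is the first difference of the series.
    """
    level = series[0]
    trend = series[1] - series[0]
    fitted = []
    for x_t in series[1:]:
        fitted.append(level + trend)  # forecast made before observing x_t
        new_level = alpha * x_t + (1 - alpha) * (level + trend)   # level equation
        trend = beta * (new_level - level) + (1 - beta) * trend   # trend equation
        level = new_level
    forecasts = [level + k * trend for k in range(1, horizon + 1)]  # forecast equation
    return fitted, forecasts


# Hypothetical cumulative counts growing by 2 per period.
fitted, forecasts = holt_forecast([10, 12, 14, 16, 18], alpha=0.76, beta=0.24, horizon=3)
print(forecasts)
```

On an exactly linear series the method reproduces the line for any choice of the two smoothing parameters, so differences between parameter settings only appear when the data deviate from a straight trend.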

6 Result and Discussion

In this section, we present the findings of the spread of COVID-19 followed by the discussion on forecasting results.

6.1 Findings from Spreading Analysis

The result of Fisher’s Exact test is reported in Table 3. Based on these results, we analysed the association between the demographic variables and the spread of COVID-19 in Bangladesh. Although existing studies such as [21] and [23] reported a significant association between divisions and infected groups of COVID-19, our findings in Table 3 show no significant association between divisions and infected groups of COVID-19 (Fisher’s Exact test = 18.521, p-value = 0.063 > 0.05) at the 5% level of significance.

Table 3 Fisher’s exact test

Additionally, Table 3 shows that there is no significant association between literacy rate classes and infected groups of COVID-19 (Fisher’s Exact test = 0.676, p-value = 0.776 > 0.05) at the 5% level of significance. However, it shows a significant association between population density groups and infected groups of COVID-19 (Fisher’s Exact test = 14.686, p-value = 0.027 < 0.05) at the 5% level of significance. This finding supports the findings of [1], [28] and [32]. To measure the strength of the association between population density groups and infected groups of COVID-19, we use the Cramer’s V results in Table 3. From these results, it can be concluded that the association between population density groups and infected groups of COVID-19 is very strong (Cramer’s V = 0.374) and that it is significant (p-value = 0.027 < 0.05).
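For reference, Cramer’s V is computed from the chi-square statistic of the contingency table as \(V = \sqrt{\chi^{2} / (n \cdot \min(r-1, c-1))}\), where n is the total count and r and c are the numbers of rows and columns. The sketch below shows the calculation in plain Python; the function name and the example tables are hypothetical and are not the values behind Table 3.

```python
import math


def cramers_v(table):
    """Cramer's V for an r x c table of observed counts (no empty rows or columns)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


# Hypothetical tables: perfect association and no association.
print(cramers_v([[10, 0], [0, 10]]))
print(cramers_v([[5, 5], [5, 5]]))
```

V ranges from 0 (no association) to 1 (perfect association), which gives a scale against which a reported value can be read.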

The results of the ANOVA test and the Tukey test are available in Tables 4 and 5, respectively. From Table 4, we found that the p-value < α = 0.05 for the second hypothesis. Therefore, we reject H02 at the 5% level of significance and conclude that there is sufficient evidence of a significant difference in the mean number of infected COVID-19 cases across the five population density groups in Bangladesh. For all remaining hypotheses, the p-value > α = 0.05, so we do not reject the null hypotheses at the 5% level of significance. Thus, there is no evidence that the mean number of infected COVID-19 cases differs among the eight divisions or between the two literacy rate classes in Bangladesh. There is insufficient evidence of a significant interaction between divisions and population density, divisions and literacy rate, or population density and literacy rate. In addition, there is insufficient evidence of a significant combined effect of divisions, population density and literacy rate.

Table 4 ANOVA test
Table 5 Post hoc test — Tukey test

From Table 5, we found that the p-value < α = 0.05 for pairs 4, 7, 9 and 10. Therefore, we reject H0 at the 5% level of significance and conclude that the mean number of infected COVID-19 cases differs between the low and high, semi-low and high, medium and high, and semi-high and high population density groups in Bangladesh. For all remaining pairs, the p-value > α = 0.05; therefore, we do not reject H0 at the 5% level of significance and conclude that the mean number of infected COVID-19 cases is not different across the remaining pairs of population density groups in Bangladesh.

6.2 Discussion of Forecasting Results

In this section, we discuss the effectiveness of our proposed model and show the forecast trend of COVID-19 in Bangladesh, which is determined from the demographic data and the COVID-19 infected cases from 5th April to 6th June 2020. There are six parameter settings in our experiments: A = (α = 0.2, β = 0.8), B = (α = 0.8, β = 0.2), C = (α = 0.5, β = 0.5), D = (α = 0.05, β = 0.45), E = (α = 0.45, β = 0.05) and F* = (α = 0.76, β = 0.24), where F* is our proposed smoothing constant setting. Holt’s method is employed to forecast over the post-sample period from June 2020 to December 2021.

In our study, we choose Dhaka as the forecasting division because it has among the highest total numbers of confirmed COVID-19 cases and is highly populated. In Table 6, we provide detailed information about the different smoothing constant (SC) values and the forecast number of COVID-19 infected people in Dhaka at the end of 2021. The table also contains the mean absolute percentage error (MAPE) for each SC value. The forecast values for the different SC values show that the number of COVID-19 infected people will be around 0.5 million in Dhaka at the end of 2021. Table 6 also presents the MAPE, the original cumulative number and the forecast cumulative number of COVID-19 infected people for our proposed smoothing constant setting F*. The predicted values are closest to the original values for F*. The MAPE is 13.6% for F*, whereas for the other parameter settings the MAPE values are around 24%. Therefore, the smoothing constants of F* (α = 0.76, β = 0.24), which were determined by our ULFR model, provide the best forecasting results.
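The MAPE used to compare the smoothing constant settings is the average of the absolute forecast errors expressed as a percentage of the actual values. A minimal sketch follows; the function name and the numbers are hypothetical and do not come from Table 6.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    errors = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100 * sum(errors) / len(errors)


# Hypothetical actual and forecast cumulative counts.
print(mape([100, 200], [110, 180]))
```

A lower MAPE means that the forecasts track the observed series more closely, which is the criterion applied above when comparing settings A to F*.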

Table 6 Forecasting number of infected people for different parametric value

In Fig. 2, we compare the daily short-term forecasts of cumulative case counts. Table 6 gives forecast values of 683,131 and 394,233 for SC A and E, respectively, and of 498,575, 511,045 and 559,815 for SC B, C, and D, respectively. For SC F*, the forecast value at the end of December 2021 is 517,093. We also observe from Table 6 that the forecast for F* is close to those of B, C, and D, but its short-term accuracy is higher than that of all other smoothing constant values. According to Rohit et al. [31], COVID-19 is now increasing globally at a rate of 3% to 5% daily. The results in this paper indicate that the total number of infected cases will be around half a million in the Dhaka division at the end of 2021.

Fig. 2 Day-wise forecast number and original number of infected people

Overall, this study will help people to prepare and plan their regular life activities, as it provides awareness and understanding of the variables that can accelerate or slow the spread of COVID-19. While the study needs more data to make more precise predictions, our model could help to forecast confirmed COVID-19 cases if the spread of the virus does not change dramatically (that is, in ways beyond explanation). The implementation of advanced machine learning, deep learning algorithms and visualisation techniques could help to forecast and visualise confirmed COVID-19 cases accurately [15,16,17,18]. However, to the best of our knowledge, the proposed model is highly effective at the time of writing this paper.

7 Conclusion and Future Work

This research investigated the association between COVID-19 and demographic variables in Bangladesh. It made several contributions to the literature. First, this study used Fisher’s Exact test to investigate whether there is an association between the infected groups of COVID-19 and demographic variables such as divisions, literacy rate classes, and population density in Bangladesh. Second, it used the ANOVA test to examine whether there is any significant difference in the mean number of infected COVID-19 cases across the divisions, literacy rate classes, and population density groups in Bangladesh. Third, using Holt’s method, this research forecast the number of infected cases in the epidemic peak region, Dhaka division, by the end of 2021. Our results show that there is a significant association between population density groups and infected groups of COVID-19 in Bangladesh, and that this association is very strong and statistically significant. The ANOVA test indicates a statistically significant difference in the mean number of infected COVID-19 cases across the five population density groups in Bangladesh. Finally, the post hoc Tukey test finds that the high population density group differs significantly from the other groups in the mean number of infected COVID-19 cases.

In summary, our most recent forecasts, based on the last two months of data (5th April to 6th June 2020), remained relatively stable. The proposed model predicts that the epidemic has not yet reached its peak in Dhaka division, but that it would do so in December 2021. This likely shows the impact of population density on the spread of the virus. The awareness of educated people does not appear to reduce the spread of the virus at its peak level in Bangladesh. The forecasts presented are based on the assumption that current mitigation efforts will continue. During our research, we encountered several limitations, such as the availability of the required dataset and the implementation of other, potentially more accurate machine learning algorithms such as logistic regression and deep learning. The results will be more accurate with a much larger dataset in which all COVID-19 patient information is confirmed. In future work, we will explore a machine learning technique to investigate the association and to predict the number of infected cases in the epidemic peak region in Bangladesh.