Forecasting the Spread of COVID-19 Using Deep Learning and Big Data Analytics Methods

Kiganda, Cylas; Akcayol, Muhammet Ali

doi:10.1007/s42979-023-01801-5


Original Research
Published: 03 May 2023

Volume 4, article number 374, (2023)

SN Computer Science



Abstract

To contain the spread of the COVID-19 pandemic, there is a need for cutting-edge approaches that make use of existing technological capabilities. Forecasting its spread in a single country or in multiple countries ahead of time is a common strategy in most research. There is, however, a need for all-inclusive studies that cover all regions of the African continent. This study closes this gap by conducting a wide-ranging investigation and analysis to forecast COVID-19 cases and identify the most critical countries in terms of the COVID-19 pandemic in all five major African regions. The proposed approach leveraged both statistical and deep learning models, namely the autoregressive integrated moving average (ARIMA) model with a seasonal component, the long short-term memory (LSTM) model, and the Prophet model. In this approach, the forecasting problem was treated as a univariate time series problem using confirmed cumulative COVID-19 cases. The model performance was evaluated using seven performance metrics: the mean-squared error, root mean-square error, mean absolute percentage error, symmetric mean absolute percentage error, peak signal-to-noise ratio, normalized root mean-square error, and the R² score. The best-performing model was selected and used to make predictions for the next 61 days. In this study, the long short-term memory model performed the best. Mali, Angola, Egypt, Somalia, and Gabon from the Western, Southern, Northern, Eastern, and Central African regions, with an expected increase of 22.77%, 18.97%, 11.83%, 10.72%, and 2.81%, respectively, were the most vulnerable countries with the highest expected increase in the number of cumulative positive cases.


Introduction

The coronavirus disease (COVID-19) first appeared in Wuhan, Hubei Province, China, on December 31, 2019, when it was reported as a cluster of pneumonia cases. After a thorough analysis of the severity of the spread, the World Health Organization (WHO) declared COVID-19 a pandemic on March 11, 2020, as described in the WHO timeline of COVID-19 [4]. COVID-19 is caused by the SARS-CoV-2 virus and can infect anyone. In most cases, infected patients recover without intensive treatment. According to WHO [22], most individuals display mild to moderate symptoms. Individuals with chronic medical conditions, and the elderly in particular, are more likely to experience severe symptoms.

The COVID-19 virus spreads from person to person via small liquid particles released when an infected person coughs, sneezes, speaks, or breathes. When these particles settle on surfaces such as door handles, the virus can spread to others who touch these surfaces without taking the necessary precautions. To prevent the spread of this virus, WHO [22] recommends that people keep a 1-m distance from others, wash their hands frequently or use a disinfectant, wear a mask, and get vaccinated.

Various approaches have been deployed to prevent and control the spread of the COVID-19 pandemic. Among these strategies is the prediction of the spread of the COVID-19 virus. In this context, the spread of COVID-19 is considered a time series problem to which deep learning forecasting algorithms and big data statistical models are applied. Among the deep learning algorithms are the long short-term memory (LSTM) model as applied by Marzouk et al. [12], Hssayeni et al. [6], Yu et al. [24], Zeroual et al. [25], Pal et al. [14] and Shastri et al. [16]; the convolutional neural network (CNN) model as applied in research by Huang et al. [8], which performs well on image data such as X-ray images; the autoencoder model, which was applied by Hu [7]; gradient boosting, which provided the best results in research conducted by Zoabi et al. [27]; and the Prophet model, which was applied by P. Wang et al. [19] to perform epidemiological trend prediction. Big data statistical models include the autoregressive integrated moving average (ARIMA) model as applied by Gebretensae and Asmelash [5] and the susceptible-exposed-infectious-removed (SEIR) model, which has been shown to be a robust model for predicting the trend of COVID-19 as applied by Yang et al. [23]. Among the deep learning models used to perform time series prediction, the LSTM has been widely used due to its successful results in most research experiments. The ARIMA statistical model has also been widely applied in the health sector, for example, in a study by Y. W. Wang et al. [20] to predict the spread of hepatitis B disease, in the forecasting of medical service demand by Y. Huang et al. [9], and in the prediction of daily blood sampling room visits by Zhang et al. [26].

The following questions will be addressed by this research:

1. What is the best-performing prediction model given the COVID-19 cumulative positive cases data from African countries in five key regions?
2. Is it possible to estimate the total number of cumulative positive cases 61 days ahead of time using the best prediction model?
3. After a 61-day forecasting period, which countries on the African continent are in the most vulnerable position in terms of the COVID-19 virus's spread?

In this study, a comparative and analytical approach was followed to predict the spread of the COVID-19 virus. This approach includes two deep learning models, LSTM and Prophet, and one statistical model, ARIMA. In most studies, the ARIMA model does not include the seasonal component of the problem. In this study, however, it is modeled to take the seasonal component of the time series into consideration. The spread of COVID-19 was treated as a univariate time series problem using the number of COVID-19-positive cases. In “Model Selection Criteria”, the models used in this study are discussed in detail.

This study uses the African continent as a case study. In this comprehensive approach, the African continent was broken down into five major subregions: Northern, Southern, Eastern, Western, and Central Africa. While most studies focus on a single country or a few countries when predicting the spread of COVID-19, this study included all of the African continent’s regions. The best prediction model was selected using seven performance indicators: mean-square error (MSE), root mean-square error (RMSE), mean absolute percentage error (MAPE), symmetric mean absolute percentage error (SMAPE), R² score, normalized root mean-square error (NRMSE), and peak signal-to-noise ratio (PSNR). In “The Framework Of The Applied Approach”, the performance metrics are provided in detail. The best-performing model was then used to predict COVID-19 cases 61 days ahead. In “Results and Discussion”, the model results are provided and discussed in detail.

Related Work

In this section, prediction approaches and methods used in other research studies are addressed. These studies mainly concentrate on the prediction of the spread of COVID-19 using both statistical and deep learning tools.

In a research study by Gebretensae and Asmelash [5], the autoregressive integrated moving average (ARIMA) algorithm was used to forecast the spread of COVID-19 in Ethiopia. The autocorrelation function (ACF) and partial autocorrelation function (PACF) were used to obtain the model’s optimal terms. It was observed that the ARIMA (0, 1, 5) and ARIMA (2, 1, 3) models produced the best results. Ribeiro et al. [15] developed a stacking-ensemble learning algorithm that included ARIMA, cubist regression, random forest, and support vector regression. The Gaussian process was employed as a meta-learner, while the random forest, ridge regression, and other algorithms were utilized as base learners. The support vector regression algorithm produced the best results.

Abdulmajeed et al. [1] applied a deep learning ensemble method to predict COVID-19 cases in Nigeria. The emphasis in this study was on creating a prediction method that uses as little data as possible to give accurate predictions, because only limited training data was available for models to learn the COVID-19 spread. This approach combined four prediction methods: the ARIMA statistical model, the Prophet model (supported and provided by Facebook), the Holt–Winters exponential smoothing model, and the generalized autoregressive conditional heteroscedasticity (GARCH) model. The ARIMA model was applied in its non-seasonal form. To find the best ARIMA model, strategies such as brute-force search and inspection of the autocorrelation function and partial autocorrelation function plots were used.

Wang et al. [19] used a hybrid prediction strategy, combining the logistic and Prophet models, to predict the cumulative COVID-19 cases. With the Prophet model, the primary focus was on modeling non-periodic changes. The model inputs were the date and the total number of COVID-19 cases obtained from a specific country. In this hybrid method, the logistic model was used to identify the fastest-rising point in the data. The output of this model was then fed into the Prophet model, which was used to make the final forecast. Marzouk et al. [12] used three deep learning models to forecast the spread of COVID-19 in Egypt: the LSTM, convolutional neural network, and multilayer perceptron neural network. The COVID-19 data was modeled as time series data, and the LSTM outperformed the other two models.

Hssayeni et al. [6] used mobility data to predict the COVID-19 risk spread using the LSTM model and the gradient tree boosting model. In this study, it was discovered that the number of daily cases decreased in the retiree context, while it increased in the youth context. Yang et al. [23], on the other hand, used the susceptible-exposed-infectious-removed (SEIR) and LSTM models to forecast the spread of the COVID-19 pandemic in China. The SEIR algorithm was used to model epidemiological and mobility data by specifying three parameters. The parameter β was defined as the product of the daily number of people in contact with COVID-19 patients and the likelihood of transmission, σ was the amount of time it took for a COVID-19 patient to develop infection symptoms, and γ was the average mortality or recovery rate. The rate of pandemic spread in Hubei province was determined using these parameters, which were then fed into the LSTM model as input.

Zeroual et al. [25] used five models to predict new and recovered COVID-19 cases. The recurrent neural network, long short-term memory, bidirectional LSTM, gated recurrent units, and variational autoencoder were among the models used. The study was carried out in six different countries: Italy, Spain, France, China, the USA, and Australia. The variational autoencoder model produced the best results. The best model was used to forecast cases for the next 2 weeks. To forecast the positive COVID-19 outcome in a PCR test, Zoabi et al. [27] used the gradient-boosting algorithm in conjunction with the Shapley additive explanations (SHAP) bee-swarm plot. Sex, contact with COVID-19 patients, and the presence of the five most notable COVID-19 symptoms were all model input features. Techniques such as early stopping were used to improve the results.

Pal et al. [14] used the LSTM model and Bayesian optimization to determine COVID-19 risk categories. To obtain the hyperparameters, the search space had to be defined. The optimal hyperparameters were obtained and used by the model in the local trend prediction phase to perform country-specific predictions. Finally, a fuzzy rule-based risk categorization process was carried out, in which the data obtained from the previous module was used to determine each country’s risk status. This study concluded that weather had no significant impact on the spread of COVID-19.

Shastri et al. [16] conducted research on COVID-19 time series prediction and comparative analysis using variants of long short-term memory neural network models: bidirectional long short-term memory, convolutional long short-term memory, and stacked long short-term memory. Two countries, the USA and India, were used as case studies. Because models are sensitive to the scale of the input values, tools like MinMaxScaler were used to perform data normalization. Various regions of the USA and India were divided into initial, moderate, and severe groups based on the severity of the COVID-19 situation. Regions with a high number of COVID-19 cases were classified as severe. When compared to the other two models, the convolutional LSTM model produced the best results.

In the related literature, several models have been used to forecast the spread of COVID-19 in a couple of countries. However, the African continent has not been extensively studied in this regard. This study aimed to close this gap by applying the most successful model (LSTM) among the rest of the forecasting models to conduct an extensive investigation and analysis of African states from the five major regions of the continent. In addition, the most critical states with the highest expected COVID-19 increase rate from each region were identified for immediate action in the region.

Methods and Materials

Data Gathering

Africa’s Geographical Regions and Populations

The case studies used in this study included countries from the five major regions of the African continent. These regions, as depicted in Figure 1, include the Northern, Eastern, Southern, Central, and Western regions.

Much work on the COVID-19 pandemic has been done in the literature. In some research, several or individual African countries have been used as case studies, for example, research done by Abdulmajeed et al. [1]. In this study, the African continent is considered from a broader perspective, including countries from each of the major regions that make up the continent. This study performs a comparative analysis of the COVID-19 pandemic spread.

COVID-19 Data

The COVID-19 dataset used in this research was obtained from the Humanitarian Data Exchange [2]. The data for each country was first grouped by the country's geographical region: Northern, Southern, Central, Eastern, or Western Africa. Model fitting was then done for each country separately. Each country's data was split into training and testing datasets, with the training set accounting for 80% of the data and the testing set for the remaining 20%.
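The 80/20 split described above can be illustrated with a minimal sketch. It assumes the split is chronological (no shuffling), which is the usual choice for time series but is not stated explicitly in the text. The function name `chronological_split` and the sample values are hypothetical.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a time series into train/test parts, preserving time order.

    Assumption: the earliest observations form the training set and the
    most recent ones form the test set (no shuffling).
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Hypothetical cumulative case counts, for illustration only.
cumulative_cases = [10, 12, 15, 21, 30, 42, 55, 70, 88, 110]
train, test = chronological_split(cumulative_cases)
print(len(train), len(test))
```

Keeping the test set at the end of the series means the models are always evaluated on dates later than any date they were fitted on.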

ARIMA Model

The ARIMA model is made up of three main parts: the “AR,” “I,” and “MA” terms. As mentioned by Noureen et al. [13], the “AR” term refers to the autoregressive component, which indicates that the variable under consideration has a linear relationship between its present and prior values. That is, an AR(1) model implies that the current data point in the series depends directly on the immediately preceding data point, while an AR(2) model implies that it depends on the two preceding data points, as noted by Kırbaş et al. [10]. The “I” component stands for the integrated element, which represents the differencing between current data points and their preceding values. This part of the ARIMA model handles the stationarity requirement of ARIMA time series processing, which is attained by the differencing process, as explained by Noureen et al. [13]. Stationarity refers to the condition in which the mean and variance of the time series are constant over time. The last part of the basic ARIMA structure is the “MA” part, which represents the moving average. This component expresses the current value as a linear combination of the error values at past intervals in the time series, as denoted by Ribeiro et al. [15]. The standard notation of the basic ARIMA model is ARIMA (p, d, q), where p, d, and q represent the autoregressive, differencing, and moving average orders, respectively, as described by Abdulmajeed et al. [1]. The AR (p) term can be written as shown in Eq. 1.

$$Y_{t} = \delta + \varphi_{1} Y_{t - 1} + \varphi_{2} Y_{t - 2} + \cdots + \varphi_{p} Y_{t - p} + \varepsilon_{t} .$$

(1)

In the above equation, Y_t denotes the time series value at time point t, p denotes the autoregressive order, δ is a constant, φ_1, …, φ_p are the autoregressive coefficients, and ε_t is the error term. The moving average component can be defined mathematically as in Eq. 2.

$$Y_{t} = \mu + \varepsilon_{t} + \theta_{1} \varepsilon_{t - 1} + \theta_{2} \varepsilon_{t - 2} + \cdots + \theta_{q} \varepsilon_{t - q} .$$

(2)

In Eq. 2, q depicts the order of the moving average term. The difference term d can be obtained from Eq. 3.

$$\Delta Y_{t} = Y_{t} - Y_{t - 1} = Y_{t} - LY_{t} .$$

(3)

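In Eq. 3, ∆Y_t denotes the differenced time series value at time t, and L is the lag operator, so that LY_t = Y_{t−1}. Differencing is applied d times until the series is stationary.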
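As a small illustration of the differencing step in Eq. 3, the sketch below applies first-order differencing d times to a list of values. The function name `difference` and the sample numbers are hypothetical. For cumulative case counts, one round of differencing yields the new cases per period.

```python
def difference(series, d=1):
    """Apply first-order differencing d times.

    Each pass replaces Y_t with Y_t - Y_{t-1}, so the output is one
    element shorter per pass.
    """
    out = list(series)
    for _ in range(d):
        out = [curr - prev for prev, curr in zip(out, out[1:])]
    return out


# Hypothetical cumulative counts, for illustration only.
cumulative = [100, 110, 125, 145, 170]
new_cases = difference(cumulative, d=1)    # period-over-period increase
second_diff = difference(cumulative, d=2)  # change in the increase
print(new_cases, second_diff)
```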

$$\left( 1 - \varphi_{1} L - \varphi_{2} L^{2} - \cdots - \varphi_{p} L^{p} \right)\Delta^{d} Y_{t} = \delta + \varepsilon_{t} + \theta_{1} \varepsilon_{t - 1} + \cdots + \theta_{q} \varepsilon_{t - q} .$$

(4)

Equation 4 combines the equations for the basic ARIMA model terms and gives the full ARIMA (p, d, q) model.

The partial autocorrelation function (PACF) and autocorrelation function (ACF) graphs, as shown in Fig. 2, can also be used to obtain the ARIMA model's p and q terms. The ACF plot is a graphical representation of the correlation between the data and its prior values at different lag intervals. The PACF differs in that it shows the correlation at each lag after the effect of the shorter lags has been removed, as explained in the research by Noureen et al. [13].
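To make the ACF concrete, the following sketch computes the sample autocorrelation of a series at lags 0 to `max_lag` using the standard estimator (lagged products of the mean-centered series divided by the lag-0 sum of squares). The function name `sample_acf` and the example series are hypothetical. This is not the plotting routine used in the study, only an illustration of the quantity an ACF plot displays.

```python
def sample_acf(series, max_lag):
    """Return the sample autocorrelation at lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    acf = []
    for k in range(max_lag + 1):
        # Sum of products of values that are k steps apart.
        num = sum(centered[t] * centered[t - k] for t in range(k, n))
        acf.append(num / denom)
    return acf


# A strictly alternating series is strongly negatively correlated at lag 1.
alternating = [1, -1, 1, -1, 1, -1]
acf_values = sample_acf(alternating, max_lag=2)
print(acf_values)
```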

Prophet Model

The Prophet model is a deep learning model for time series forecasting. Facebook created and maintains this model as an open-source project. According to Taylor and Letham [18], it is based on the general specification of a generalized additive model (GAM), which is a regression model whose linear predictor depends on smooth functions of the predictor variables. GAMs can be represented using Eq. 5.

$$g\left( E\left( Y \right) \right) = \beta_{0} + f_{1}\left( x_{1} \right) + f_{2}\left( x_{2} \right) + \cdots + f_{m}\left( x_{m} \right).$$

(5)

In Eq. 5, Y represents the univariate response variable, x₁, …, x_m represent the predictor variables, and f₁, …, f_m represent the smoothing functions. Due to its use of the GAM formulation, the Prophet model has a variety of benefits, including flexibility and quick fitting times. It evaluates a time series problem from three perspectives: the trend, seasonality, and holiday components, as discussed in research carried out by Taylor and Letham [18]. The trend component takes into account the tendency of the time series data to increase or decrease over time. Seasonality, on the other hand, captures periodic changes that recur over shorter time periods.

$$y(t) = g(t) + s(t) + h(t) + \varepsilon_{t} .$$

(6)

The final predicted value y(t) is obtained from a combination of the trend, seasonal and holiday component functions as shown in Eq. 6 above, where ε_t represents the changes that are not captured by the model [18].

LSTM Model

The LSTM model is composed of three main core components. These include the forget gate, input gate, and output gate [16]. The forget gate identifies the degree to which past data is obliterated. The input gate receives the data that is taken into the cell’s internal state, while the output gate is used to create the next hidden state or output that is obtained from the existing internal state value.

The above figure displays the major building blocks of the LSTM model, which consist of the forget gate, input gate, and output gate, as described by Le and Lee [11]. Activation functions such as the tanh and sigmoid functions are applied within these gates while the model weights are learned.
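The interaction of the three gates can be sketched with a single time step of a scalar LSTM cell, using the standard LSTM update equations. This is a minimal illustration, not the network used in the study: the function `lstm_step`, the `params` layout, and all numeric values are hypothetical, and a real model would use weight matrices and a deep learning library.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    params maps each gate name to a tuple (w_x, w_h, b); the layout is
    an assumption made for this sketch.
    """
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("forget"))       # share of old state to keep
    i = sigmoid(pre_activation("input"))        # share of new candidate to admit
    g = math.tanh(pre_activation("candidate"))  # candidate state value
    o = sigmoid(pre_activation("output"))       # share of state to expose
    c = f * c_prev + i * g                      # new internal (cell) state
    h = o * math.tanh(c)                        # new hidden state / output
    return h, c


# With all weights and biases at zero, every gate evaluates to 0.5.
zero_params = {name: (0.0, 0.0, 0.0)
               for name in ("forget", "input", "candidate", "output")}
h_new, c_new = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
print(h_new, c_new)
```

In the zero-weight example the forget gate keeps half of the previous cell state and the candidate contributes nothing, which shows how the gates scale what is retained and what is exposed.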

Model Selection Criteria

In this study, seven metrics were adopted to assess the predictive performance of the models: the peak signal-to-noise ratio (PSNR), mean-squared error (MSE), root mean-square error (RMSE), symmetric mean absolute percentage error (SMAPE), mean absolute percentage error (MAPE), normalized root mean-square error (NRMSE), and R² score.

Mean-Square Error

The mean-squared error can be calculated numerically as below.

$${\text{MSE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} \left( \hat{Y}_{i} - Y_{i} \right)^{2}$$

(7)

In Eq. 7, $n$ is the total number of observations, $Y_{i}$ is the actual value, and $\hat{Y}_{i}$ is the predicted value.

Root Mean-Square Error

The RMSE can be calculated using Eq. 8.

$${\text{RMSE}} = \sqrt{\frac{1}{n}\sum\limits_{i = 1}^{n} \left( \hat{Y}_{i} - Y_{i} \right)^{2}}$$

(8)

In Eq. 8, $n$ is the total number of observations, $Y_{i}$ is the actual value, and $\hat{Y}_{i}$ is the predicted value.

Mean Absolute Percentage Error

Equation 9 can be used to represent this performance measure numerically.

$${\text{MAPE}}\, = \,\frac{100\% }{n}\mathop \sum \limits_{t = 1}^{n} \left| {\frac{{A_{t} - F_{t} }}{{A_{t} }}} \right|.$$

(9)

In Eq. 9, $A_{t}$ represents the observed value, $F_{t}$ the forecasted value, and $n$ the total number of data points.

Symmetric Mean Absolute Percentage Error

Equation 10 can be used to represent this measurement numerically.

$${\text{SMAPE}} = \frac{100\% }{n}\sum\limits_{t = 1}^{n} \frac{\left| F_{t} - A_{t} \right|}{\left( \left| A_{t} \right| + \left| F_{t} \right| \right)/2}.$$

(10)

In Eq. 10, $A_{t}$ represents the observed value, $F_{t}$ the forecasted value, and $n$ the total number of observations.

Peak Signal-to-Noise Ratio

$$PSNR = 20\log_{10} \left( \frac{MAX_{f}}{\sqrt{MSE}} \right).$$

(11)

The highest signal value is expressed by $MAX_{f}$ in Eq. 11. $MSE$ stands for mean-square error.

Normalized Root Mean-Square Error

$$NRMSE = \frac{RMSD}{{Y_{max} - Y_{min} }}.$$

(12)

In Eq. 12, RMSD is the root mean-square deviation, which is also known as the RMSE (Eq. 8), and $Y_{max}$ and $Y_{min}$ are the maximum and minimum observed values (Fig. 3).

R2 Score

$$R^{2} = 1 - \frac{\sum\nolimits_{i} \left( y_{i} - f_{i} \right)^{2}}{\sum\nolimits_{i} \left( y_{i} - \bar{y} \right)^{2}}.$$

(13)

In Eq. 13, $f_{i}$ represents the predicted values, $y_{i}$ the original values, and $\bar{y}$ the mean of the original values.
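The seven metrics in Eqs. 7 to 13 can be computed together, as in the following sketch. It rests on several assumptions that the text does not spell out: the peak value in the PSNR is taken as the maximum of the actual values, R² is computed in its standard form (one minus the ratio of residual to total sum of squares), and the actual values are non-zero and not all equal. The function name `forecast_metrics` and the sample data are hypothetical.

```python
import math


def forecast_metrics(actual, predicted):
    """Compute MSE, RMSE, MAPE, SMAPE, PSNR, NRMSE, and R2 for two sequences."""
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(actual, predicted))
    mse = sum((p - a) ** 2 for a, p in pairs) / n
    rmse = math.sqrt(mse)
    # MAPE is undefined if any actual value is zero.
    mape = 100.0 / n * sum(abs((a - p) / a) for a, p in pairs)
    smape = 100.0 / n * sum(abs(p - a) / ((abs(a) + abs(p)) / 2) for a, p in pairs)
    # Assumption: the peak signal value is the maximum actual value.
    psnr = 20 * math.log10(max(actual) / rmse) if rmse > 0 else math.inf
    nrmse = rmse / (max(actual) - min(actual))
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in pairs)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": rmse, "MAPE": mape, "SMAPE": smape,
            "PSNR": psnr, "NRMSE": nrmse, "R2": r2}


# Hypothetical values, for illustration only.
actual = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]
metrics = forecast_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Lower values are better for the five error metrics, while higher values are better for the PSNR and R².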

The Framework of Applied Approach

As shown in Fig. 4, the major stages of this study are splitting the preprocessed cumulative positive COVID-19 case data into 80% training and 20% testing datasets, fitting the models, validating model performance using the performance metrics, and then selecting the best-performing model and using it to forecast the positive COVID-19 cases for the next 61 days.
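The selection stage of this framework can be sketched as follows. This is a simplified illustration under stated assumptions: models are ranked by RMSE alone, whereas the study considers seven metrics, and the expected increase is assumed to be the relative change between the last observed cumulative count and the final forecast value. The function names and all numbers are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean-square error between two equal-length sequences."""
    return math.sqrt(
        sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def select_best_model(actual, predictions_by_model):
    """Return the name of the model with the lowest test RMSE, plus all scores."""
    scores = {name: rmse(actual, preds)
              for name, preds in predictions_by_model.items()}
    best = min(scores, key=scores.get)
    return best, scores


def expected_increase_percent(last_observed, final_forecast):
    """Relative change from the last observed count to the final forecast."""
    return 100.0 * (final_forecast - last_observed) / last_observed


# Hypothetical test data and model outputs, for illustration only.
test_actual = [100, 110, 120]
test_predictions = {
    "LSTM": [101, 109, 121],
    "ARIMA": [90, 100, 110],
    "Prophet": [130, 140, 150],
}
best_model, rmse_scores = select_best_model(test_actual, test_predictions)
increase = expected_increase_percent(last_observed=1000, final_forecast=1200)
print(best_model, increase)
```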

Rationale for the Selected Models

This section aims to address the reasons for choosing the LSTM, ARIMA, and Prophet models to perform the prediction and forecasting of the COVID-19 cumulative positive cases data for the various African countries in this study.

LSTM

This model is a special class of recurrent neural network with the capability to identify and learn the relationships that exist within a given series of data observations, as described in the research by Yu et al. [24]. This is possible because the LSTM has memory modules that act as a connection between past and current data points. Important data points with strong desired insights are retained, while those with weaker weights are discarded in the forget module of the LSTM model. This both focuses the model on extracting the dependence that exists within a given input sequence and minimizes the error by eliminating noise points from the learned data at this stage. As described by Zeroual et al. [25], the LSTM model mitigates the vanishing gradient problem faced by traditional recurrent neural networks, whereby the computed gradient becomes either too large or too small during the training phase. The LSTM model addresses this problem with the help of the activation vectors used in the forget gate, which regulate the gradient values. Because the cell state is updated additively, the LSTM model can identify the optimal terms to adjust at a given time step, which improves accuracy and overall performance. The LSTM model implementation provides several hyperparameters, such as the batch size and number of epochs, which can be easily adjusted to obtain better results. This makes it easy to fit and use the LSTM model to achieve accurate results. These robust qualities make the LSTM model well suited to the time series prediction task.

ARIMA

This is a statistical method that uses regression in which past data points and errors are connected using weight factors, which improves the overall prediction results, as described in the research by Singh et al. [17]. This model also combines the strengths of both the autoregression and moving average models, which makes it a robust choice for extracting the inherent statistical relationship between the dependent and independent variables. It is a flexible model to use, since it incorporates the difference between past and present data points, which enables it to process non-stationary data using a few parameters, as described by Abdulmajeed et al. [1]. Another factor is that the optimal parameter terms of this model are easy to obtain using simple methods like the PACF and ACF plots, as described in the research by Gebretensae and Asmelash [5]. Also, metrics such as the Akaike information criterion and the Bayesian information criterion make it possible to measure how good the ARIMA model is for a given combination of hyperparameter terms, which makes it easier to refine the prediction results. This model can process data with seasonal trends by extending the hyperparameter terms to include the seasonal factors, as explained by Y. Wang et al. [21]. This makes it possible to capture any seasonal relationship within the COVID-19 dataset at any given time.

Prophet

According to Abdulmajeed et al. [1], this is an additive regression model supported by Facebook with a robust architecture that takes into account seasonal dynamics within a given data sequence, such as yearly, weekly, and daily trends. It also handles missing data points and extreme values well, since it can identify data anomalies, as described by Y. Wang et al. [21]. This makes it an ideal solution for processing and predicting the COVID-19 datasets of countries whose data has sharp spikes away from the general trend. According to research by Taylor and Letham [18], the Prophet model has built-in computational support for non-linear growth curves that reach a natural boundary and also offers flexibility in tuning, such as smoothing features that capture and model seasonality in the data to fit historical cycles well. It is also easy to capture and model the effects of events such as holidays in the time series data with the Prophet model using limited data [18]. These qualities make this model appropriate for predicting the COVID-19 spread.

Results and Discussion

In this study, countries from the African continent were grouped into the five groups named in “Data Gathering”. Three forecasting models were used, including the ARIMA, LSTM, and Prophet. In this section, the performance results obtained from these models are given for each region of Africa.

Model Training and Testing

Northern Africa

In the Northern region of Africa, of the six countries studied, the most populous country is Egypt, as shown in Fig. 2, with a population of 102,334,404, while the least populous country is Mauritania, with a population of 4,649,658, as reported by Worldometer [3].

In Fig. 5, it can be seen that Morocco has maintained the highest number of COVID-19 cases over time, followed by Tunisia. Mauritania, on the other hand, has the lowest number of cases over time compared to the other states in this region.

Libya shows a gradual increase in cases between October 2020 and July 2021. Beyond July, a sharp increase that slows toward October is observed. This clearly describes the first wave of COVID-19 cases in Libya. Algeria's trend is similar to that of Libya. However, the cases in Algeria reach a constant number, while in Libya they continue to increase.

According to Fig. 6, the LSTM model fits better than both the ARIMA and Prophet models. In Tunisia, the Prophet model performs the worst in predicting the test data: while the test data flattens to a constant case value, the Prophet model predicts a sharp increase to over 800,000 cases. In Egypt and Tunisia, the ARIMA and Prophet models predicted lower and higher cases, respectively, than the actual data. In the four other countries, both models predicted lower cumulative positive cases than the actual data. This confirms the poor performance of these two models when compared to the LSTM model, which predicts results close to the actual data in every country except Egypt.

In Table 1, larger PSNR and R² values indicate better relative model performance.

Table 1 Performance parameters of the models for Northern Africa


Central Africa

In this region, five states were studied. At the time of this study, the most populous state in this group was Cameroon, with a population of 26,545,863 [3]. The least populous state was São Tomé and Príncipe, with a population of 219,159.

Fig. 7 shows the cumulative COVID-19 cases for the five countries in this region. According to this graph, COVID-19 cases in Cameroon are higher than in the rest of the countries, with more than two significant waves. Cameroon is followed by Gabon, which also has more than two waves. The rest of the countries maintain a nearly constant curve, with minor increases in COVID-19 cases. The lowest number of cases is seen in São Tomé and Príncipe. A positive correlation is observed between population and the number of cases: the highest number of cases is observed in Cameroon, which is also the most populated state in this region [3], and the lowest number of cases is observed in São Tomé and Príncipe, the country with the smallest population. This makes Cameroon the member with the highest risk in terms of COVID-19 spread in this region.
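The relationship between population and case counts described above could be quantified with a correlation coefficient. The sketch below computes the Pearson coefficient in plain Python; it is an illustration only, and the study does not state that this statistic was calculated. The function name `pearson_r` and the input values are hypothetical placeholders, not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Made-up placeholder values (population in millions, cases in thousands).
population = [0.2, 2.0, 5.0, 16.0, 26.0]
cases = [4.0, 35.0, 12.0, 6.0, 105.0]
print(round(pearson_r(population, cases), 3))
```

A coefficient near +1 would support the positive relationship described in the text, while a value near 0 would indicate that population alone explains little of the variation in case counts.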

Table 2 Performance parameters of the models for Central Africa


Figure 8 shows a plot of the model performance after prediction of the test data in various countries in the Central African region. In three countries, the LSTM model prediction generally matches the actual data well. This implies that the best performance in this region was achieved by the LSTM model. The worst model performance is given by the Prophet model, for example in Cameroon. In Chad, the ARIMA model performs relatively well in predicting the data, while in the rest of the countries, it comes immediately after the LSTM model.

Southern Africa

Ten countries from this region were used in this study. As shown in Fig. 3, the most populous country in this region is South Africa, with a population of 59,308,690. The least populated, on the other hand, is Eswatini, with a population of 1,160,164.

Fig. 9 clearly shows that South Africa has the highest number of cases compared to other countries in the same region. This shows how fast the COVID-19 virus spreads in this country and puts the neighboring countries in the region at a very high risk of increased rates of spread. While the other countries in the region are experiencing their second wave of virus spread, South Africa is observed to have had three waves. Since South Africa also has the largest population, there is a positive correlation between the large number of cases observed and the large population.

For clarity, South Africa was excluded from Fig. 10 to allow a comparative analysis of the COVID-19 state in the other countries in the same region. Apart from South Africa, Zambia has the largest number of cases and is also the first country to show an increase in the number of cases. All countries have had their second major wave of COVID-19 spread. The lowest number of cases was observed in Lesotho. Beyond October, the curves flatten in all countries, with the number of cases remaining constant. This signifies the effect of measures to control the spread, such as quarantines and vaccinations.

In Fig. 11, the LSTM model provided the best-matching prediction results in three countries (Botswana, Malawi, and Mozambique). In Lesotho, the ARIMA model performed better than the other two models. The Prophet model emerged as the worst performer in four countries: Malawi, Mozambique, Eswatini, and Lesotho. In these countries, it predicts a roughly constant number of cases, with only slight increases. In Angola, both the LSTM and Prophet models produced predictions close to the actual data, while the ARIMA model predicted a lower number of cases that was still reasonably close to the actual data. Angola is the country in which the three models show the greatest uniformity in their predicted results. This can be attributed to the smooth rise in the number of cases in Angola, which makes it easier for all the models to capture the underlying relationships and trends in the data and make better predictions.

In Fig. 12, the ARIMA model performed the worst when compared to the other models. Its predictions were generally higher than the actual data, and in all four countries it predicted a higher number of cases than the other two models. The LSTM model provided the best-matching predictions, followed by the Prophet model with the second-best prediction performance. Overall, in the Southern African region, the LSTM model provided the best prediction results compared to the ARIMA and Prophet models, as shown in Figs. 11 and 12, while the worst prediction results were obtained from the ARIMA model.

Table 3 displays the performance metrics used to determine the best prediction model in the Southern African region.

Table 3 Performance parameters of the models for Southern Africa


Western Africa

In this study, 12 countries from this region were used as case studies. In the Western region, Nigeria is the country with the largest population, with a total of 206,139,589 people. Guinea-Bissau, on the other hand, has the smallest population, at 1,968,001 [3].

Fig. 13 gives a comparative plot of the 12 countries used in this study from the Western region of Africa. It shows the state of the COVID-19 pandemic in each of the 12 countries, as well as the severity of the risk in terms of COVID-19 spread, as given by the cumulative positive cases. Between January 2020 and April 2020, no COVID-19 cases were reported in this region. After April 2020, the first cases began to be reported. Following this, about four countries, including Nigeria, Ghana, Senegal, and Mali, show a sharp increase in the number of cases, while the other countries show a gradual increase. Nigeria, followed by Ghana and Senegal, displays the highest number of cases over time. Nigeria, being the most populated country with over 200 million people and the highest number of cases, is the highest-risk member of this region. If immediate measures are not taken, there is a higher chance of a faster spread to other countries as well.

Figure 14 displays the prediction results of the three models in the region of Western Africa. In this first group of countries from the region, the LSTM model outperformed the other two models in producing the best-matching prediction results. This can be clearly observed in countries like Guinea, Guinea-Bissau, Gambia, Ghana, and Togo. In Burkina Faso, the Prophet model makes the most successful prediction. The ARIMA and Prophet models make matching predictions in three countries: Guinea-Bissau, Ghana, and Togo. These predictions suggest a lower COVID-19 case number than the actual data, which is further evidence that these two models perform poorly compared to the LSTM model.

Fig. 15 gives the second group of model predictions in the Western region of Africa. According to this figure, the best prediction performance in Niger is obtained from the Prophet model. This is the only country in this group where the Prophet model performs best. The ARIMA model did not display top performance in any of the countries. Apart from Niger, the LSTM model maintains the best-matching prediction results across the six countries in this group, which continues to affirm the LSTM model as the top-performing model in this region. In Nigeria, the ARIMA and Prophet models make predictions that match each other but are still lower than, and significantly different from, the actual data. These results indicate that the LSTM model is the best prediction model in the West African region.

In Table 4, the prediction results based on the seven metrics used in this study for the three models are provided for the 12 countries from the Western region of Africa.

Table 4 Performance parameters of the models for Western Africa


Eastern Africa

From this region, 12 countries were studied. Among these, the Comoros is the least populated country, with a population of 869,601, while the most populated country is Ethiopia, with a population of 114,963,588 at the time of this study.

The cumulative positive COVID-19 cases for the countries in the Eastern region of Africa are plotted in Fig. 16. In this region, the highest number of cases is observed in Ethiopia, followed by Kenya. The population of Kenya, at 53,771,296 people, ranks immediately after that of Ethiopia, and its number of cumulative cases likewise ranks immediately after that of Ethiopia, which suggests a roughly positive correlation between the population size and the number of confirmed cases. If proactive measures are not applied, the Eastern region is at a higher risk of experiencing a surge in the spread of COVID-19. The first cases occurred relatively late in this region: significant numbers of cases started to be registered only after July 2020 in all countries. Kenya is observed to have the highest number of waves of COVID-19 spread. Apart from Ethiopia, Kenya, Uganda, Rwanda, Madagascar, and Sudan, the countries show a relatively slow increase in the number of reported cases. This may be due to the varying measures taken by the respective countries and by the general population, for example in the Comoros, the least populated country in this region.

Figs. 16 and 17 display the prediction results from the LSTM, ARIMA, and Prophet models in the 12 countries used in this study from the Eastern region of Africa. These results show both the data predicted by the models and the actual data. It is observed from Fig. 16 that all three models performed relatively well in the Comoros, followed by Sudan, as displayed in Fig. 17. In the rest of the countries, in both figures, the three models show significant relative discrepancies in performance. In Fig. 16, both the LSTM and ARIMA models obtained better-matching prediction results than the Prophet model in Madagascar, and the worst model performance is observed from the Prophet model in both Djibouti and Madagascar. The best model performance is obtained by the LSTM model in all countries represented in the same figure. In Fig. 17 too, the LSTM model has the overall best-matching prediction results compared to the ARIMA and Prophet models. In both Mauritius and Rwanda, the worst model performance is observed from the ARIMA and Prophet models, whose predictions deviated strongly from the actual data. These results indicate that the LSTM model outperformed the ARIMA and Prophet models in the Eastern region.

In Table 5, the three model performances have been given for the 12 countries from the Eastern region of Africa.

Table 5 Performance parameters of the models for Eastern Africa


Figure 18 displays the overall combined model performance from all individual regions used in this study. It shows the percentage distributions in both the positive and negative directions to quantify each model's performance according to its contribution to the total value of each of the seven metrics used in this study. For PSNR and R, good performance is indicated by a distribution that lies further in the positive direction, and bad performance by a more negative percentage distribution. For the RMSE, MAPE, NRMSE, SMAPE, and MSE errors, good performance corresponds to a small positive percentage distribution, and bad performance to a large positive percentage distribution.

The RMSE, MAPE, NRMSE, SMAPE, and MSE metrics show that the overall best performance in this study was obtained by the LSTM model, followed by the ARIMA model, and lastly, the Prophet model. The LSTM model obtained the smallest percentage distribution of the total error in all five of these metrics. The ARIMA model follows, with larger percentage distributions than the LSTM model but smaller ones than the Prophet model. The PSNR and R values also show that the LSTM model outperforms the other two models. Both the PSNR and R values for the LSTM model tend toward the positive direction, showing that it achieved the highest values for these two metrics. It is again followed by the ARIMA model and, lastly, the Prophet model.

The LSTM model's performance is attributed to its ability to process sequential data of many kinds, while the other two models are sensitive to the properties of the underlying data. The ARIMA model works best with stationary data and requires a larger amount of data to fit well. With data that is not stationary, the ARIMA model performs poorly. The amount of data used in this study was small because the COVID-19 pandemic is still recent and little data is available. In most countries, the datasets could not be made sufficiently stationary, despite the differencing applied during ARIMA model fitting. All of these factors contribute to its poor performance when compared to the LSTM model. The overall worst-performing model in this study is the Prophet model. Despite its ease of setup and the fact that it does not require data preprocessing, this Fourier series-based model failed to find and learn significant trends, seasonality, and holiday structures within the data, which is attributed to the limited data available for training. The LSTM model exposes several hyperparameters, which made it possible to tune it until the best-matching results were reached. However, compared to the other two models, the LSTM model had the highest computational and time cost to achieve optimal results.
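One way to read the percentage distributions described for Fig. 18 is as each model's signed share of the summed absolute metric values. The sketch below illustrates that reading; the exact computation behind the figure is not given in the text, so this normalization is an assumption, and the function name `percentage_share` and the metric values are hypothetical.

```python
def percentage_share(metric_by_model):
    """Map each model to its signed share (in percent) of one metric.

    Assumption: shares are taken relative to the sum of absolute values,
    so a negative metric value yields a share in the negative direction.
    """
    total = sum(abs(value) for value in metric_by_model.values())
    if total == 0:
        raise ValueError("all metric values are zero; shares are undefined")
    return {model: 100.0 * value / total
            for model, value in metric_by_model.items()}


# Made-up placeholder values, not results from the study.
rmse_totals = {"LSTM": 50.0, "ARIMA": 150.0, "Prophet": 300.0}
r_totals = {"LSTM": 0.9, "ARIMA": 0.3, "Prophet": -0.8}

print(percentage_share(rmse_totals))  # smaller share means lower error
print(percentage_share(r_totals))     # larger positive share means better fit
```

Under this reading, the best model has the smallest share of each error metric and the largest positive share of PSNR and R, which matches the interpretation given for the figure.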

Forecasting for the Next 61 Days

After the best prediction model was determined through the training and testing processes, the second major phase of this study involved forecasting the cumulative positive cases for each country with the best-performing model for a period of 61 days. At the time of access to the main COVID-19 case dataset used in this study, the last date of the reported cases for each country in all regions was 2021-11-01. Cumulative positive cases were then forecasted from the last date of the original dataset up to 2022-01-02 for each country in the five major regions of the African continent.
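The forecasting phase can be sketched as a recursive multi-step loop, in which each one-step prediction is appended to the series and fed back as input. The sketch below is illustrative only: the study used the trained LSTM model as the predictor, whereas `mean_increment_predictor` is a simple hypothetical stand-in so that the example runs without a deep learning library. The percentage increase is assumed to be measured relative to the last observed cumulative count.

```python
def recursive_forecast(history, predict_next, horizon=61):
    """Forecast `horizon` future values by feeding predictions back as inputs."""
    series = list(history)
    for _ in range(horizon):
        series.append(predict_next(series))
    return series[len(history):]


def mean_increment_predictor(series, window=7):
    """Placeholder one-step model (not the LSTM used in the study).

    Predicts the last value plus the mean of the most recent daily increments.
    """
    recent = series[-(window + 1):]
    increments = [later - earlier for earlier, later in zip(recent, recent[1:])]
    step = sum(increments) / len(increments) if increments else 0.0
    # Cumulative case counts should never decrease.
    return series[-1] + max(step, 0.0)


def expected_increase_percent(last_observed, last_forecast):
    """Assumed definition: growth relative to the last observed cumulative count."""
    return 100.0 * (last_forecast - last_observed) / last_observed


# Toy cumulative series for illustration only, not data from the study.
observed = [100, 110, 120, 130]
forecast = recursive_forecast(observed, mean_increment_predictor, horizon=61)
print(len(forecast), expected_increase_percent(observed[-1], forecast[-1]))
```

In practice, the placeholder predictor would be replaced by a call to the fitted model, with the same loop producing the 61 forecast values per country.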