Introduction

The coronavirus disease (COVID-19) was first reported in Wuhan, Hubei Province, China, on December 31, 2019, initially as a cluster of pneumonia cases. After assessing the severity of the spread, the World Health Organization (WHO) declared COVID-19 a pandemic on March 11, 2020, as described in the WHO timeline of its COVID-19 response [4]. COVID-19 is caused by the SARS-CoV-2 virus and can infect anyone. In most cases, infected patients recover without intensive treatment. According to the WHO [22], most individuals display mild to moderate symptoms. Individuals with chronic medical illnesses, particularly the elderly, are more likely to experience severe symptoms.

The COVID-19 virus spreads from person to person via small droplets released when an infected person coughs, sneezes, speaks, or breathes. When these droplets settle on surfaces such as door handles, the virus can spread to others who touch these surfaces without taking the necessary precautions. To prevent the spread of the virus, the WHO [22] recommends that people keep a distance of 1 m from others, wash their hands frequently or use a disinfectant, wear a mask, and get vaccinated.

Various approaches have been deployed to prevent and control the spread of the COVID-19 pandemic. One of these strategies is predicting the spread of the virus. In this context, the spread of COVID-19 is treated as a time series problem to which deep learning forecasting algorithms and big data statistical models are applied. Among the deep learning algorithms are the long short-term memory (LSTM) model as applied by Marzouk et al. [12], Hssayeni et al. [6], Yu et al. [24], Zeroual et al. [25], Pal et al. [14] and Shastri et al. [16]; the convolutional neural network (CNN) model as applied by Huang et al. [8], which performs well on image data such as X-ray images; the autoencoder model, which was applied by Hu [7]; gradient boosting, which provided the best results in research conducted by Zoabi et al. [27]; and the Prophet model, which was applied by P. Wang et al. [19] to perform epidemiological trend prediction. Big data statistical models include the auto-regressive integrated moving average (ARIMA) model as applied by Gebretensae and Asmelash [5] and the susceptible-exposed-infectious-removed (SEIR) model, which has been reported to be a robust model for predicting the trend of COVID-19, as applied by Yang et al. [23]. Among the deep learning models used for time series prediction, the LSTM has been widely used because of its successful results in most research experiments. The ARIMA statistical model has also been widely applied in the health sector, for example, by Y. W. Wang et al. [20] to predict the spread of hepatitis B, by Y. Huang et al. [9] to forecast medical service demand, and by Zhang et al. [26] to predict daily blood sampling room visits.

The following questions will be addressed by this research:

  1. What is the best-performing prediction model given the COVID-19 cumulative positive cases data from African countries in five key regions?

  2. Is it possible to estimate the total number of cumulative positive cases 61 days ahead of time using the best prediction model?

  3. After a 61-day forecasting period, which countries on the African continent are in the most vulnerable position in terms of the COVID-19 virus's spread?

In this study, a comparative and analytical approach was followed to predict the spread of the COVID-19 virus. This approach includes two deep learning models, LSTM and Prophet, and one statistical model, ARIMA. In most studies, the ARIMA model does not include the seasonal component of the problem. In this study, however, it is modeled to take the seasonal component of the time series into account. The spread of COVID-19 was treated as a univariate time series problem using the number of COVID-19-positive cases. In “Model Selection Criteria”, the models used in this study are discussed in detail.

This study uses the African continent as a case study. In this comprehensive approach, the African continent was broken down into five major subregions: Northern, Southern, Eastern, Western, and Central Africa. While most studies focus on one or a few countries when predicting the spread of COVID-19, this study covers all of the continent’s regions. The best-performing prediction model was selected using seven performance indicators: mean-square error (MSE), root mean-square error (RMSE), mean absolute percentage error (MAPE), symmetric mean absolute percentage error (SMAPE), R2 score, normalized root mean-square error (NRMSE), and peak signal-to-noise ratio (PSNR). In “The Framework Of The Applied Approach”, the performance metrics are provided in detail. The best-performing model was then used to predict COVID-19 cases 61 days ahead. In “Results and Discussion”, the model results are provided and discussed in detail.

Related Work

In this section, prediction approaches and methods used in other research studies are addressed. These studies mainly concentrate on the prediction of the spread of COVID-19 using both statistical and deep learning tools.

In a research study by Gebretensae and Asmelash [5], the autoregressive integrated moving average (ARIMA) algorithm was used to forecast the spread of COVID-19 in Ethiopia. The autocorrelation function (ACF) and partial autocorrelation function (PACF) were used to obtain the model’s optimal terms. The ARIMA (0, 1, 5) and ARIMA (2, 1, 3) models produced the best results. Ribeiro et al. [15] developed a stacking-ensemble learning algorithm that included ARIMA, cubist regression, random forest, and support vector regression. The Gaussian process was employed as a meta-learner, while the random forest, ridge regression, and other algorithms were used as base learners. In that study, the support vector regression algorithm produced the best results.

Abdulmajeed et al. [1] applied a deep learning ensemble method to predict COVID-19 cases in Nigeria. The emphasis in this study was on creating a prediction method that gives accurate predictions from as little data as possible, because only limited training data was available for models to learn the COVID-19 spread. This deep learning approach combined four prediction approaches, which included one statistical method called ARIMA. Among the other deep learning models in the ensemble approach were the Prophet model (supported and provided by Facebook), the Holt–Winters exponential smoothing model, and the generalized autoregressive conditional heteroscedasticity (GARCH) model. The ARIMA model was applied in its non-seasonal form. To find the best ARIMA model, strategies such as brute-force search and inspection of the autocorrelation function and partial autocorrelation function plots were used.

Wang et al. [19] used a hybrid prediction strategy, combining the logistic and Prophet models, to predict the COVID-19 cumulative cases. With the Prophet model, the primary focus was on modeling non-periodic changes. The model included the date and the total number of COVID-19 cases obtained from a specific country. In this hybrid method, the logistic model was used to identify the quickest rising point in the data. The output of this model was then fed into the Prophet model, which was used to make the final forecast. Marzouk et al. [12] used three deep learning models to forecast the spread of COVID-19 in Egypt: the LSTM, convolutional neural network, and multilayer perceptron neural network. The COVID-19 data was modeled as time series data, and the LSTM outperformed the other two models.

Hssayeni et al. [6] used mobility data to predict the COVID-19 risk spread using the LSTM model and the gradient tree boosting model. They found that the number of daily cases decreased in the retiree context, while it increased in the youth context. Yang et al. [23], on the other hand, used the susceptible-exposed-infectious-removed (SEIR) and the LSTM models to forecast the spread of the COVID-19 pandemic in China. The SEIR algorithm was used to model epidemiological and mobility data by specifying three parameters. The transmission parameter (conventionally written β) was defined as the product of the daily number of people in contact with COVID-19 patients and the likelihood of transmission. σ was the amount of time it took for a COVID-19 patient to develop infection symptoms. Finally, γ was determined to be the average mortality or recovery rate. The rate of pandemic spread in Hubei province was determined using these parameters. These parameters were then fed into the LSTM model as input.

Zeroual et al. [25] used five models to predict new and recovered COVID-19 cases. The recurrent neural network, long short-term memory, bidirectional LSTM, gated recurrent units, and variational autoencoder were among the models used. The study was carried out in six different countries: Italy, Spain, France, China, the USA, and Australia. The variational autoencoder model produced the best results. The best model was used to forecast cases for the next 2 weeks. To forecast the positive COVID-19 outcome in a PCR test, Zoabi et al. [27] used the gradient-boosting algorithm in conjunction with the Shapley additive explanations (SHAP) bee-swarm plot. Sex, contact with COVID-19 patients, and the presence of the five most notable COVID-19 symptoms were all model input features. Techniques such as early stopping were used to improve the results.

Pal et al. [14] used the LSTM model and Bayesian optimization to determine COVID-19 risk categories. To obtain the hyperparameters, the search space had to be defined. The optimal hyperparameters were obtained and used by the model in the local trend prediction phase to perform country-specific predictions. Finally, a fuzzy rule-based risk categorization process was carried out, in which the data obtained from the previous module was used to determine each country’s risk status. This study concluded that weather had no significant impact on the spread of COVID-19.

Shastri et al. [16] conducted research on COVID-19 time series prediction and comparative analysis using variants of long short-term memory neural network models: bidirectional long short-term memory, convolutional long short-term memory, and stacked long short-term memory. Two countries, the USA and India, were used as case studies. Because models are sensitive to the size of data input values, tools like MinMaxScaler were used to perform data normalization. Various regions of the USA and India were divided into initial, moderate, and severe groups based on the severity of the COVID-19 situation. Regions with a high number of COVID-19 cases were classified as severe. When compared to the other two models, the convolutional LSTM model produced the best results.

In the related literature, several models have been used to forecast the spread of COVID-19 in a couple of countries. However, the African continent has not been extensively studied in this regard. This study aimed to close this gap by applying the most successful model (LSTM) among the rest of the forecasting models to conduct an extensive investigation and analysis of African states from the five major regions of the continent. In addition, the most critical states with the highest expected COVID-19 increase rate from each region were identified for immediate action in the region.

Methods and Materials

Data Gathering

Africa’s Geographical Regions and Populations

The case studies used in this study included countries from the five major regions of the African continent. These regions, as depicted in Figure 1, include the Northern, Eastern, Southern, Central, and Western regions.

Fig. 1 Africa’s five major regions

Much work on the COVID-19 pandemic has been done in the literature. In some research, several or individual African countries have been used as case studies, for example, research done by Abdulmajeed et al. [1]. In this study, the African continent is considered from a broader perspective, including countries from each of the major regions that make up the continent. This study performs a comparative analysis of the COVID-19 pandemic spread.

COVID-19 Data

The COVID-19 dataset used in this research was obtained from the Humanitarian Data Exchange [2]. The data was first grouped by country according to geographical region: Northern, Southern, Central, Eastern, and Western Africa. Model fitting was then done for each country separately. Each country's data was split into training and testing datasets, with the training dataset accounting for 80% of the data.
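The split described above can be sketched as follows. The source does not state how the split was taken, so this minimal example assumes a chronological split (the first 80% of observations for training, the remainder for testing), which is the usual choice for time series. The function name and the sample values are hypothetical.

```python
import numpy as np

def chronological_split(series, train_fraction=0.8):
    """Split a time series into a leading training part and a trailing test part."""
    values = np.asarray(series, dtype=float)
    split_index = int(len(values) * train_fraction)
    return values[:split_index], values[split_index:]

# Hypothetical cumulative case counts for one country (not real data).
cumulative_cases = np.array([10, 12, 15, 21, 30, 42, 55, 70, 88, 110])
train, test = chronological_split(cumulative_cases)
```

Keeping the order intact means the test set always lies in the future relative to the training set, which mirrors how a forecast would be used.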

ARIMA Model

The ARIMA model is made up of three main parts: the “AR,” “I,” and “MA” terms. As mentioned by Noureen et al. [13], the “AR” term refers to the autoregression parameter. It indicates that the variable under consideration has a linear relationship between its present and prior values. That is to say, an AR(1) model of order one implies that the current data point in the series depends directly on the immediately preceding data point, while an AR(2) model implies that it depends on the two preceding data points, as described by Kırbaş et al. [10]. The "I" component stands for the integrated element, which captures the differences between the current data points and their preceding values. This part of the ARIMA model handles the stationarity requirement of ARIMA time series processing, which is met through differencing, as explained by Noureen et al. [13]. A time series is stationary when its mean and variance are constant over time. The last part of the basic ARIMA structure is the "MA" part, which represents the moving average. This component expresses the series as a linear combination of the error values at past intervals, as denoted by Ribeiro et al. [15]. The basic ARIMA model is written as ARIMA (p, d, q), where p, d, and q are the orders of the autoregressive, differencing, and moving average terms, as described by Abdulmajeed et al. [1]. The AR (p) term can be written as shown in Eq. 1.

$$Y_{t} = \delta + \varphi_{{{1} }} Y_{t - 1} + \varphi_{{2}} Y_{t - 2} + \cdots + \varphi_{p} Y_{t - p} + \varepsilon_{t} .$$
(1)

In the above equation, Yt denotes the time series value at a given time point t. The terms p, δ, and εt denote the autoregressive order, a constant, and the error value, respectively, and φ1, …, φp are the autoregressive coefficients. The moving average component can be defined mathematically as in Eq. 2.
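As a small illustration of Eq. 1, the sketch below computes a one-step AR(p) forecast with the error term set to its expected value of zero. The coefficient values are made up for the example and are not estimates from the study.

```python
import numpy as np

def ar_one_step_forecast(history, delta, phi):
    """Return delta + phi_1 * Y_{t-1} + ... + phi_p * Y_{t-p} (error term taken as 0)."""
    history = np.asarray(history, dtype=float)
    phi = np.asarray(phi, dtype=float)
    p = len(phi)
    if len(history) < p:
        raise ValueError("need at least p past observations")
    # Reverse the last p values so they line up as Y_{t-1}, Y_{t-2}, ..., Y_{t-p}.
    recent = history[-p:][::-1]
    return delta + float(np.dot(phi, recent))

# Hypothetical AR(2) with delta = 1.0, phi_1 = 0.5, phi_2 = 0.25.
forecast = ar_one_step_forecast([2.0, 3.0, 5.0], delta=1.0, phi=[0.5, 0.25])
```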

$$Y_{t} = \mu + \varepsilon_{t} + \theta_{1} \varepsilon_{t - 1} + \theta_{2} \varepsilon_{t - 2} + \cdots + \theta_{q} \varepsilon_{t - q} .$$
(2)

In Eq. 2, q depicts the order of the moving average term. The difference term d can be obtained from Eq. 3.

$$\Delta Y_{t} = Y_{t} - Y_{t - 1} = Y_{t} - LY_{t} .$$
(3)

In Eq. 3, ∆Yt denotes the stationary time series value at a time interval t.
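The differencing step in Eq. 3 can be sketched with NumPy. The example applies the difference d times and shows that a single difference can be inverted with a cumulative sum. The sample series is hypothetical.

```python
import numpy as np

def difference(series, d=1):
    """Apply the first-difference operator d times."""
    values = np.asarray(series, dtype=float)
    for _ in range(d):
        values = np.diff(values)
    return values

def undifference_once(first_value, differenced):
    """Invert one round of differencing given the first original value."""
    return np.concatenate(([first_value], first_value + np.cumsum(differenced)))

y = np.array([100.0, 104.0, 110.0, 118.0, 128.0])  # hypothetical cumulative counts
dy = difference(y, d=1)
d2y = difference(y, d=2)
restored = undifference_once(y[0], dy)
```

Cumulative case counts are non-decreasing by construction, so differencing is the natural way to move from totals to daily changes before fitting.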

$$(1 - \varphi_{1} L - \varphi_{2} L^{2} - \cdots - \varphi_{p} L^{p} )\Delta^{d} Y_{t} = \delta + \varepsilon_{t} + \theta_{1} \varepsilon_{t - 1} + \cdots + \theta_{q} \varepsilon_{t - q} .$$
(4)

Equation 4 combines the equations for the basic ARIMA model terms into the full ARIMA (p, d, q) model, where L is the lag operator used in Eq. 3 and the series is differenced d times.

The partial autocorrelation function (PACF) and autocorrelation function (ACF) graphs, as shown in Fig. 2, can also be used to obtain the ARIMA model's p and q terms. The ACF plot is a graphical representation of the correlation between the data and its prior values in a time series over different lag intervals. The PACF differs in that it shows the correlation at a given lag after the influence of the shorter, intermediate lags has been removed, as explained in the research by Noureen et al. [13].

Fig. 2 Representation of PACF and ACF plots
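To make the ACF concrete, the following sketch computes the sample autocorrelation for lags 0 to `max_lag` using the common biased estimator (each lagged sum is divided by the lag-0 sum). It is an illustration only and not necessarily the routine used to produce Fig. 2. The input series is hypothetical.

```python
import numpy as np

def sample_acf(series, max_lag):
    """Sample autocorrelation at lags 0..max_lag (biased estimator)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denominator = np.dot(x, x)
    return np.array([
        np.dot(x[:len(x) - lag], x[lag:]) / denominator
        for lag in range(max_lag + 1)
    ])

acf_values = sample_acf([1.0, 2.0, 3.0, 4.0, 5.0], max_lag=2)
```

The value at lag 0 is always 1. Lags whose values stand clearly apart from zero are the ones usually read off such a plot when choosing the q term.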

Prophet Model

The Prophet model is a deep learning model for time series forecasting. Facebook created and maintains this model as an open-source project. According to Taylor and Letham [18], it is based on the general specification of a generalized additive model (GAM), a regression model in which the response depends on a sum of smooth functions of the predictors. GAMs can be represented using Eq. 5.

$$g\left( {E\left( Y \right)} \right) = \beta_{0} + f_{1} \left( {x_{1} } \right) + f_{2} \left( {x_{2} } \right) + \cdots + f_{m} \left( {x_{m} } \right).$$
(5)

In Eq. 5, Y represents the univariate response variable, \(x_{i}\) represents the predictor variables, and \(f_{i}\) represents the smoothing functions. Due to its use of the GAM formulation, the Prophet model has a variety of benefits, including flexibility and quick fitting times. It evaluates a time series problem from three perspectives: the trend, seasonality, and holiday components, as discussed by Taylor and Letham [18]. The trend component takes into account the tendency of the time series data to increase or decrease over time. Seasonality, on the other hand, captures changes that recur over shorter, fixed periods.

$$y(t) = g(t) + s(t) + h(t) + \varepsilon_{t} .$$
(6)

The final predicted value y(t) is obtained from a combination of the trend, seasonal and holiday component functions as shown in Eq. 6 above, where εt represents the changes that are not captured by the model [18].
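The additive structure of Eq. 6 can be illustrated with a deliberately simplified sketch: a straight-line trend for g(t), a single weekly sine wave for s(t), and a fixed offset on chosen days for h(t). Prophet itself uses a piecewise trend and Fourier-series seasonality, so this is a toy illustration of the decomposition, not the library's implementation. All parameter values are hypothetical.

```python
import numpy as np

def additive_forecast(t, trend_intercept, trend_slope,
                      weekly_amplitude, holiday_days, holiday_effect):
    """Return g(t) + s(t) + h(t) for a toy additive model."""
    t = np.asarray(t, dtype=float)
    g = trend_intercept + trend_slope * t                    # trend component
    s = weekly_amplitude * np.sin(2.0 * np.pi * t / 7.0)     # weekly seasonality
    h = np.where(np.isin(t, holiday_days), holiday_effect, 0.0)  # holiday offset
    return g + s + h

days = np.arange(15)
y_hat = additive_forecast(days, trend_intercept=100.0, trend_slope=2.0,
                          weekly_amplitude=5.0, holiday_days=[7],
                          holiday_effect=-10.0)
```

Because the components are simply summed, each one can be inspected or adjusted on its own, which is the practical appeal of the additive form.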

LSTM Model

The LSTM model is composed of three core components: the forget gate, input gate, and output gate [16]. The forget gate determines the degree to which past data is discarded. The input gate controls the data that is taken into the cell’s internal state, while the output gate is used to create the next hidden state, or output, from the existing internal state value.

Figure 3 displays the major building blocks of the LSTM model, which consist of the forget gate, input gate, and output gate, as described by Le and Lee [11]. Activation functions such as the tanh and sigmoid functions are used within these gates.
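The gate mechanics can be sketched as a single forward step of a standard LSTM cell in NumPy. The stacked weight layout and the gate ordering (forget, input, candidate, output) are assumptions made for this example. Deep learning libraries use their own layouts, and the study's actual network configuration is not shown here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step.

    W: (4*hidden, n_inputs), U: (4*hidden, hidden), b: (4*hidden,),
    stacked in the assumed order forget, input, candidate, output.
    """
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0:hidden])                    # forget gate: how much old state to keep
    i = sigmoid(z[hidden:2 * hidden])           # input gate: how much new content to admit
    g = np.tanh(z[2 * hidden:3 * hidden])       # candidate content for the cell state
    o = sigmoid(z[3 * hidden:4 * hidden])       # output gate: how much state to expose
    c_t = f * c_prev + i * g                    # updated internal (cell) state
    h_t = o * np.tanh(c_t)                      # next hidden state / output
    return h_t, c_t

hidden_size, n_inputs = 2, 3
W = np.zeros((4 * hidden_size, n_inputs))
U = np.zeros((4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)
h_t, c_t = lstm_cell_step(np.ones(n_inputs), np.zeros(hidden_size),
                          np.ones(hidden_size), W, U, b)
```

With all weights at zero every gate evaluates to 0.5, so half of the previous cell state is carried forward. Training adjusts the weights so that the gates open and close depending on the input.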

Model Selection Criteria

In this study, seven metrics were adopted to assess the predictive performance of the models: the peak signal-to-noise ratio (PSNR), mean-squared error (MSE), root mean-square error (RMSE), symmetric mean absolute percentage error (SMAPE), mean absolute percentage error (MAPE), normalized root mean-square error (NRMSE), and R2 score.

Mean-Square Error

The mean-squared error can be calculated using Eq. 7.

$${\text{MSE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left( {\hat{Y}_{i} - Y_{i} } \right)^{2} }$$
(7)

In Eq. 7, \(n\) is the total number of observations, \(Y_{i}\) is the actual value, and \(\hat{Y}_{i}\) is the predicted value.

Root Mean-Square Error

The RMSE can be calculated using Eq. 8.

$${\text{RMSE}} = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {\left( {\hat{Y}_{i} - Y_{i} } \right)^{2} } }$$
(8)

In Eq. 8, \(n\) is the total number of observations, \(Y_{i}\) is the actual value, and \(\hat{Y}_{i}\) is the predicted value.
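Equations 7 and 8 translate directly into NumPy. The sketch below is a minimal illustration with hypothetical actual and predicted values.

```python
import numpy as np

def mse(actual, predicted):
    """Mean-squared error: average of squared prediction errors."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((predicted - actual) ** 2))

def rmse(actual, predicted):
    """Root mean-square error: square root of the MSE."""
    return float(np.sqrt(mse(actual, predicted)))

mse_value = mse([100, 200, 300], [110, 190, 300])
rmse_value = rmse([100, 200, 300], [110, 190, 300])
```

The RMSE is expressed in the same unit as the data (here, cases), which makes it easier to interpret than the MSE.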

Mean Absolute Percentage Error

Equation 9 can be used to represent this performance measure numerically.

$${\text{MAPE}}\, = \,\frac{100\% }{n}\mathop \sum \limits_{t = 1}^{n} \left| {\frac{{A_{t} - F_{t} }}{{A_{t} }}} \right|.$$
(9)

In Eq. 9, \(A_{t}\) is the observed value, \(F_{t}\) is the forecasted value, and \(n\) is the total number of data points.

Symmetric Mean Absolute Percentage Error

Equation 10 can be used to represent this measurement numerically.

$$SMAPE = \,\,\frac{100\% }{n}\mathop \sum \limits_{t = 1}^{n} \frac{{\left| {F_{t} - A_{t} } \right|}}{{\left( {\left| {A_{t} } \right| + \left| {F_{t} } \right|} \right)/2}}.$$
(10)

In Eq. 10, \(A_{t}\) is the observed value, \(F_{t}\) is the forecasted value, and \(n\) is the total number of observations.
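Equations 9 and 10 can be sketched as follows. Both functions return percentages, and both assume that no denominator is zero (no zero observed value for MAPE, no pair of zero observed and forecasted values for SMAPE). The sample values are hypothetical.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent (Eq. 9)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(100.0 * np.mean(np.abs((actual - forecast) / actual)))

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (Eq. 10)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denominator = (np.abs(actual) + np.abs(forecast)) / 2.0
    return float(100.0 * np.mean(np.abs(forecast - actual) / denominator))

mape_value = mape([100, 200], [110, 180])
smape_value = smape([100, 200], [110, 180])
```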

Peak Signal-to-Noise Ratio

$$PSNR = 20\log_{10} \left( {\frac{{MAX_{f} }}{{\sqrt {MSE} }}} \right).$$
(11)

The highest signal value is expressed by \(MAX_{f}\) in Eq. 11. \(MSE\) stands for mean-square error.

Normalized Root Mean-Square Error

$$NRMSE = \frac{RMSD}{{Y_{max} - Y_{min} }}.$$
(12)

Equation 12 divides the root mean-square deviation (RMSD) by the range of the observed values, \(Y_{max} - Y_{min}\). The RMSD measure is also known as the RMSE statistic (Fig. 3).
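Equations 11 and 12 can be sketched as follows. The sketch assumes that \(MAX_{f}\) is taken as the largest observed value and that the normalization range comes from the observed values. The source does not spell out either choice, and the sample values are hypothetical.

```python
import numpy as np

def psnr(actual, predicted):
    """Peak signal-to-noise ratio in dB, with MAX_f assumed to be max(actual)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mean_squared_error = np.mean((predicted - actual) ** 2)
    return float(20.0 * np.log10(actual.max() / np.sqrt(mean_squared_error)))

def nrmse(actual, predicted):
    """RMSE divided by the range of the observed values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    root_mse = np.sqrt(np.mean((predicted - actual) ** 2))
    return float(root_mse / (actual.max() - actual.min()))

psnr_value = psnr([0.0, 10.0, 20.0], [1.0, 9.0, 20.0])
nrmse_value = nrmse([0.0, 10.0, 20.0], [1.0, 9.0, 20.0])
```

A perfect prediction gives an MSE of zero and therefore an undefined PSNR, so callers may want to handle that case separately.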

Fig. 3 General structure of an LSTM model

R2 Score

$$R^{2} = 1 - \frac{{\sum\nolimits_{i} \left( {y_{i} - f_{i} } \right)^{2} }}{{\sum\nolimits_{i} \left( {y_{i} - \bar{y}} \right)^{2} }}.$$
(13)

In Eq. 13, the predicted values are represented by \(f_{i}\), the original values are represented by \(y_{i}\), and the mean of the original values is represented by \(\bar{y}\).
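A minimal sketch of the R2 score follows, assuming the conventional definition as one minus the ratio of the residual sum of squares to the total sum of squares. The sample values are hypothetical.

```python
import numpy as np

def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)       # unexplained variation
    ss_tot = np.sum((actual - actual.mean()) ** 2)   # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

r2_value = r2_score([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

A value of 1 indicates a perfect fit, while a value of 0 means the model does no better than predicting the mean.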

The Framework of Applied Approach

As shown in Fig. 4, the major stages of this study are splitting the preprocessed cumulative positive COVID-19 cases data into 80% training and 20% testing datasets, fitting the models, validating the model performance using the performance metrics, and then selecting the best-performing model and using it to forecast the positive COVID-19 cases for the next 61 days.

Fig. 4 Structural depiction of the methodology used in this study
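The workflow can be sketched end to end as follows. To keep the example self-contained, two simple stand-in forecasters (last value and linear drift) take the place of LSTM, ARIMA, and Prophet, and only RMSE is used for selection, whereas the study uses seven metrics. Refitting the winning model on the full series before forecasting is also an assumption. All names are hypothetical.

```python
import numpy as np

def rmse(actual, predicted):
    return float(np.sqrt(np.mean((np.asarray(predicted) - np.asarray(actual)) ** 2)))

def naive_last_value(train, horizon):
    """Stand-in model: repeat the last observed value."""
    return np.full(horizon, train[-1], dtype=float)

def linear_drift(train, horizon):
    """Stand-in model: extend the average slope of the training data."""
    slope = (train[-1] - train[0]) / (len(train) - 1)
    return train[-1] + slope * np.arange(1, horizon + 1)

def select_and_forecast(series, candidates, train_fraction=0.8, horizon=61):
    values = np.asarray(series, dtype=float)
    split = int(len(values) * train_fraction)
    train, test = values[:split], values[split:]
    # Score every candidate on the held-out 20%.
    scores = {name: rmse(test, model(train, len(test)))
              for name, model in candidates.items()}
    best = min(scores, key=scores.get)
    # Assumption: the winner is re-applied to the full series for the forecast.
    forecast = candidates[best](values, horizon)
    return best, scores, forecast

candidates = {"naive_last_value": naive_last_value, "linear_drift": linear_drift}
series = 5.0 * np.arange(100)  # hypothetical, perfectly linear cumulative counts
best_name, scores, forecast = select_and_forecast(series, candidates)
```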

Rationale for the Selected Models

This section aims to address the reasons for choosing the LSTM, ARIMA, and Prophet models to perform the prediction and forecasting of the COVID-19 cumulative positive cases data for the various African countries in this study.

LSTM

This model is a special class of recurrent neural network deep learning models with the capability to identify and learn the relationship that exists within a given series of data observations, as described in the research by Yu et al. [24]. This is possible because the LSTM has memory modules that act as a connection between past and current data points. Important data points with strong desired insights are retained, while those with weaker weights are disposed of in the forget module of the LSTM model. This both optimizes the model to concentrate on extracting the dependence that exists within a given input sequence and also minimizes the error by eliminating noise points from the learned data at this stage. As described by Zeroual et al. [25], the LSTM model eliminates the problem of vanishing gradients that is faced with traditional recurrent neural networks, whereby the computed gradient fluctuates within peak ranges, that is to say, either too big or too small. According to Zeroual et al. [25], this issue arises during the training phase. The LSTM model solves the vanishing gradient problem with the help of activation vectors used in the forget gate to determine the gradient values. It is at this point that the LSTM model, by using a summative strategy, identifies the optimal terms to adjust at a given time step, which improves accuracy and overall performance. The LSTM model implementation provides several hyperparameters, such as the batch and epoch numbers, which can be easily adjusted to obtain better results. This makes it easy to fit and use the LSTM model to achieve accurate results. These robust qualities of the LSTM model make it ideal for performing the time series prediction task.
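Before a univariate series can be passed to an LSTM, it is commonly scaled and cut into fixed-length input windows with the next value as the target. The sketch below shows that preparation step under stated assumptions: min-max scaling to [0, 1] and a window length chosen by the caller. The source does not describe its own preprocessing, so the function names and values are illustrative only.

```python
import numpy as np

def min_max_scale(values):
    """Scale values to [0, 1]; returns the scaled array and the (min, max) used."""
    values = np.asarray(values, dtype=float)
    low, high = values.min(), values.max()
    if high == low:
        raise ValueError("series is constant; min-max scaling is undefined")
    return (values - low) / (high - low), (low, high)

def make_windows(values, window):
    """Build (inputs, targets): each row holds `window` past values, target is the next one."""
    values = np.asarray(values, dtype=float)
    inputs = np.array([values[i:i + window] for i in range(len(values) - window)])
    targets = values[window:]
    return inputs, targets

scaled, (low, high) = min_max_scale([0.0, 10.0, 20.0, 30.0, 40.0])
X, y = make_windows(scaled, window=2)
```

The window length acts as an additional hyperparameter alongside the batch size and number of epochs mentioned above. Predictions made on the scaled data are mapped back with `prediction * (high - low) + low`.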

ARIMA

This is a statistical method that uses regression in which past data points and errors are connected using weight factors, which improves the overall prediction results, as described in the research by Singh et al. [17]. The model combines the strengths of the autoregression and moving average models, which makes it a robust choice for extracting the inherent statistical relationship between the dependent and independent variables. It is flexible, since differencing between past and present data points allows it to process non-stationary data using only a few parameters, as described by Abdulmajeed et al. [1]. Its optimal parameter terms can be obtained with simple methods such as the PACF and ACF plots, as described in the research by Gebretensae and Asmelash [5]. In addition, metrics such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) make it possible to measure how good the ARIMA model is for a given combination of hyperparameter terms, which makes it easier to refine the prediction results. The model can also process data with seasonal trends by extending the hyperparameter terms to include seasonal factors, as explained by Y. Wang et al. [21]. This makes it possible to capture any seasonal relationship within the COVID-19 dataset.
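The role of the information criteria can be illustrated with a simplified order search. The sketch fits pure AR(p) models by ordinary least squares (no differencing or MA terms, so it is a stand-in for a full ARIMA fit), computes AIC and BIC from the Gaussian log-likelihood of the residuals, and picks the order with the lowest BIC. The simulated series and all names are hypothetical.

```python
import numpy as np

def fit_ar_least_squares(series, p, max_lag):
    """Fit Y_t = delta + sum_i phi_i * Y_{t-i} + e_t by least squares.

    Every order is fitted on the same targets (from index max_lag onward)
    so that the criteria are comparable across orders. Requires p <= max_lag.
    """
    y = np.asarray(series, dtype=float)
    target = y[max_lag:]
    columns = [np.ones(len(target))]
    for i in range(1, p + 1):
        columns.append(y[max_lag - i:len(y) - i])  # lag-i regressor
    design = np.column_stack(columns)
    coefficients, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefficients, target - design @ coefficients

def information_criteria(residuals, n_params):
    """Return (AIC, BIC) under a Gaussian likelihood for the residuals."""
    n = len(residuals)
    sigma2 = float(np.mean(np.asarray(residuals) ** 2))
    log_likelihood = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    aic = 2.0 * n_params - 2.0 * log_likelihood
    bic = n_params * np.log(n) - 2.0 * log_likelihood
    return aic, bic

# Simulate a hypothetical AR(1) series and compare candidate orders.
rng = np.random.default_rng(0)
series = np.zeros(300)
for t in range(1, 300):
    series[t] = 1.0 + 0.6 * series[t - 1] + rng.normal()

max_lag = 4
scores = {}
for p in range(max_lag + 1):
    _, residuals = fit_ar_least_squares(series, p, max_lag)
    scores[p] = information_criteria(residuals, n_params=p + 2)  # delta, phi_1..p, variance
best_p_by_bic = min(scores, key=lambda order: scores[order][1])
```

Both criteria reward a good fit and penalize extra parameters. BIC applies the heavier penalty for all but very small samples, so it tends to select smaller models.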

Prophet

According to Abdulmajeed et al. [1], this is an additive regression model supported by Facebook with a robust architecture that takes into account seasonal dynamics within a given data sequence, such as yearly, weekly, and daily trends. It also handles missing data points and extreme values well, since it can identify data anomalies, as described by Y. Wang et al. [21]. This makes it well suited to the COVID-19 datasets of countries whose data has these characteristics, such as sharp spikes away from the general trend. According to Taylor and Letham [18], the Prophet model has built-in support for non-linear growth curves that saturate at a natural boundary, and it offers flexibility in tuning, such as smoothing features that model seasonality constraints in the data to fit historical cycles well. It is also easy to model the effects of events such as holidays in the time series data with the Prophet model using limited data [18]. These qualities make this model appropriate for predicting the COVID-19 spread.

Results and Discussion

In this study, countries from the African continent were grouped into the five groups named in “Data Gathering”. Three forecasting models were used, including the ARIMA, LSTM, and Prophet. In this section, the performance results obtained from these models are given for each region of Africa.

Model Training and Testing

Northern Africa

In the Northern region of Africa, of the six countries studied, the most populous country is Egypt, as shown in Fig. 2, with a population of 102,334,404, while the least populous is Mauritania, with a population of 4,649,658, according to Worldometer [3].

In Fig. 5, it can be seen that Morocco has maintained the highest number of COVID-19 cases over time, followed by Tunisia. Mauritania, on the other hand, has the lowest number of cases over time compared to the other states in this region.

Fig. 5 Cumulative positive cases for Northern Africa

Libya shows a gradual increase in cases between October 2020 and July 2021. Beyond July, a sharp increase is observed that slows toward October. This clearly describes the first wave of COVID-19 cases in Libya. Algeria's trend is similar to that of Libya. However, the cases in Algeria level off at a constant number, while in Libya they continue to increase.

According to Fig. 6, the LSTM model fits better than both the ARIMA and Prophet models. In Tunisia, the Prophet model performs the worst in predicting the test data: while the test data flattens to a constant case value, the Prophet model predicts a sharp increase to over 800,000 cases. In Egypt and Tunisia, the ARIMA and Prophet models predicted lower and higher cases, respectively, than the actual data. In the four other countries, both models predicted lower cumulative positive cases than the actual data. This confirms the poor performance of these two models when compared to the LSTM model, which predicts results close to the actual data in the five countries other than Egypt.

Fig. 6 Actual and predicted cumulative cases in Northern Africa

In Table 1, larger PSNR and R2 values indicate better relative model performance.

Table 1 Performance parameters of the models for Northern Africa

Central Africa

In this region, five states were studied. At the time of this study, the most populated state in this group was Cameroon, with a population of 26,545,863 [3]. The least populated state is São Tomé and Príncipe, with a population of 219,159.

Figure 7 shows the COVID-19 cumulative cases for the five countries in this region. According to this graph, COVID-19 cases in Cameroon are higher than in the rest of the countries, with more than two significant waves. Cameroon is followed by Gabon, which also has more than two waves. The rest of the countries maintain a nearly constant curve, with minor increases in COVID-19 cases. The lowest number of cases is seen in São Tomé and Príncipe. A positive correlation is observed between population and the number of cases: the highest number of cases is observed in Cameroon, which is also the most populated state in this region [3], while the fewest cases are observed in São Tomé and Príncipe, the country with the smallest population. This makes Cameroon the member with the highest risk in terms of COVID-19 spread in this region.

Fig. 7 Cumulative positive cases for Central Africa

Table 2 Performance parameters of the models for Central Africa

Figure 8 shows a plot of the model performance after prediction of the test data in various countries in the Central African region. In three countries, the LSTM model prediction generally matches well with the actual data. This implies that the best performance in this region was observed from the LSTM model. It is also observed that the worst model performance is given by the Prophet model, for example in Cameroon. In Chad, the ARIMA model performs relatively well in predicting the data, while in the rest of the countries, it comes immediately after the LSTM model.

Fig. 8 Actual and predicted cumulative cases in Central Africa

Southern Africa

Ten countries from this region were used in this study. As shown in Fig. 3, the most populous country in this region is South Africa, with a population of 59,308,690. The least populous, on the other hand, is Eswatini, with a population of 1,160,164.

In Fig. 9, it is clearly observed that South Africa has the highest number of cases compared to other countries in the same region. This shows how fast the COVID-19 virus spreads in this country. This puts the other neighboring countries in the same region at a very high risk of having increased rates of spread of the virus. While the other countries in the same region are experiencing their second wave of virus spread, South Africa is observed to have three waves. Since it has the largest population, there is a positive correlation between the large number of cases observed and the large population.

Fig. 9

Cumulative positive cases in the Southern African region including South Africa

For clarity, in Fig. 10, South Africa was excluded to be able to perform a comparative analysis of the COVID-19 state in other countries in the same region. It can be observed that, apart from South Africa, Zambia has the largest number of cases compared to other countries. It is also the first country to have an earlier increase in the number of cases. It is also observed that all countries have had their second major wave of COVID-19 spread. It is worth noting that the lowest number of cases was observed in Lesotho. Beyond the month of October, it is clearly observed that in all countries, there is a constant number of cases with the curves flattened. This clearly signifies the effects of some form of control of the spread by a number of practices, such as quarantines and vaccinations.

Fig. 10

Cumulative positive cases in the Southern African region excluding South Africa

In Fig. 11, in three countries (Botswana, Malawi, and Mozambique), the LSTM model provided the best-matching prediction results. In Lesotho, the ARIMA model performed better than the other two models. The Prophet model emerged as the worst performer, as clearly observed in four countries: Malawi, Mozambique, Eswatini, and Lesotho. In these countries, this model predicts a roughly constant number of cases, with only slight increases. In Angola, both the LSTM and Prophet models produced predictions close to the actual data, while the ARIMA model predicted a lower number of cases that was still reasonably close to the actual data. It is in this country that the three models show the greatest uniformity in their predicted results. This can generally be attributed to the smooth rise in the number of cases in Angola, which makes it easier for all the models to capture the inherent relationships and trends in the data and so make better predictions.

Fig. 11

Actual and predicted cumulative positive cases for Southern Africa (a)

In Fig. 12, it is observed that the ARIMA model performed the worst when compared to the other models. This model made predictions that were generally higher than the actual data. In all four countries, the ARIMA model predicts a higher number of cases than the numbers predicted by the other models. The LSTM model is observed to provide the best performance with the best-matching predictions, followed by the Prophet model with the second-best prediction performance. In the Southern African region, the LSTM model is observed to provide the best overall prediction results compared to the ARIMA and Prophet models, as shown in both Figs. 11 and 12, while the worst prediction results are observed from the ARIMA model.

Fig. 12

Actual and predicted cumulative positive cases in Southern Africa (b)

Table 3 displays the performance metrics used to determine the best prediction model in the Southern African region.

Table 3 Performance parameters of the models for Southern Africa

Western Africa

In this research study, 12 countries from this region were used as case studies. In the Western region, Nigeria is the country with the largest population, with a total of 206139589 people. Guinea-Bissau, on the other hand, has the smallest population of 1968001 [3].

Figure 13 gives a comparative plot of the 12 countries used in this study from the Western region of Africa. It shows the state of the COVID-19 pandemic in each of the 12 countries and the severity of the risk in terms of COVID-19 spread, as given by the cumulative positive cases. Between January and April 2020, no COVID-19 cases were reported in this region; the first cases began to be reported after April of the same year. Notably, after this, in about four countries, including Nigeria, Ghana, Senegal, and Mali, there is a sharp increase in the number of cases, while in the other countries there is a gradual increase. Nigeria, followed by Ghana and Senegal, displays the highest number of cases over time. Nigeria, being the most populated country with over 200 million people and the highest number of cases, is the member at highest risk in this region. If immediate measures are not taken, there is a higher chance of a faster spread to other countries too.

Fig. 13

Cumulative positive cases in Western Africa

Figure 14 displays the prediction results of the three models for the first group of countries in Western Africa. In this group, the LSTM model outperformed the other two models in producing the best-matching prediction results, as can be clearly observed in Guinea, Guinea-Bissau, Gambia, Ghana, and Togo. In Burkina Faso, the Prophet model makes the most successful prediction. The ARIMA and Prophet models make matching predictions in three countries: Guinea-Bissau, Ghana, and Togo. These predictions suggest a lower COVID-19 case number than the actual data, which is further evidence that these two models perform poorly compared to the LSTM model. Figure 15 gives the second group of model predictions in the Western region of Africa. According to this figure, the best prediction performance in Niger is obtained from the Prophet model; this is the only country in this group where this model performs best. The ARIMA model did not achieve the top performance in any of the countries. In the other countries in this group, the LSTM model maintains the best-matching prediction results, which continues to affirm it as the top-performing model in this region. In Nigeria, the ARIMA and Prophet models make predictions that match each other but are lower than, and significantly different from, the actual data. These results indicate that the LSTM model is the best prediction model in the West African region.

Fig. 14

Actual and predicted cumulative positive cases in Western Africa (a)

Fig. 15

Actual and predicted cumulative cases in Western Africa (b)

In Table 4, the prediction results based on the seven metrics used in this study for the three models are provided for the 12 countries from the Western region of Africa.

Table 4 Performance parameters of the models for Western Africa

Eastern Africa

From this region, 12 countries were studied. Among these, the Comoros is observed to be the least populated country, with a population of 869601, while the most populated country is observed to be Ethiopia, with a population of 114963588 at the time of this study.

The cumulative positive COVID-19 cases for the countries in the Eastern region of Africa are plotted in Fig. 16. In this region, the highest number of cases is observed in Ethiopia, followed by Kenya. It is worth noting that the population of Kenya, at 53771296 people, ranks immediately after that of Ethiopia, and its number of cumulative cases likewise ranks immediately after that of Ethiopia, which suggests a roughly positive correlation between population size and the number of confirmed cases. If proactive measures are not applied, the Eastern region is at a higher risk of experiencing a surge in the spread of COVID-19. In this region, the first cases occurred relatively late: significant numbers of cases started to be registered only after July 2020 in all countries. Kenya is observed to have the highest number of waves of the COVID-19 spread. Apart from Ethiopia, Kenya, Uganda, Rwanda, Madagascar, and Sudan, the countries are observed to have a relatively slow increase in the number of cases reported. This may be due to varying measures taken by the respective countries and by the general population, for example in the Comoros, the least populated country in this region.

Fig. 16

Cumulative positive cases for Eastern Africa

Figures 17a and 17b display the prediction results from the LSTM, ARIMA, and Prophet models in the 12 countries used in this study from the Eastern region of Africa. These results show both the data predicted by the models and the actual data. It is observed from Fig. 17a that all three models performed relatively well in the Comoros, followed by Sudan, as displayed in Fig. 17b. In the rest of the countries, in both figures, the three models show significant relative discrepancies in performance. In Fig. 17a, both the LSTM and ARIMA models obtained better-matching prediction results than the Prophet model in Madagascar. In the same figure, the worst model performance is observed in both Djibouti and Madagascar by the Prophet model, while the best model performance is obtained by the LSTM model in all countries shown. In Fig. 17b too, the LSTM model is observed to have the overall best-matching prediction results when compared to the ARIMA and Prophet models. In both Mauritius and Rwanda, the worst model performance is observed from both the ARIMA and Prophet models, whose predictions deviated strongly from the actual data. These results indicate that the LSTM model outperformed the ARIMA and Prophet models in the Eastern region.

Fig. 17

Actual and predicted cumulative positive cases for Eastern Africa (a) and (b)

In Table 5, the three model performances have been given for the 12 countries from the Eastern region of Africa.

Table 5 Performance parameters of the models for Eastern Africa

Figure 18 displays the overall combined model performance across all individual regions used in this study. It shows percentage distributions in both the positive and negative directions to quantify each model's performance according to its contribution to the total error value for each of the seven error metrics used in this study. For PSNR and R, good performance is indicated by a distribution further toward the positive direction, and bad performance by a more negative percentage distribution. For the RMSE, MAPE, NRMSE, SMAPE, and MSE errors, good performance is indicated by a small positive percentage distribution, and bad performance by a large positive percentage distribution. The RMSE, MAPE, NRMSE, SMAPE, and MSE metrics indicate that the overall best performance in this study was obtained by the LSTM model, followed by the ARIMA model and, lastly, the Prophet model: the LSTM model obtained the smallest percentage distribution of the total error in all five metrics, while the ARIMA model obtained larger percentage distributions than the LSTM model but smaller ones than the Prophet model. The PSNR and R values also show that the LSTM model outperforms the other two models. Both values for the LSTM model tend toward the positive direction, showing that it achieved the highest values for these two metrics, again followed by the ARIMA model and, lastly, the Prophet model.

The LSTM model's performance is attributed to its ability to process sequential data of varied nature, while the other two models are affected by the inherent properties of the data. The ARIMA model works best with stationary data and also requires a larger amount of data to fit well; with non-stationary data, it performs poorly. The data used in this study was limited because the COVID-19 pandemic is still recent and little data is available. In most countries, the datasets could not be made fully stationary, despite the differencing applied during ARIMA model fitting. All of these factors contribute to its poor performance when compared to the LSTM model. The overall worst-performing model in this study is the Prophet model. Despite its ease of setup and not requiring data preprocessing, this Fourier series-based model failed to learn significant trends, seasonality, and holiday structures within the data well enough to make best-matching predictions, because of the limited data available for training. The several hyperparameters of the LSTM model made it possible to tune it until the best-matching results were reached. However, compared to the other two models, the LSTM model had the highest computational and time cost to achieve optimal results.
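As an illustration of the seven metrics discussed above, the following is a minimal Python sketch that computes them for one series of actual and predicted values. The exact definitions used in the study are not given in this section, so several choices here are assumptions: NRMSE is normalised by the range of the actual values, PSNR takes the maximum actual value as the peak, the R metric is computed as the coefficient of determination (R2), and MAPE and SMAPE are expressed in percent. The function name `regression_metrics` and the sample numbers are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAPE, SMAPE, R2, NRMSE and PSNR for two sequences.

    Assumes equal lengths, non-zero actual values, non-constant actual
    values and a non-zero RMSE.
    """
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # Percentage errors (in percent).
    mape = 100.0 / n * sum(abs(e) / abs(a) for e, a in zip(errors, actual))
    smape = 100.0 / n * sum(
        2.0 * abs(e) / (abs(a) + abs(p))
        for e, a, p in zip(errors, actual, predicted)
    )

    # Coefficient of determination.
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot

    # Assumption: normalise RMSE by the range of the actual values.
    nrmse = rmse / (max(actual) - min(actual))

    # Assumption: the peak signal is the largest actual value.
    psnr = 20.0 * math.log10(max(actual) / rmse)

    return {
        "MSE": mse, "RMSE": rmse, "MAPE": mape, "SMAPE": smape,
        "R2": r2, "NRMSE": nrmse, "PSNR": psnr,
    }


# Hypothetical values, not taken from the study's dataset.
example = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
```

Lower values are better for the five error metrics, while higher values are better for R2 and PSNR, which matches the reading of Fig. 18 given above.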

Fig. 18

Total error distribution of the models
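One possible way to obtain a percentage distribution like the one in Fig. 18 is to express each model's metric value as a signed share of the summed absolute values across the three models. The sketch below assumes that interpretation; the study's exact normalisation is not stated in this section, and the function name `error_shares` and the input numbers are hypothetical placeholders rather than results from the study.

```python
def error_shares(metric_by_model):
    """Return each model's signed percentage share of the total metric value.

    Assumption: the total is the sum of absolute values, so that negative
    metric values (possible for R2 or PSNR) give negative shares.
    """
    total = sum(abs(value) for value in metric_by_model.values())
    return {
        model: 100.0 * value / total
        for model, value in metric_by_model.items()
    }


# Hypothetical RMSE values for one country, for illustration only.
shares = error_shares({"LSTM": 10.0, "ARIMA": 30.0, "Prophet": 60.0})
```

Under this reading, the model with the smallest share for an error metric such as RMSE contributes least to the total error and is therefore the best performer for that metric.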

Forecasting for the Next 61 Days

In this study, after determining the best prediction model through the training and testing processes, the second major phase involved forecasting the cumulative positive cases with the best-performing model for each country over a period of 61 days. At the time of access to the main COVID-19 case dataset used in this study, the last date of reported cases for each country in all regions was 2021-11-01. Cumulative positive cases were then forecasted from the last date of the original dataset up to 2022-01-02 for each country in the five major regions of the African continent.

Northern Africa

As displayed in Fig. 19, the COVID-19 cumulative positive cases are expected to increase rapidly in Egypt. In Tunisia, Algeria, and Mauritania, cases are expected to maintain a flat rate of increase, while in Libya the rate of increase is expected to rise gradually. In Morocco, a slight decrease is expected, followed by a constant number of cases with a small increase at the end of the forecasting period. By the end of the prediction period, all of these Northern African countries are expected to show an increase in cumulative cases. In Algeria, Mauritania, Tunisia, Egypt, Libya, and Morocco, cases are expected to increase from 206452 to 208009, 37320 to 38250, 712747 to 716835, 331017 to 370164, 357338 to 369986, and 946145 to 947226, respectively. With an 11.83% increase in the number of cases at the end of the forecasting period, Egypt is the country in this region with the largest expected increase in the number of COVID-19 cumulative positive cases.
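The percentage increases quoted for each region follow from the last reported and the final forecasted cumulative counts. The short Python sketch below reproduces the calculation for the Northern African figures given above; the variable and function names are illustrative, not taken from the study's code.

```python
# (last reported cases, forecasted cases at the end of the period),
# as quoted in the text for Northern Africa.
northern_africa = {
    "Algeria": (206452, 208009),
    "Mauritania": (37320, 38250),
    "Tunisia": (712747, 716835),
    "Egypt": (331017, 370164),
    "Libya": (357338, 369986),
    "Morocco": (946145, 947226),
}


def percent_change(start, end):
    """Relative change from start to end, in percent."""
    return 100.0 * (end - start) / start


changes = {
    country: percent_change(start, end)
    for country, (start, end) in northern_africa.items()
}

# Country with the largest expected percentage increase.
largest = max(changes, key=changes.get)
print(largest, round(changes[largest], 2))  # Egypt 11.83
```

The same calculation applies to the other regions, for example (36522 - 35525) / 35525 for Gabon, which gives about 2.81%.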

Fig. 19

Actual and forecasted COVID-19 cumulative positive cases for Northern Africa

Central Africa

In Fig. 20, the forecasted cases for the five Central African countries have been plotted. In Cameroon, the cases are expected to drop slightly and then settle at a constant rate of increase. In Gabon and Equatorial Guinea, a gradual increase is expected, while in Chad and São Tomé and Príncipe, a constant rate of change in the cases is expected. At the end of the forecasting period in Cameroon, the number of cases is expected to decrease from 102499 to 102129. In the Central African region, Cameroon is the only country with an expected decrease in the number of cases.

Fig. 20

Actual and forecasted COVID-19 cumulative positive cases for Central Africa

The rest of the countries are expected to experience an increase in the number of cases. Cases are expected to increase from 35525 to 36522, 5069 to 5072, 13368 to 13508, and 3714 to 3717 in Gabon, Chad, Equatorial Guinea, and São Tomé and Príncipe, respectively. The largest increase in the number of cases in this region is expected to occur in Gabon, with an expected percentage increase of 2.81%.

Southern Africa

For the sake of clarity, countries from the Southern African region were separated into two plots showing the forecasted cumulative cases. This is because the number of cases in South Africa is so much larger than in the rest of the countries in this region that, on a shared axis, the curves for the other countries would be compressed together and could not be examined. In Fig. 21, a plot of the actual and forecasted cumulative cases for seven countries in the Southern African region is provided. According to this figure, the expected rate of increase in the cumulative positive cases is higher in Angola than in the rest of the countries. Angola is followed by Lesotho, with a moderate rate of increase in the number of cumulative cases, and then by Botswana, with a small but notable increase. The rest of the countries are observed to maintain a constant number of cases with insignificant increases.

Fig. 21

Actual and forecasted COVID-19 cumulative positive cases for Southern Africa (a)

Figure 22 is a continuation of Fig. 21, which also shows a plot of the forecasted cases and actual cases for three countries in the Southern African region. In both Zambia and Mozambique, the number of cumulative cases is expected to maintain a constant course while a significant gradual increase in the number of cumulative cases is expected to occur. At the end of the forecasting period among the countries of this region, it is only in Mozambique that the number of COVID-19 cumulative cases is expected to decrease from 151292 to 151051. In the rest of the countries, the cases are expected to increase. In Angola, Botswana, Malawi, Namibia, South Africa, Zambia, Eswatini, Lesotho, and Zimbabwe, the number of cases is expected to increase from 64433 to 76655, 186594 to 193024, 61796 to 63201, 128886 to 129401, 209734 to 210955, 46421 to 46874, 21635 to 24334, and 132977 to 133267, respectively. In this region, the highest percentage increase is observed to be 18.97% from Angola.

Fig. 22

Actual and forecasted COVID-19 cumulative positive cases for Southern Africa (b)

Eastern Africa

Forecasted cases in the Eastern region of Africa have been plotted in two separate graphs (Figs. 23 and 24). This made it possible to analyze and observe clearly the forecasted cases in all countries studied in this region.

Fig. 23

Actual and forecasted COVID-19 cumulative positive cases for Eastern Africa (a)

Fig. 24

Actual and forecasted COVID-19 cumulative positive cases for Eastern Africa (b)

In Fig. 23, a plot of the actual and forecasted cases for seven countries from the Eastern African region has been given. This forecast has been produced by the top-performing model, which is the LSTM in most countries. According to this forecast, in two countries, Rwanda and Mauritius, the rate of increase of cumulative positive cases is expected to rise gradually. Apart from these two countries and Djibouti, which is expected to keep the same number of cases, the rest of the countries are expected to have small fluctuations in the number of cases.

In Fig. 24, five countries in the Eastern region of Africa have been shown with their respective COVID-19 cumulative positive cases. In Kenya, a constant number of cases is expected, while in Ethiopia and Somalia, a notable increase is expected to occur. In both Uganda and Sudan, a small increase, which will be followed by a small but significant decrease, is expected to take place.

At the end of the forecasting period in Djibouti, the cases are expected to remain constant: the last number in the original dataset was 13478 cases, which is expected to remain the same at the end of the forecast. In Eritrea, a small decrease is expected, from 6834 to 6820 cases. In the rest of the countries, an increase is expected by the end of the forecasting period. In Uganda, Sudan, Madagascar, Kenya, South Sudan, Somalia, Rwanda, Mauritius, Ethiopia, and Comoros, cases are expected to increase from 126236 to 127628, 40433 to 40598, 43626 to 44150, 253310 to 253901, 12410 to 12761, 21998 to 24356, 99698 to 102205, 17812 to 18297, 365167 to 377935, and 4259 to 4472, respectively. The highest expected increase in the cumulative number of cases is observed in Somalia, with a 10.72% expected percentage increase.
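The grouping of countries into expected increase, decrease, and no change can be reproduced directly from the quoted start and end values. The following Python sketch does this for the Eastern African figures above; the function name `classify_trends` and the group labels are illustrative choices, not part of the study.

```python
# (last reported cases, forecasted cases at the end of the period),
# as quoted in the text for Eastern Africa.
eastern_africa = {
    "Djibouti": (13478, 13478),
    "Eritrea": (6834, 6820),
    "Uganda": (126236, 127628),
    "Sudan": (40433, 40598),
    "Madagascar": (43626, 44150),
    "Kenya": (253310, 253901),
    "South Sudan": (12410, 12761),
    "Somalia": (21998, 24356),
    "Rwanda": (99698, 102205),
    "Mauritius": (17812, 18297),
    "Ethiopia": (365167, 377935),
    "Comoros": (4259, 4472),
}


def classify_trends(forecasts):
    """Group countries by the sign of the change over the forecast period."""
    groups = {"increase": [], "decrease": [], "constant": []}
    for country, (start, end) in forecasts.items():
        if end > start:
            groups["increase"].append(country)
        elif end < start:
            groups["decrease"].append(country)
        else:
            groups["constant"].append(country)
    return groups


groups = classify_trends(eastern_africa)

# Largest relative increase among the countries expected to grow.
top = max(
    groups["increase"],
    key=lambda c: (eastern_africa[c][1] - eastern_africa[c][0])
    / eastern_africa[c][0],
)
```

With these inputs, Djibouti falls in the constant group, Eritrea in the decrease group, and the remaining ten countries in the increase group, with Somalia showing the largest relative increase.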

Western Africa

The forecasted cases from the Western African countries were divided into two groups. As shown in Figs. 25 and 26, six countries were plotted together in each group. This was done to group countries with similar numbers of cumulative cases, allowing a clear analysis of the results from the forecasting stage.

Fig. 25

Actual and forecasted COVID-19 cumulative positive cases for Western Africa (a)

Fig. 26

Actual and forecasted COVID-19 cumulative positive cases for Western Africa (b)

In Fig. 25, six countries from the Western region of Africa are shown with their respective forecasted and actual cumulative cases. According to this figure, the expected cases in Guinea will have a small increase, immediately followed by a generally constant number of cases. In the other five countries, a constant number of cases is expected, with small fluctuations by the end of the forecasting period. Since all countries in this figure maintained their respective courses in the number of cases, countries with a higher number of cases before forecasting maintained these higher numbers after forecasting. Guinea, with the highest number of actual cases, is still expected to have the highest number of forecasted cases, as depicted in Fig. 25. Since no significant decrease is expected in the forecasted cases, this still presents a great risk for the region if preemptive measures are not taken.

In Fig. 26, the remaining six countries from the Western region of Africa are given, including the forecasted and actual cases in each state. A significant increase in the expected cases in Mali is observed, while in the rest of the countries, a constant number of cases with minor fluctuations is observed.

In the Gambia, a very small decrease is expected in the forecasted number of cumulative cases at the end of the forecasting period, from 9967 to 9964. In the other countries in the Western region, an increase in the number of cases is expected. The COVID-19 cumulative positive cases are expected to increase from 6366 to 6565, 6134 to 6151, 30653 to 30909, 14793 to 14848, 26079 to 26195, 6398 to 6408, 73917 to 74171, 211961 to 214460, 16074 to 19734, 5815 to 5838, and 130077 to 131347 in Niger, Guinea-Bissau, Guinea, Burkina Faso, Togo, Sierra Leone, Senegal, Nigeria, Mali, Liberia, and Ghana, respectively. According to these results, the highest expected percentage increase, 22.77%, is expected to occur in Mali.

Conclusions and Suggestions

This study involves the forecasting of COVID-19 cumulative positive cases in countries from the five major regions of the African continent, which include the Northern, Eastern, Western, Central, and Southern regions. To contain and control the spread of the COVID-19 pandemic, there is a great need for strategies that can predict beforehand the future course that the pandemic might take. This would enable authorities to plan ahead of time and allocate resources effectively and efficiently to more critical areas. There is a significant gap in the literature for studies that consider a continent's perspective, especially in Africa, when dealing with the forecasting of COVID-19. This study aimed to close this gap by focusing on the forecasting and investigation of the expected future COVID-19 cumulative positive cases for a period of sixty-one days. From the forecasted values, this study also aimed to identify the most critical states in each of the five major regions, namely those with the highest expected percentage increase in the number of cases.

To achieve these objectives, this study employed both statistical and deep learning approaches, using three prediction models: the ARIMA, Prophet, and LSTM models. In a comparative analysis of the performance of these three models, seven performance metrics were used: the MSE, RMSE, MAPE, SMAPE, R2 score, NRMSE, and PSNR. The best-performing model was then selected to forecast the future COVID-19 cumulative positive cases over a 61-day horizon. In this study, the best-performing model was the LSTM model, while the worst-performing model was the Prophet model. The highest expected increase in the number of cases in the Western African region is 22.77%, in Mali. In the Southern region, the highest expected increase is 18.97%, in Angola. The highest expected increase in the Northern region is expected to take place in Egypt, at 11.83%. In the Eastern region, the highest increase of 10.72% is expected to occur in Somalia. Lastly, in the Central African region, the highest expected increase is 2.81%, in Gabon. There is a need for studies that consider the influence of population demographics on the spread of COVID-19.