Introduction

In December 2019, a type of new pneumonia of unknown etiology initially occurred in the city of Wuhan, China, and soon afterward, Wuhan became the epicenter of the outbreak of this disease, later named as coronavirus disease 2019 (COVID-19) caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)1,2. Since then, COVID-19 has been bombarding almost every corner of the world for just two months and has become a universal pandemic3,4. COVID-19 is highly contagious and has caused a series of massive negative effects on economic progress, people’s lives and health around the globe, and it has been identified as being the foremost global public health crisis since the twentieth century5,6. As of June 27, 2020, the outbreak has resulted in a great tragedy with overall 9,653,048 confirmed cases and 491,128 deaths in more than 200 countries of our planet2. The current reported cases and deaths may be underestimated in the seriously affected regions to a great extent as there are limited medical and health resources that satisfy the requirement of the epidemiological surveillance and detection7, and it is estimated that the present epidemiological trend may still be rising exponentially in the near future2. Such an emergency has raised many significant issues associated with the spreading dynamics, the alleviation, along with the response strategies and measures of this public health emergency of international concern. Unfortunately, because of the new nature of the SARS-CoV-2, there is still an absence of enough knowledge regarding this virus and an absence of clinical treatment determined and vaccines available, leading to greater uncertainty in the decision-making process. In this scenario, an accurate estimate based on mathematical and statistical techniques can provide a basis for the formulation of effective planning to better tackle the societal, economical, cultural, and public health issues related to this pandemic8,9. 
Such an estimate is also crucial for directing the intensity and type of interventions required to mitigate this public health emergency10.

Time series analysis is instrumental in understanding the past epidemic patterns of diseases and in forecasting upcoming epidemiological trends from the regularities inherent in the past and current values of the target series, using different modeling methods4,7,11. Over the past decades, different time series modeling techniques with high reliability have been employed for various forecasting purposes. More recently, a large and growing body of literature has investigated the usefulness of statistical methods for forecasting the transmission of COVID-19 in order to serve as a reference for mitigating the outbreak, and some of these studies have played an important role in containing its spread. For example, many current prevention and control measures (e.g., social distancing, wearing face masks, isolation and observation of cases and close contacts, the establishment of mobile cabin hospitals, lockdown of areas or countries, travel restrictions and border control, and human mobility restrictions) were formed on the basis of model forecasts4,12,13,14,15,16,17,18. Commonly used modeling methods include the autoregressive integrated moving average (ARIMA) model4,7,19,20,21,22,23,24, genetic programming25, a simple growth model26, support vector regression27, an unbiased hierarchical Bayesian estimator approach28, the susceptible-exposed-infected-recovered (SEIR) model28, linear regression models29, and the stereographic Brownian diffusion epidemiology model (SBDiEM)30. Time series data are often affected by many potential determinants, which leads to complicated linear and nonlinear interactions together with non-stationarity in the data31.
For this reason, the methods mentioned above cannot exploit these components simultaneously, because each assumes either a linear or a nonlinear structure, and their results are therefore difficult to generalize. To improve forecasting reliability, an alternative approach should be tailored to handle both the tendency (linear component) and the randomness (nonlinear component). Motivated by this idea, researchers have developed hybrid models that integrate linear models with nonlinear models (e.g., ARIMA-generalized regression neural network [GRNN], ARIMA-backpropagation neural network [BPNN], and autoregressive [AR]-time delay neural network [TDNN] hybrid models)32,33,34, which may yield better forecasts by drawing on the strengths of each method. In such traditional ensemble architectures, the ARIMA or AR model is often used to capture the linear dependency structure in a time series, and the residuals of the linear fit are then assumed to contain the nonlinear component, which can be captured by artificial neural network models (ANNs)34,35. However, this assumption may underestimate the relationship between the linear and nonlinear patterns in a time series because the association between the two may not be additive32. Moreover, the residuals from the linear models may not contain a valid nonlinear component32. Importantly, recently published papers have also demonstrated that the traditional hybrid methods do not necessarily improve on the individual methods32,35,36. The challenge in developing an effective hybrid prediction model is therefore how to identify the underlying linear and nonlinear patterns in a time series.

Wavelet analysis has attracted much attention as a flexible and useful tool able to diagnose high-frequency traits and to extract worthy information especially when time series is characterized by non-stationarity and non-linearity because this analysis has a powerful potential to discern exceptional events by time-localized frequency analysis4,37,38. More recently, researchers have developed a novel wavelet decomposition technique-ensemble empirical mode decomposition (EEMD) based on the empirical mode decomposition (EMD) for filtering and handling time series preliminarily, which is capable of overcoming the mode mixing weaknesses of the EMD39,40. Unlike the conventional discrete wavelet transform methods that require and predetermine basis functions, causing different decomposition results, EEMD is a self-adaptive, empirical, direct, and intuitive data processing technique, particularly appropriate for handling the non-stationary and non-linear data patterns41,42. And many hybrid models that adopt a combination of the EEMD and some algorithms have produced satisfactory results in the time series forecasting field. For instance, Zhou et al. built a mixture model by combining the EEMD and a general regression neural network to predict the PM2.5 concentrations43. Wang et al. constructed an EEMD decomposition-based ARIMA to improve the prediction reliability level of the annual runoff time series41. Wang et al. applied the backpropagation network model based on EEMD decomposition to hydrological time series in order to improve the medium and long-term forecasting accuracy level41. However, the above-referenced models are only a simple ensemble architecture comprising either a basic linear or nonlinear model based on the EEMD technique, which is unable to consider both linear and nonlinear components in a time series simultaneously despite a performance improvement over the basic models by use of these ensemble architectures. 
Motivated by the “decomposition and ensemble” idea based on the EEMD method, a promising alternative is to develop an ensemble architecture by integrating the linear trait with the nonlinear trait decomposed by the EEMD method using an adequate linear model and nonlinear model44. By doing so, this new ensemble architecture is capable of capturing both components in a time series simultaneously.

In time series forecasting, the ARIMA model is the most widely used method for handling linear information, whereas ANNs are adept at solving nonlinear problems. Among ANN models, the nonlinear autoregressive artificial neural network (NARANN) has been shown to have excellent simulation and prediction performance because its tapped delay lines give it an embedded memory function45. Therefore, the present study developed a novel mixture prediction model that draws on the respective strengths of the EEMD, ARIMA, and NARANN to estimate the epidemiological trends of COVID-19 prevalence and mortality in South Africa and Nigeria, the two countries hardest hit by the outbreak in Africa2,46. Specifically, the EEMD technique was first applied to decompose the daily prevalence and mortality series into several Intrinsic Mode Functions (IMFs) together with a residue subseries representing the trend of the data. Second, the IMFs were modeled using appropriate NARANN methods, whereas the residue was modeled with a suitable ARIMA model. Finally, the predictions of the proposed hybrid model were obtained by combining those of the component NARANN and ARIMA models44. Given the lack of adequate health infrastructure and services in many regions of Africa, such estimates can elucidate the spreading dynamics of the outbreak and help government institutions and policymakers plan the additional materials and resources needed to keep the outbreak under control. Such estimates may also help local people to relieve the socioeconomic and psychosocial pressures and distress related to the COVID-19 pandemic.

Material and methods

Data source

This research focused on daily time series of COVID-19 prevalence and mortality. The overall diagnosed COVID-19 cases and death tolls between 28 February 2020 and 27 June 2020 were taken from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19) and the COVID-2019 situation reports by the WHO (https://www.who.int/emergencies/diseases). Often, at least 50 observations, and preferably 100 or more, are required to construct an adequate and effective model47. Thus, the datasets used in this study were divided into two parts: the subset from 28 February 2020 through 15 June 2020 was treated as the training horizon (109 observations), and the remainder, from 16 June 2020 through 27 June 2020, as the prediction horizon (12 observations).
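The split described above can be checked with a short sketch. The dates come from the text; the variable and function names are illustrative assumptions, not part of the study's code.

```python
from datetime import date, timedelta

def daily_range(start, end):
    """Return every calendar day from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Dates taken from the text; names are hypothetical.
STUDY_START = date(2020, 2, 28)
TRAIN_END = date(2020, 6, 15)
STUDY_END = date(2020, 6, 27)

all_days = daily_range(STUDY_START, STUDY_END)
train_days = [d for d in all_days if d <= TRAIN_END]   # training horizon
test_days = [d for d in all_days if d > TRAIN_END]     # prediction horizon

print(len(train_days), len(test_days))  # 109 12
```

The counts agree with the 109 training and 12 prediction observations stated above.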

The study protocol was approved by the research institutional review board of the Xinxiang Medical University (No: XYLL-2019072). All relevant guidelines were followed for the study. Ethical approval is not warranted for this research as these data without personal information are publicly available around the globe and the same is approved by the CSSE and WHO.

ARIMA model

The ARIMA model has been the most frequently used forecasting tool in health care because of its simple structure, flexible applicability, and ability to interpret a given time series7. Assuming that a linear pattern exists between past and future observations, the ARIMA model uses this pattern to predict epidemic trends in the near future4,48. An ARIMA (p, d, q) model has three components, where p, d, and q denote the order of the autoregressive (AR) part, the degree of non-seasonal differencing, and the order of the moving average (MA) part, respectively. The model is usually established in four steps. First, an augmented Dickey–Fuller (ADF) test is applied to the original data to assess stationarity; if the series is non-stationary, differencing is used to achieve stationarity48,49. Second, rough values of the key parameters (p, d, and q) are determined from the autocorrelation function (ACF) and partial ACF (PACF) plots of the differenced series. Among all candidate models, the one with the largest log-likelihood and the lowest Akaike information criterion (AIC), consistent AIC (CAIC), and Bayesian information criterion (BIC) is preferred50. Third, the adequacy of the identified model is checked with statistical diagnostics, including the Ljung-Box Q test, the ACF and PACF plots, and the t-test; the model is considered suitable if the residuals behave like a white-noise series under the Ljung-Box Q test and the estimated parameters are statistically significant under the t-test51. Finally, the preferred ARIMA model is used to produce out-of-sample forecasts.
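Two of the building blocks named in these steps, differencing and the Ljung-Box Q statistic, are simple enough to sketch in plain Python. This is a minimal illustration under stated assumptions, not the software used in the study: the function names are hypothetical, the toy series is made up, and the p-value (which would come from a chi-square distribution) is deliberately left out.

```python
def difference(series, d=1):
    """Apply d rounds of first-order differencing."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out

def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag (series must not be constant)."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom

def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic: Q = n(n+2) * sum_{k=1..h} r_k^2 / (n - k)."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )

# Toy cumulative series with a quadratic trend (made-up numbers, not study data).
toy = [t * t for t in range(1, 8)]      # 1, 4, 9, ..., 49
print(difference(toy, d=2))             # [2, 2, 2, 2, 2]
```

Two rounds of differencing reduce the quadratic trend to a constant, which is the role played by d = 2 in an ARIMA (p, 2, q) specification. A large Q relative to the chi-square reference indicates that the residuals are not white noise.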

NARANN model

ANNs can approximate arbitrarily complex non-stationary series to any desired accuracy thanks to their flexible nonlinear mapping ability52. The NARANN, with its time-varying state of interconnected neurons, is an important dynamic recurrent ANN model and therefore has the inherent attributes of ANNs (e.g., powerful nonlinear mapping capacity, self-learning and adaptation ability, along with generalization and fault-tolerant ability)33,53. Further, the NARANN model has a long- or short-term memory function: the tapped delay line retains the prior inputs, outputs, and network structures, which gives the model a dynamic modeling capability for time-dependent series33. A NARANN model can be written in the form below

$$X_{t} = f(x(t - 1),x(t - 2), \ldots ,x(t - d))$$
(1)

where \(X_{t}\) is the NARANN forecast at time t, \(x(t - 1), \ldots ,x(t - d)\) are the d most recent values of the series, d is the number of feedback delays, and f is the nonlinear mapping learned by the network.
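To make the tapped delay line in Eq. (1) concrete, the sketch below builds the lagged input rows and their targets from a series. The function name `make_lagged` is a hypothetical helper for illustration; the study itself used MATLAB.

```python
def make_lagged(series, d):
    """Build (inputs, targets) pairs for Eq. (1).

    Each input row is [x(t-1), x(t-2), ..., x(t-d)] and its target is x(t).
    """
    inputs, targets = [], []
    for t in range(d, len(series)):
        inputs.append([series[t - k] for k in range(1, d + 1)])
        targets.append(series[t])
    return inputs, targets

inputs, targets = make_lagged([1, 2, 3, 4, 5], d=2)
print(inputs)   # [[2, 1], [3, 2], [4, 3]]
print(targets)  # [3, 4, 5]
```

With d delays, the first d observations serve only as inputs, so a series of length n yields n − d training pairs.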

In this study, the modeling procedure consisted of three steps. First, the whole dataset was divided into two blocks: training samples (from 28 February 2020 to 15 June 2020) and testing samples (from 16 June 2020 to 27 June 2020). To develop an effective and accurate NARANN model, the training samples were further partitioned at random into training (80% of the training samples), validation (10%), and testing (10%) subsets by use of the dividerand function in MATLAB software. Second, the number of hidden neurons and the number of delays d were selected by trial and error, with the network trained by the Levenberg–Marquardt algorithm in open-loop form33. For each candidate, the response plot between the estimated outputs and targets, the ACF plot of the errors, the mean square error (MSE), and the correlation coefficient (R) were examined until the best possible specification was determined53. Finally, the trained open-loop network was closed to make multi-step-ahead forecasts.
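The closed-loop step can be sketched as a simple recursion in which each prediction is fed back into the delay line. This is a minimal illustration, not the MATLAB routine used in the study: `closed_loop_forecast` and `toy_model` are hypothetical names, and `toy_model` is a stand-in for a trained network.

```python
def closed_loop_forecast(f, history, d, steps):
    """Forecast `steps` values by feeding each prediction back as an input.

    `f` maps [x(t-1), ..., x(t-d)] to a one-step-ahead prediction.
    """
    window = list(history[-d:])          # most recent d observations, oldest first
    forecasts = []
    for _ in range(steps):
        lags = window[::-1]              # newest first: x(t-1), ..., x(t-d)
        y = f(lags)
        forecasts.append(y)
        window = window[1:] + [y]        # slide the delay line forward
    return forecasts

# Stand-in for a trained network: linear extrapolation from the last two values.
def toy_model(lags):
    return 2 * lags[0] - lags[1]

print(closed_loop_forecast(toy_model, [1, 2, 3], d=2, steps=3))  # [4, 5, 6]
```

Because later forecasts depend on earlier ones, errors can accumulate over the horizon, which is one reason a short prediction horizon is easier to forecast reliably.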

A hybrid model of EEMD-ARIMA-NARANN

EEMD

Although the EMD method has been widely employed to deal with noisy nonlinear and non-stationary processes in signal analysis, it suffers from two major shortcomings in applications: edge effects and mode mixing39,54,55. Mode mixing in particular can not only mix vibration modes of different scales but also cause the decomposed IMFs to lose their physical meaning40. To compensate for these weaknesses, the EEMD technique was introduced as an extension of the EMD method39. EEMD resolves the mode-mixing issue by defining each IMF as the average over an ensemble of trials, each of which consists of the signal plus white noise of finite amplitude54. The decomposition process of the EEMD approach is as follows:

First, a white noise series \(w(t)\) is added to the original series \(x(t)\), and the new time series is defined as

$$Y(t) = x(t) + w(t)$$
(2)

Second, this new time series is decomposed into IMFs by use of the EMD method.

Third, the first and second steps are repeated, each time adding a different white noise series to the original time series.

Finally, the IMFs obtained across the ensemble of trials are averaged to give the final IMFs.
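The ensemble logic of these steps can be sketched independently of the EMD sifting itself. In the sketch below, `ensemble_decompose` is a hypothetical wrapper that accepts any decomposition routine, and `toy_decompose` is only a stand-in (a moving-average split, not real EMD) so that the example runs on its own. It also assumes the decomposer returns the same number of components on every trial, which a real EMD implementation does not guarantee. The defaults of 100 trials and a noise amplitude of 0.2 standard deviations follow the values quoted in the text.

```python
import random

def ensemble_decompose(series, decompose, n_ensembles=100, noise_scale=0.2, seed=0):
    """Average the components returned by `decompose` over noisy copies of `series`.

    `noise_scale` is the noise amplitude as a fraction of the series' standard deviation.
    """
    rng = random.Random(seed)
    n = len(series)
    mean = sum(series) / n
    std = (sum((x - mean) ** 2 for x in series) / n) ** 0.5
    totals = None
    for _ in range(n_ensembles):
        # Step 1: add a fresh white noise series to the original data.
        noisy = [x + rng.gauss(0.0, noise_scale * std) for x in series]
        # Step 2: decompose the noisy copy.
        comps = decompose(noisy)
        if totals is None:
            totals = [[0.0] * n for _ in comps]
        for acc, comp in zip(totals, comps):
            for i, v in enumerate(comp):
                acc[i] += v
    # Final step: average each component over the ensemble.
    return [[v / n_ensembles for v in acc] for acc in totals]

def toy_decompose(series):
    """Stand-in for EMD (NOT real sifting): 3-point moving-average trend plus detail."""
    n = len(series)
    trend = []
    for i in range(n):
        window = series[max(0, i - 1):min(n, i + 2)]
        trend.append(sum(window) / len(window))
    detail = [x - t for x, t in zip(series, trend)]
    return [detail, trend]

series = [float(t % 5) for t in range(30)]   # made-up data
detail, trend = ensemble_decompose(series, toy_decompose)
print(len(detail), len(trend))
```

In practice the `decompose` argument would be a genuine EMD routine; the wrapper only shows where the noise is added and where the averaging happens.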

At the decomposition stage, the number of ensemble members and the amplitude of the added white noise series are crucial for the results43. Fortunately, these two parameters can be determined by use of a well-established statistical rule39

$$\varepsilon_{n} = \frac{\varepsilon }{\sqrt{N}}$$
(3)

where N is the number of ensemble members, \(\varepsilon\) is the amplitude of the added white noise series, and \(\varepsilon_{n}\) is the standard deviation of the error. It has been shown that the EEMD technique gives satisfactory results with an ensemble size of 100 and an added white noise amplitude of 0.2 times the standard deviation of the data39,56.

EEMD-ARIMA-NARANN mixture model

To make full use of the constituent linear and nonlinear components in the target series, and inspired by the “decomposition and ensemble” idea of the EEMD method and the powerful, flexible nonlinear mapping capacity of the NARANN method57, the EEMD-ARIMA-NARANN mixture method was constructed. In this model-building process, the prevalence and mortality time series of COVID-19 were first decomposed into several IMFs and a residue term. Then, each IMF was modeled by use of an adequate NARANN method, whereas the residue term was modeled by use of an adequate ARIMA method. Finally, the results of the proposed mixture method were obtained by combining the forecasts from the ARIMA and NARANN models (Fig. 1). In this way, the new data-driven mixture technique can capture both linear and nonlinear patterns in the prevalence and mortality series of COVID-19 simultaneously. The proposed EEMD-ARIMA-NARANN mixture method can be expressed as

$$\hat{b}_{t} = \sum\limits_{i = 1}^{N} f(IMF_{i} (t - 1), \ldots ,IMF_{i} (t - d)) = f(IMF_{1} (t - 1), \ldots ,IMF_{1} (t - d)) + \cdots + f(IMF_{N} (t - 1), \ldots ,IMF_{N} (t - d))$$
(4)
$$\hat{y} = {\hat{\text{a}}}_{{\text{t}}} + \hat{b}_{t}$$
(5)

where \(\hat{y}\) is the estimate from the EEMD-ARIMA-NARANN mixture technique, \({\hat{\text{a}}}_{{\text{t}}}\) is the estimate from the ARIMA model fitted to the residue term, \(\hat{b}_{t}\) is the sum of the estimates from the NARANN models fitted to the IMFs, and N in Eq. (4) denotes the number of IMFs.
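Read this way, Eqs. (4) and (5) amount to an element-wise sum of the component forecasts. The sketch below illustrates the combination step only; the function name and all numbers are made up for illustration.

```python
def combine_forecasts(residue_forecast, imf_forecasts):
    """Eq. (5): add the residue forecast to the sum of the per-IMF forecasts (Eq. 4)."""
    horizon = len(residue_forecast)
    if any(len(f) != horizon for f in imf_forecasts):
        raise ValueError("all forecasts must cover the same horizon")
    b_hat = [sum(f[t] for f in imf_forecasts) for t in range(horizon)]
    return [a + b for a, b in zip(residue_forecast, b_hat)]

# Made-up numbers for a 3-step horizon with two IMFs.
a_hat = [100.0, 110.0, 120.0]                 # ARIMA forecasts of the residue
imfs = [[1.0, -2.0, 0.5], [-0.5, 0.5, 1.5]]   # NARANN forecasts of each IMF
print(combine_forecasts(a_hat, imfs))  # [100.5, 108.5, 122.0]
```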

Figure 1

Flow chart of the novel data-driven EEMD-ARIMA-NARANN mixture method.

Assessing model performance

In this study, four statistical measures of error, namely the root mean square percentage error (RMSPE), mean absolute deviation (MAD), mean error rate (MER), and mean absolute percentage error (MAPE), were calculated to evaluate the accuracy of forecasts. Smaller values of these measures indicate a better model.

$${\text{RMSPE}} = \sqrt {\frac{1}{N}\sum\limits_{i = 1}^{N} {\left( {\frac{{X_{i} - \hat{X}_{i} }}{{X_{i} }}} \right)^{2} } }$$
(6)
$${\text{MAD}} = \frac{1}{N}\sum\limits_{i = 1}^{N} {\left| {X_{i} - \hat{X}_{i} } \right|}$$
(7)
$${\text{MER}} = \frac{{\frac{1}{N}\sum\limits_{i = 1}^{N} {\left| {X_{i} - \hat{X}_{i} } \right|} }}{{\overline{X}_{i} }}$$
(8)
$${\text{MAPE}} = \frac{1}{N}\sum\limits_{i = 1}^{N} {\frac{{\left| {X_{i} - \hat{X}_{i} } \right|}}{{X_{i} }} \times 100}$$
(9)

Here, \(X_{i}\) denotes the observed prevalence or mortality data of COVID-19, \(\hat{X}_{i}\) the corresponding estimates from the chosen approach, \(\overline{X}_{i}\) the mean of the observed prevalence or mortality data, and \(N\) the number of simulated or forecasted values.
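The four measures can be written directly from Eqs. (6) to (9). The sketch below assumes that RMSPE compares each observation with its estimate \(\hat{X}_{i}\) and that MER divides MAD by the mean of the observed values; the function names and the example numbers are illustrative, not study data.

```python
from math import sqrt

def rmspe(actual, predicted):
    """Root mean square percentage error, as a fraction."""
    n = len(actual)
    return sqrt(sum(((x - p) / x) ** 2 for x, p in zip(actual, predicted)) / n)

def mad(actual, predicted):
    """Mean absolute deviation."""
    return sum(abs(x - p) for x, p in zip(actual, predicted)) / len(actual)

def mer(actual, predicted):
    """Mean error rate: MAD divided by the mean of the observed values."""
    return mad(actual, predicted) / (sum(actual) / len(actual))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    n = len(actual)
    return 100 * sum(abs(x - p) / x for x, p in zip(actual, predicted)) / n

# Made-up observations and forecasts.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(mad(actual, predicted))  # 10.0
```

RMSPE and MAPE divide by each observation, so they are undefined when an observed value is zero; MAD and MER do not have this restriction.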

Results

Development of the ARIMA model

During the study span, the overall confirmed cases totaled 124,590 in South Africa and 23,298 in Nigeria, with daily means of 1030 and 193 cases, respectively. Among these, there were 2340 deaths in South Africa and 554 deaths in Nigeria, with daily means of 20 and 5 deaths, respectively. As shown in Fig. 2, the prevalence and mortality time series displayed an apparent increasing trend, so differencing was required to remove the trend from these target series. After differencing, an ADF test was applied to the differenced series, and the resulting statistics (Table S1) indicated that the series were stationary. The possible values of the ARIMA models’ key parameters were then roughly determined from these stationary series. As shown in Table 1, the sparse coefficient ARIMA (2, 2, (1, 3)) (AIC = 1482.590, CAIC = 1483.441, BIC = 1498.642, and log-likelihood = −736.290) and ARIMA (0, 2, (1, 3, 4)) (AIC = 733.390, CAIC = 733.980, BIC = 746.750, and log-likelihood = −362.690) specifications were considered the best models for simulating the prevalence and mortality data, respectively, in South Africa, because they gave the lowest AIC, CAIC, and BIC values and the greatest log-likelihood among all candidate models. Furthermore, as shown in Tables 2 and 3 and Fig. 3, the parameters of the best-fitting ARIMA models were statistically significant (p < 0.05), and the Box-Ljung Q tests on the error series from these models showed no statistical significance at different lags (p > 0.05), indicating that the identified ARIMA models are adequate for modeling the target data. Similarly, diagnostic checking of the residuals for the prevalence and mortality data in Nigeria (Tables 1, 2, 3 and Fig. 3) demonstrated that the ARIMA (1, 2, 2) and sparse coefficient ARIMA (0, 2, (1, 2, 4)) models were suitable for modeling the prevalence and mortality data, respectively, in Nigeria. Accordingly, these preferred ARIMA models can be used to forecast the epidemics in the coming days.

Figure 2

Time series plots showing the prevalence and mortality data of COVID-19 in South Africa and Nigeria. (A) The overall confirmed cases in these two countries; (B) The overall deaths in these two countries.

Table 1 The possible tested ARIMA models for the prevalence and mortality time series of COVID-19 in South Africa and Nigeria.
Table 2 The identified parameters of the best-fitting ARIMA models for the prevalence and mortality time series of COVID-19 in South Africa and Nigeria.
Table 3 Box-Ljung Q test for the residual series from the best ARIMA and NARANN models.
Figure 3

Autocorrelogram (ACF) and partial autocorrelogram (PACF) of the residuals generated by the best ARIMA model. (A) Sample ACFs and PACFs of the residuals for the prevalence data in South Africa; (B) Sample ACFs and PACFs of the residuals for the mortality data in South Africa; (C) Sample ACFs and PACFs of the residuals for the prevalence data in Nigeria; (D) Sample ACFs and PACFs of the residuals for the mortality data in Nigeria. As shown, almost all sample ACFs and PACFs fell within the estimated 95% uncertainty levels across different lags, except for the sample ACFs at lags 7, 8, and 15 and the sample PACFs at lags 7 and 8 in (A), the sample PACFs at lags 12 and 17 in (B), and the sample ACF and PACF at lag 15 in (C). These exceptions are reasonable because some higher-order correlation coefficients readily exceed the estimated 95% uncertainty levels by chance. These results indicate that the residuals from the identified ARIMA models for the different datasets showed no pattern, suggesting that the selected ARIMA models are suitable for capturing the dynamic dependency structure in the target series.

Construction of the NARANN model

To obtain the preferred NARANN model, networks with 1 to 20 hidden units and 1 to 6 feedback delays were trained by trial and error. The NARANN with 15 hidden units and 6 delays and the NARANN with 14 hidden units and 5 delays were identified as the optimal specifications for simulating the prevalence and mortality data, respectively, in South Africa. Among all candidate models, the NARANN (15,6) and NARANN (14,5) specifications showed the lowest MSE values in the training (2648.213 and 9.710, respectively), validation (1595.504 and 12.849, respectively), and testing (8647.196 and 24.024, respectively) subsets, along with the greatest R values in the training (1 and 1, respectively), validation (1 and 1, respectively), and testing (1 and 1, respectively) subsets of the prevalence and mortality data (Tables 3 and 4, Figures S1 and S2). Moreover, almost all autocorrelation coefficients of the resulting errors fell within the estimated 95% uncertainty level (UL) at different lags, and the response plots between inputs and outputs showed that the residuals fluctuated within an acceptable range in their corresponding subsets (Figs. 4, 5). These results indicated that the two identified NARANN specifications offered reliable estimates for the prevalence and mortality series in South Africa. Likewise, following the same modeling steps, we determined the NARANN (15,6) and NARANN (14,6) specifications as the best for fitting the prevalence and mortality data, respectively, in Nigeria, and the statistical checks showed that these specifications were also appropriate (Tables 3, 4, Figs. 4, 5, S3 and S4). Therefore, these best NARANN models can be applied to the target series to generate forecasts for the testing samples.

Table 4 The identified parameters of the best NARANN and EEMD-ARIMA-NARANN hybrid models for different target series.
Figure 4

Autocorrelogram (ACF) of the residuals generated by the best NARANN model. (A) Sample ACFs of the residuals for the prevalence data in South Africa; (B) Sample ACFs of the residuals for the mortality data in South Africa; (C) Sample ACFs of the residuals for the prevalence data in Nigeria; (D) Sample ACFs of the residuals for the mortality data in Nigeria.

Figure 5

Time series displaying the response results between inputs and outputs. (A) Response plot between inputs and outputs for the prevalence data in South Africa; (B) Response plot between inputs and outputs for the mortality data in South Africa; (C) Response plot between inputs and outputs for the prevalence data in Nigeria; (D) Response plot between inputs and outputs for the mortality data in Nigeria. These plots show which samples were treated as the training, validation, and testing datasets, and illustrate the corresponding errors between inputs and targets. The vast majority of data points had small errors between inputs and targets, indicating that the identified NARANN methods seem adequate for estimating the epidemiological trends of COVID-19 in the study regions.

Establishment of the EEMD-ARIMA-NARANN hybrid model

Following the decomposition procedure, each original target series was decomposed into several IMFs and a residue (Fig. 6). The residues, which represent the trends of the target series, were used to establish the ARIMA models, and the best-fitting ARIMA models and their goodness-of-fit statistics for the different target series are listed in Table 5. The IMFs, which represent the detailed (nonlinear) information contained in the target series, were used to develop the NARANN models, and the best-fitting NARANN models and their diagnostic testing results for the various IMFs are summarized in Table 4. Each decomposed series was then fitted and predicted with its most appropriate model, and the resulting in-sample simulations and out-of-sample forecasts were summed to obtain the final results of the EEMD-ARIMA-NARANN hybrid model.

Figure 6

Intrinsic Mode Functions (IMFs) subseries via decomposing the original prevalence and mortality time series. (A) The resulting IMFs subseries by decomposing the prevalence series in South Africa; (B) The resulting IMFs subseries by decomposing the mortality series in South Africa; (C) The resulting IMFs subseries by decomposing the prevalence series in Nigeria; (D) The resulting IMFs subseries by decomposing the mortality series in Nigeria.

Table 5 The identified parameters of the best-fitting ARIMA models for the decomposed residue of the COVID-19 prevalence and mortality in South Africa and Nigeria.

Comparisons of forecasting accuracy level between models

Comparing the forecasts for the testing samples from the three selected best-fitting models in the study regions, we found that the EEMD-ARIMA-NARANN mixture model showed the lowest values of the measurement metrics (MAD, MAPE, MER, and RMSPE), with the exception of the RMSPE value for the prevalence data of Nigeria (Table 6). Consequently, we conclude that our proposed mixture model is superior to the basic ARIMA and NARANN models. Further, we re-established the proposed hybrid model on the overall data to forecast the epidemiological trends of the COVID-19 prevalence and mortality over the next 15 days; the resulting best models and final forecasts are shown in Figs. 7, S5 and S6 and Tables S1–S5. The next 15-day forecasts of confirmed cases may be 176,570 (95% UL 173,607 to 178,476) in South Africa and 32,136 (95% UL 31,568 to 32,641) in Nigeria, and the forecasts of total deaths may be 3454 (95% UL 3384 to 3487) in South Africa and 788 (95% UL 775 to 804) in Nigeria (Table S5).

Table 6 Comparisons of the predictive abilities for the testing samples of the prevalence and mortality time series of COVID-19 among these three selected models in South Africa and Nigeria.
Figure 7

The next 15-day forecasts and their 95% uncertainty levels for the prevalence and mortality data using the best-fitting EEMD-ARIMA-NARANN mixture model. (A) The next 15-day forecasts for the prevalence data in South Africa; (B) The next 15-day forecasts for the mortality data in South Africa; (C) The next 15-day forecasts for the prevalence data in Nigeria; (D) The next 15-day forecasts for the mortality data in Nigeria.

Discussion

Effective prevention and control plans are needed to curb and harness the rapid transmission of the COVID-19 outbreak. Early nowcasting and forecasting are essential to forming such plans as the allocation of limited health resources, the timely adjustment of the current intervention strategies, the arrangement of production activities, and even the local economic development30,31,58. For this reason, it is imperative to develop statistical techniques with high forecasting accuracy and reliability. Time series modeling is a useful aid for developing underlying hypotheses to analyze the current epidemic patterns and to predict the spreading dynamics of different diseases in the near future4,7. As far as we are aware, this is the only study to analyze and forecast the epidemiological trends of the COVID-19 prevalence and mortality time series in South Africa and Nigeria by use of a novel data-driven EEMD-ARIMA-NARANN hybrid technique, and a series of modeling experiments indicated that this new hybrid technique produced lower forecasting errors over the basic ARIMA and NARANN methods by comparing the measurement metrics, such as MAD, MAPE, MER, and RMSPE (Table 6). These results meant our proposed hybrid method has a greater potential to track the dynamic dependence characteristics during the epidemic process of COVID-19 relative to the others used in this study, which may act as a profitable tool-supportive for policymakers to develop appropriate prevention and control strategies and measures in both mitigating the outbreak and reducing the deaths due to COVID-19 pandemic. Whilst this hybrid model is also of great value in assessing the effects of the current public interventions. 
For example, if this model forecasts a remarkably higher epidemic level than is actually observed in the coming periods, this suggests that the current measures may be taking effect in the target population; otherwise, it indicates that the current public interventions may need to be reinforced or that additional plans may be needed. In addition, the basic ARIMA and NARANN models also provided high forecasting accuracy for our target data in light of the above four measurement metrics.
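The four measurement metrics named above can be illustrated with a minimal sketch. The definitions below are the conventional ones and are assumptions here, since this section does not restate the formulas: MAD as the mean absolute deviation, MAPE as the mean absolute percentage error, MER as the total absolute error divided by the total of the observations, and RMSPE as the root mean square percentage error. The function names and the sample values are hypothetical.

```python
import math


def mad(actual, predicted):
    """Mean absolute deviation between observations and forecasts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error (as a fraction; assumes nonzero observations)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def mer(actual, predicted):
    """Mean error rate: total absolute error divided by the total of the observations."""
    total_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    return total_error / sum(abs(a) for a in actual)


def rmspe(actual, predicted):
    """Root mean square percentage error (as a fraction; assumes nonzero observations)."""
    squared = sum(((a - p) / a) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


if __name__ == "__main__":
    # Hypothetical observed and forecast values, for illustration only.
    observed = [100.0, 200.0, 400.0]
    forecast = [110.0, 190.0, 400.0]
    for metric in (mad, mape, mer, rmspe):
        print(metric.__name__, round(metric(observed, forecast), 6))
```

Under these definitions, a lower value on each metric indicates a closer fit between forecasts and observations, which is the sense in which the models are ranked above.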

The ARIMA model is among the most versatile methods for fitting time series data. It postulates that there is a certain linear association between the future values of a given series and its past and present states, and thus it can be used to model not only nonseasonal data but also seasonal and nonstationary data48,49. A nonstationary series, however, must first be differenced and/or transformed with a logarithm or square root50. For instance, Yousaf et al. built the ARIMA (0, 2, 1), ARIMA (2, 2, 0), and ARIMA (1, 2, 1) models to study and predict the accumulative confirmed cases, recoveries, and deaths of COVID-19, respectively, for the upcoming month in Pakistan19. Ceylan established the ARIMA (0, 2, 1), ARIMA (1, 2, 0), and ARIMA (0, 2, 1) models to forecast the total reported cases of COVID-19 in Italy, Spain, and France, respectively7. Even though these ARIMA models have high forecasting accuracy and reliability, the major disadvantage of the ARIMA model is its linear assumption, which makes it difficult to handle the randomness in the target series52. Hence, we proposed a novel data-driven EEMD-ARIMA-NARANN hybrid model to overcome this limitation of the basic model. This data-driven mixture technique shows a strong capacity to improve the forecasting power for the prevalence and mortality data of COVID-19. Its principal advantage is that it helps to identify the preferred hybridization by decomposing the target data into multiple scale levels, so that the underlying trend and random parts are considered simultaneously by different types of models. Given the forecasting superiority of our proposed data-driven hybrid method, this hybrid model may also be useful in nowcasting and forecasting the epidemiological trends of the COVID-19 prevalence and mortality time series in other regions, or those of other infectious diseases44. 
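The differencing step required for nonstationary series, which corresponds to the middle term d in the ARIMA (p, d, q) orders cited above, can be sketched as follows. This is an illustration only and not the fitting procedure used in this study; the helper names and the cumulative counts are hypothetical.

```python
import math


def difference(series, d=1):
    """Apply d rounds of first-order differencing to a series."""
    out = list(series)
    for _ in range(d):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out


def undifference(diffed, heads):
    """Invert differencing; heads[i] is the first value of the series after i rounds."""
    out = list(diffed)
    for head in reversed(heads):
        restored = [head]
        for step in out:
            restored.append(restored[-1] + step)
        out = restored
    return out


# Hypothetical cumulative case counts with an accelerating trend.
cumulative = [10, 14, 22, 34, 50, 70]

first = difference(cumulative, d=1)   # daily increments: [4, 8, 12, 16, 20]
second = difference(cumulative, d=2)  # change in increments: [4, 4, 4, 4]

# A logarithmic transform can be applied before differencing when growth is multiplicative.
log_first = difference([math.log(x) for x in cumulative], d=1)

# Forecasts made on the differenced scale are mapped back by cumulative summation.
restored = undifference(second, heads=[cumulative[0], first[0]])
```

In this toy series the second difference is constant, which is the kind of behavior that motivates a differencing order of 2 for accumulative counts.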
Of note, recent studies have found that some other forecasting tools (e.g., the new innovations state space modeling framework59, long short-term memory neural network60, advanced error-trend-seasonal (ETS) framework61, α-Sutte Indicator62, and SBDiEM30) produced highly accurate forecasts of the epidemiological trends of COVID-19. To further our research, we therefore plan to conduct a comparative study between our proposed EEMD-ARIMA-NARANN hybrid model and those above. The contributions of the current work are several-fold. First, when the MAPE (the most frequently used index for judging predictive performance) is used to measure forecasting accuracy, the hybrid model improves accuracy by at least 14.321% and at most 40.488% compared with the ARIMA model, and by at least 22.545% and at most 59.766% compared with the NARANN model. Second, this work builds a new data-driven integrated system in a more reasonable way than the conventional mixture pattern. Third, this new data-driven hybrid model may be generalized to estimate the epidemic patterns in other regions seriously affected by the COVID-19 outbreak.
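One common way to express an accuracy gain of this kind is the relative reduction in MAPE with respect to a baseline model. The sketch below assumes that definition, which this section does not state explicitly; the function name and the input values are hypothetical and are not taken from Table 6.

```python
def relative_reduction(baseline_error, hybrid_error):
    """Percentage by which the hybrid model lowers an error metric relative to a baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - hybrid_error) / baseline_error * 100.0


# Hypothetical MAPE values (as fractions) for a baseline model and a hybrid model.
gain = relative_reduction(baseline_error=0.05, hybrid_error=0.04)
print(f"Relative MAPE reduction: {gain:.3f}%")
```

A positive value means the hybrid model has the lower error; a negative value would mean the baseline performed better on that series.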

Given the outbreak trends of COVID-19 and the situation of the health infrastructure and services in Africa, there is great concern about whether the health system capacity of African regions can duly and effectively meet the demand for medical supplies created by the increasing number of confirmed cases. For this reason, we used our proposed mixture technique to predict the next 15-day confirmed cases and deaths in South Africa and Nigeria. In South Africa in particular, the number of infected individuals has shown an exponential trend since 18 May 2020 (Figs. 2, 7). Even worse, our prediction results indicate that the outbreak may continue to increase rapidly, with an average of around 3465 confirmed cases and 75 deaths per day over the upcoming 15 days in South Africa (Fig. 7A,B, Table S5), and that more time is needed for the morbidity to reach a plateau. Therefore, stricter or additional precautionary measures are required to reduce the rapid spread of COVID-19 (e.g., increasing the number of doctors, pharmacists, medical students, and other health workers who can offer their expertise on the frontlines of the pandemic response, strengthening the management of the overnight curfew to limit social interaction, raising public awareness by strengthening advocacy, issuing more stringent lockdown rules, building more mobile cabin hospitals to treat mild cases, mandating face coverings in public, suspending trans-regional public transportation, suspending or prohibiting tourism across regions, strengthening inspection and quarantine, extending the closure period of public places such as schools, universities, and churches, supporting working from home, prohibiting possible social gatherings, accelerating research on vaccines and clinical treatment programmes, and seeking help from other countries in a position to provide it)12,19,31,60,63. 
Nigeria, which was hit the second hardest by the COVID-19 outbreak, is witnessing a downward trend in COVID-19 prevalence and mortality, with an estimated 590 confirmed cases and 16 deaths per day over the next 15 days (Fig. 7C,D, Table S5). However, strict prophylactic measures still need to be implemented in Nigeria to avoid a rebound of the outbreak.

The findings in this report are subject to some limitations. Firstly, accurate statistics on the prevalence and mortality data in these two study regions are vital for understanding the epidemic patterns of COVID-19 with our proposed data-driven EEMD-ARIMA-NARANN hybrid technique. However, limited nucleic acid detection capacity may result in under-diagnosis or under-reporting in the prevalence and mortality data during the COVID-19 outbreak. Secondly, in developing the NARANN method, there is currently a lack of general guidelines for selecting the number of hidden neurons and delays, so repeated training is required in applications. Thirdly, although this data-driven mixture technique does a good job of estimating the epidemic patterns of COVID-19 in this study, more work is needed to determine whether it can produce highly accurate predictions of the epidemiological trends of COVID-19 in other regions or of other infectious diseases. Fourthly, the forecasting performance of the EEMD-ARIMA-NARANN hybrid technique may be further improved by integrating related factors (e.g., internet search queries, meteorological parameters, air pollution indicators, and policy interventions). This was not investigated in the current work, and further studies that take these COVID-19-related factors into account would be of great interest. Lastly, the forecasting reliability of this data-driven mixture technique may decrease as the forecasting horizon lengthens. Therefore, new real-time data should be integrated into the model to maintain its forecasting accuracy.
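The last point, integrating new real-time data as it arrives, amounts to re-issuing the forecast from an extended history each time an observation is added. A minimal sketch of that loop is given below. The drift forecaster is a deliberately simple stand-in and is not the EEMD-ARIMA-NARANN model; all names and values are hypothetical.

```python
def drift_forecast(history, horizon):
    """Stand-in forecaster: extends the average historical step (random walk with drift)."""
    step = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + step * h for h in range(1, horizon + 1)]


def rolling_update(initial, incoming, horizon, forecaster=drift_forecast):
    """Re-issue a forecast each time a new observation is appended to the history.

    Returns one forecast per state of the history: the first uses only the
    initial data, and each later one includes one more incoming observation.
    """
    history = list(initial)
    forecasts = [forecaster(history, horizon)]
    for observation in incoming:
        history.append(observation)  # integrate the newly reported value
        forecasts.append(forecaster(history, horizon))
    return forecasts


# Hypothetical cumulative counts: three known values, then one newly reported value.
updates = rolling_update(initial=[100, 110, 120], incoming=[135], horizon=2)
for issued_at, forecast in enumerate(updates):
    print(f"forecast after {issued_at} new observation(s): {forecast}")
```

Any forecaster with the same call signature could be substituted for the stand-in, so the update loop is independent of the model being refreshed.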

Conclusions

Insights from time series modeling are invaluable for policymakers in planning effective prevention and control strategies to bring the outbreak under control. In this work, we proposed a new data-driven EEMD-ARIMA-NARANN mixture technique and demonstrated that the predicted values from this mixture model show better consistency with the actual observations than those from the basic ARIMA and NARANN methods. The model can function as a helpful policy-supportive tool for planning and preparing medical supplies effectively, thereby helping to alleviate the outbreak in South Africa and Nigeria over the upcoming days or weeks. It is important to stress that the estimated values may differ from the observed values depending on the strategic preparedness of, and the measures taken by, the governments of these study regions. Also, our proposed hybrid model may be of great help in estimating and forecasting future epidemic trends in other regions severely affected by this crisis.