Introduction

Outbreaks of the COVID-19 pandemic have been causing worldwide socioeconomic and health concerns since December 20191, putting high pressure on healthcare services2. SARS-CoV-2, the causative agent of COVID-19, spreads efficiently1,3 and, consequently, the effectiveness of control measures depends on the relationship between epidemiological variables, human behaviour, and government intervention to control the spread of the disease3,4. Different attempts to model the virus outbreaks in many countries have been made, involving mechanistic models5,6 and extensions of susceptible-infected-recovered (SIR) systems7,8,9. Countries have also put together task forces to work with COVID-19 data and study the direct and indirect impact on the population, economy, banking and insurance, and financial markets2,10,11,12. In addition, funding agencies worldwide have issued rapid response calls for projects that can help to deal with this pandemic. However, further investment is still needed to foster priority research involving SARS-CoV-2, so as to establish high-level coordination of essential, policy-relevant, social and mental health science13,14.

This pandemic is associated with high basic reproduction numbers15,16, and the virus spreads with great speed since a significant number of infected individuals remain asymptomatic while still being able to transmit it17. Simulation studies18 show that mitigation strategies alone are unlikely to keep demand within healthcare systems’ capacity limits. Moreover, even if all patients could be treated, around 250,000 deaths would still be expected in the UK and over a million in the US18. A pressing concern here is how to avoid bringing healthcare systems to collapse17,18. Knowing how the outbreak is progressing is crucial to predict whether or when this will happen, and therefore to plan and implement measures that reduce the number of cases so as to avoid it.

Policies for reducing the number of infected people, such as social distancing and movement restrictions, have been put in place in many countries, but for many others, a full lockdown may be very difficult (if not impossible) to implement. This also depends heavily on the country’s political leadership, socio-economic reality, and epidemic stage1,19. In this context, accurate short-term forecasting would prove itself invaluable, especially for systems on the brink of collapse and countries whose governments must consider trade-offs between lockdowns and avoiding full economic catastrophe.

The main problem is that not only is this disease new, but there are also many factors acting in concert, resulting in a seemingly unpredictable outbreak progression. Forecasting with great accuracy under these circumstances is very difficult. Here we propose a new modelling framework, based on a state-space hierarchical model, that can generate forecasts with excellent accuracy for up to seven days ahead. To aid policy making and effective implementation of restrictions or reopening measures, we provide all results as an R Shiny Dashboard, including week-long forecasts for every country in the world whose data is collected and made available by the European Centre for Disease Prevention and Control (ECDC).

Results

Our model displayed excellent predictive performance for short-term forecasting. We validated it by fitting it to the data up to 18-November-2020, after removing the last seven time points (19-November-2020 until 25-November-2020), and compared the forecasted values with the observed ones (Fig. 1A). The model performs very well (Fig. 1B): the concordance correlation coefficient and the Pearson correlation coefficient (a measure of precision) remain higher than 0.8, even for the seventh day ahead. We also observed an accuracy close to 1, suggesting our methodology has the potential to extend the forecast horizon beyond seven days. Interestingly, Spain reported zero cases on the 22nd and 23rd of November, while the forecasts for these days are around 19,950 (\(\approx 10^{4.3}-1\)). In these circumstances, the forecast could be considered a better depiction of reality than the reported observations.

We carried out this same type of validation study using data up to 18-November-2020, 13-May-2020, 6-May-2020, and up to 29-Apr-2020, and the results were very similar to the ones outlined above, although the validation study carried out in November showed better performance than the validation done in May (see Supplementary Materials). Furthermore, even though performance is expected to fall as the number of days ahead increases, there are still many countries where the forecasted daily number of new cases is very close to the observed one.

Figure 1

(A) Logarithm of the observed \(y_{it}\) versus the forecasted daily number of cases \(y^*_{it}\) for each country, for up to seven days ahead, where each day ahead constitutes one panel. The forecasts were obtained from the autoregressive state-space hierarchical negative binomial model, fitted using data up to 18-November-2020. The first day ahead corresponds to 19-November-2020, and the seventh to 25-November-2020. Each dot represents a country, and the sixteen countries shown in Fig. 2 are represented by blue triangles. We add 1 to the values before taking the logarithm. (B) Observed accuracy, concordance correlation coefficient (CCC) and Pearson correlation (r) between observed (\(y_{it}\)) and forecasted (\(y^*_{it}\)) values for each of the days ahead of 18-November-2020.

The autoregressive component in the model directly relates to the pandemic behaviour over time for each country (see Supplementary Materials). It is directly proportional to the natural logarithm of the daily number of cases, given what happened on the previous day. It is therefore sensitive to changes and can help detect a possible second wave. See, for example, its behaviour for Australia, Iceland and Ireland: it shows that the outbreak is decaying, although it may still take time to subside completely (Fig. 2). In Austria, France, Italy, Japan, Serbia, South Korea, Spain and the UK, however, the outbreak is in the middle of another wave, having peaked in some of those countries or still approaching a peak. Hence, these countries must be very cautious when relaxing restrictions. In Brazil and the United States, the outbreak appears to be taking a long time to subside.

Figure 2

Posterior means of the autoregressive component \(\gamma _{it}\) (solid lines) and associated 95% credible intervals (shaded areas) for each of sixteen selected countries from the pool of 214 countries and territories in the data, from 1-Jan-2020 until 25-Nov-2020.

The estimates for the model parameters suggest that about 12.6% of the reported cases can be viewed as contributing extra variability and possibly consist of outlying observations (see the model estimates in the Supplementary Materials). These may be actual outliers, but this is more likely a feature of the data collection process. In many countries, the data recorded for a particular day actually reflects tests that were done over the previous week or even earlier. This generates aggregated-level data, which is prone to overdispersion; our model accounts for this, but for some observations the variability is greater still, since they reflect a large number of accumulated suspected cases that were then confirmed. There is also large variability between countries regarding their underlying autoregressive processes (see the estimates for the variance components in the Supplementary Materials). This corroborates the view that countries are dealing with the pandemic in different ways and may collect and/or report data differently.

We propose clustering the countries based on the behaviour of their estimated autoregressive parameter over the last 60 days (Fig. 3). This gives governments the opportunity to see which countries have had the most similar recent outbreak behaviour, and to study the measures taken by those countries to help determine policy. We observe, for example, that Spain, the UK and Russia have been experiencing a very similar situation recently; the same can be said of India and the US. Our R Shiny dashboard displays results in terms of forecasts and country clustering. It can be accessed at https://prof-thiagooliveira.shinyapps.io/COVIDForecast/. Through the dashboard, users can choose to highlight a different number of clusters, which may provide other insights.

Figure 3

Dendrogram representing the hierarchical clustering of countries based on their estimated autoregressive parameters \(\hat{\gamma }_{it}\) from 26-Sep-2020 to 25-Nov-2020. The clustering used Ward’s method and pairwise dynamic time warp distances between the countries’ time series. Each of 10 clusters is represented with a different colour. Country abbreviations: BSES = Bonaire, Saint Eustatius and Saba; IC Japan = Cases on an international conveyance - Japan; CAE = Central African Republic; DRC = Democratic Republic of the Congo; NMI = Northern Mariana Islands; SKN = Saint Kitts and Nevis; SVG = Saint Vincent and the Grenadines; STP = São Tomé and Príncipe; TC Islands = Turks and Caicos Islands; UAE = United Arab Emirates.

Discussion

Our modelling framework allows for forecasting the daily number of new COVID-19 cases for each country and territory for which data has been gathered by the ECDC. Its statistical novelty lies in modelling the autoregressive parameter as a function of time, which makes the model highly flexible in adapting to changes in the autoregressive structure of the data over time. In the COVID-19 pandemic, this translates directly into improved predictive power when forecasting future numbers of daily cases. Our objective here is to provide a simple, yet not overly simplistic, framework for country-level decision-making; we understand this might be easier for smaller countries than for nations of continental dimensions, where state-level decision-making should be more effective20. Moreover, the model can be adapted to other types of data, such as the number of deaths, and can also be used to obtain forecasts for smaller regions within a country. A natural extension would be to include an additional hierarchical structure taking into account the nested relationship between cities, states, countries and ultimately continents, while also accommodating spatial autocorrelation. This would allow for capturing the extra variability introduced by aggregating counts over different cities and districts.

We remark that one must be very careful when looking at the forecasted number of cases. These values must not be considered in isolation: it is imperative to look at the entire context and to understand how the data is actually generated21. The model obtains forecasts based on what is being reported, and these will therefore be a direct reflection of the data collection process, be it appropriate or not. When data collection is not trustworthy, model forecasts will not be either. Our estimated proportion of outlying observations, approximately 12.6%, is relatively high. When looking at each country’s time series, we observe that countries with more outlying observations also tend to show poorer predictive performance. This suggests that the data collection process is far from optimal. Therefore, looking at the full context of the data is key to implementing policy based on model forecasts.

As self-criticism, our model may be seen as simplistic in the sense that it relies on the previous observation to predict the next one, and does not incorporate mechanistic biological processes, which may be accommodated by SEIR-type models5,22,23. These models allow for a better understanding of the disease dynamics and are able to provide long-term estimates6. However, as highlighted above, our objective is short-term forecasting, which is already a very difficult task. Long-term forecasting is even more difficult, and we believe it can verge on the speculative when its implications are dealt with pragmatically. A more tangible solution is to combine the short-term forecasting power of our proposed framework with the long-term projections provided by mechanistic models, so as to implement policy that solves pressing problems efficiently while keeping in view its long-run effects on society. We acknowledge all models are wrong, including the one presented here. However, we have shown that it provides excellent forecasts for up to a week ahead, and this may help inform the efficacy of country lockdown and/or reopening policies. This is especially useful when our proposed method is coupled with models that take into account the incubation period, variable lag times associated with disease progression, and the estimation of the impact of public health measures. Therefore, the practical relevance of our proposed method is stronger when linked to complementary long-term information.

Even though the model performs very well, we stress the importance of constantly validating the forecasts, since not only can the underlying process change, but data collection practices can also change (for better or worse). Many countries are still not carrying out enough tests, and hence the true number of infections is under-reported24. This hampers model performance significantly, especially that of biologically realistic models5.

There is a dire need for better quality data21. This is a disease with a very long incubation period (up to 24 days25), with a median varying from 4.5 to 5.8 days26, which makes it even harder to pinpoint exactly when the virus was transmitted from one person to another. Furthermore, around \(97.5\%\) of those who develop symptoms will do so within 8.2 to 15.6 days of infection26. The number of new cases reported today therefore reflects a mixture of infections transmitted at any point over the last few weeks, and the time testing takes also contributes to this. One possible alternative is to model excess deaths compared with averages over previous years27,28. This is an interesting approach, since it can highlight the effects of the pandemic (see, e.g., the insightful visualisations at Our World in Data29 and the Financial Times30). Again, this is highly dependent on the quality of the data collection process, not only for COVID-19-related deaths but also for those arising from other causes. Many teams of data scientists and statisticians worldwide are developing different approaches to work with COVID-19 data3,9,16,31. It is the duty of each country’s government to collect and provide accurate data, so that it can be used to improve healthcare systems and general social wellbeing.

The developed R Shiny Dashboard displays seven-day forecasts for all countries and territories whose data are collected by the ECDC and clustering of countries based on the last 60 days. This can help governments currently implementing or lifting restrictions. It is possible to compare government policies between countries at similar pandemic stages to determine the most effective courses of action. These can then be tailored by each particular government to their own country. The efficiency of measures put in place in each country can also be studied using our modelling framework, since it easily accommodates covariates in the linear predictor. Then, the contribution of these country-specific effects to the overall number of cases can serve as an indicator of how they may be influencing the behaviour of the outbreaks over time.

Government policies are extremely dependent on the reality of each country. It has become clear that there are countries that are well able to withstand a complete lockdown, whereas others cannot cope with the economic downfall19. The issue is not only economic, but also of newly emerging health crises that are not due to COVID-19 lethality alone, but to families relying on day-to-day work to guarantee their food supplies. For these countries, there is a trade-off between avoiding a new evil versus amplifying pre-existing problems or even creating new ones. It is indeed challenging to create a one-size-fits-all plan in such circumstances, which makes it even more vital to strive for better data collection practices.

We hope to contribute to the development of an efficient short-term response to the pandemic for countries whose healthcare systems are at capacity, as well as countries implementing reopening plans. By providing a means of comparing the recent behaviour of the outbreak between countries, we also hope to help open dialogue between countries going through a similar stage, and those which have faced similar situations before.

Methods

Data acquisition

The data was obtained from the European Centre for Disease Prevention and Control (ECDC) up to 14 December 2020 (after this date, the ECDC started to report aggregated weekly data; however, the methodology proposed here works for any other data source collecting daily data), and the code is implemented such that it downloads and processes the data from https://www.ecdc.europa.eu/en/geographical-distribution-2019-ncov-cases. We assumed non-available data to be zero prior to the first case being recorded for each country. Whenever the daily recorded data was negative (reflecting previously confirmed cases being discarded), we replaced that information with a zero. This is the case for only 18 out of 71,904 observations as of the 25th of November 2020.
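These two preprocessing rules can be sketched as follows. This is an illustrative Python sketch (the original pipeline is implemented in R), and the input array is hypothetical, not ECDC data:

```python
import numpy as np

def clean_daily_cases(counts):
    """Apply the two preprocessing rules described above to one
    country's series of daily new cases.

    counts may contain NaN (days before the first recorded case)
    and negative values (previously confirmed cases discarded).
    """
    counts = np.asarray(counts, dtype=float)
    # Non-available data prior to the first recorded case -> zero.
    counts = np.nan_to_num(counts, nan=0.0)
    # Negative daily counts -> zero.
    return np.clip(counts, 0, None)

print(clean_daily_cases([np.nan, np.nan, 3, -2, 5]))  # [0. 0. 3. 0. 5.]
```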

According to the ECDC, the number of cases is based on reports from health authorities worldwide (up to 500 sources), which are screened by epidemiologists for validation prior to being recorded in the ECDC dataset.

Modelling framework

We introduce a class of state-space hierarchical models for overdispersed count time series. Let \(Y_{it}\) be the observed number of newly infected people in country i at time t, with \(i=1,\ldots ,N\) and \(t=1,\ldots ,T\). We model \(Y_{it}\) as a negative binomial first-order Markov process, with

$$\begin{aligned} Y_{it}|Y_{i,t-1}&\sim \text{ NB }(\mu _{it},\psi ) \end{aligned}$$

for \(t=2,\ldots ,T\). The parameterisation used here results in \(\text{ E }(Y_{it}|Y_{i,t-1})=\mu _{it}\) and \(\text{ Var }(Y_{it}|Y_{i,t-1})=\mu _{it}+\mu _{it}^2\psi\). The mean is modelled on the log scale as the sum of an autoregressive component (\(\gamma _{it}\)) and a component that accommodates outliers (\(\Omega _{it}\)), i.e.

$$\begin{aligned} \log \mu _{it}&= \gamma _{it}+\Omega _{it}. \end{aligned}$$
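Under this parameterisation, draws from the observation distribution can be obtained with standard libraries by converting \((\mu _{it}, \psi )\) to the usual size-probability form, \(n = 1/\psi\) and \(p = n/(n+\mu _{it})\). A minimal Python sketch with illustrative values (the model itself is fitted in JAGS), checking the mean-variance relationship:

```python
import numpy as np

def sample_nb(mu, psi, size, rng):
    """Draw from a negative binomial with E(Y) = mu and
    Var(Y) = mu + mu^2 * psi."""
    n = 1.0 / psi            # dispersion (size) parameter
    p = n / (n + mu)         # success probability
    return rng.negative_binomial(n, p, size=size)

rng = np.random.default_rng(42)
mu, psi = 50.0, 0.2
draws = sample_nb(mu, psi, 200_000, rng)

# Implied variance: mu + mu^2 * psi = 50 + 2500 * 0.2 = 550.
print(round(draws.mean(), 1), round(draws.var(), 1))
```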

To accommodate the temporal correlation in the series, the non-stationary autoregressive process \(\left\{ \gamma _{it} \right\}\) is set up as

$$\begin{aligned} \gamma _{it}&= \phi _{it}\gamma _{it-1} + \eta _{it}, \text{ with } \eta _{it} \sim \mathrm {N}\left( 0,\sigma ^2_{\eta }\right) , \end{aligned}$$
(1)

where \(\eta _{it}\) is a Gaussian white noise process with mean 0 and variance \(\sigma _{\eta }^2\). Differently from standard AR(1)-type models, here \(\phi _{it}\) is allowed to vary over time through an orthogonal polynomial linear predictor with time as the covariate, yielding

$$\begin{aligned} \phi _{it}&= \displaystyle \sum _{q=0}^{Q} (\beta _{q}+b_{iq})P_{q}(t), \text{ with } \varvec{b}_{i} \sim \mathrm {N}_Q\left( \mathbf {0}, \Sigma _b\right) \end{aligned}$$
(2)

where \(P_q(\cdot )\) is the function that produces the orthogonal polynomial of order q, with \(P_0(x)=1\) for any real number x; \(\beta _q\) are the regression coefficients and \(\varvec{b}_{i}\) is the vector of random effects, which are assumed to be normally distributed with mean vector \(\mathbf {0}\) and variance-covariance matrix \(\Sigma _b=\mathrm {diag}\left( {\sigma ^2_{b_0},\ldots ,\sigma ^2_{b_Q}}\right)\).

By allowing \(\phi _{it}\) to vary by country and over time, we obtain a more flexible autocorrelation function. Iterating (1) we obtain

$$\begin{aligned} \displaystyle \gamma _{it} = \left( \prod _{k=2}^t\phi _{ik}\right) \gamma _{i1}+\sum _{j=2}^{t-1}\left[ \left( \prod _{k=j+1}^t\phi _{ik}\right) \eta _{ij}\right] +\eta _{it} \end{aligned}$$

for \(t = 3,\ldots ,T\). Note that in the particular case where \(\phi _{it}=\phi _{i}=\beta _{0}+b_{0i}\), then \(\gamma _{it}= \phi _{i}^{t-1}\gamma _{i1}+\phi _{i}^{t-2}\eta _{i2}+\phi _{i}^{t-3}\eta _{i3}+ \cdots + \phi _{i}\eta _{it-1}+\eta _{it}\), which is equivalent to a country-specific AR(1) process. On the other hand, if \(\phi _{it}=\phi _{i}=\beta _{0}\), then \(\gamma _{it}= \phi ^{t-1}\gamma _{i1}+\phi ^{t-2}\eta _{i2}+\phi ^{t-3}\eta _{i3}+ \cdots + \phi \eta _{it-1}+\eta _{it}\), which is equivalent to assuming the same autocorrelation parameter for all countries.
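The recursion (1) and the special cases above can be checked with a small simulation. The quadratic below is an ordinary polynomial in scaled time with illustrative coefficients, standing in for the orthogonal polynomials of (2); none of the values are estimates from the paper:

```python
import numpy as np

def simulate_gamma(phi, gamma1, sigma_eta, rng):
    """Iterate gamma_t = phi_t * gamma_{t-1} + eta_t with
    eta_t ~ N(0, sigma_eta^2); phi[0] is unused."""
    gamma = np.empty(len(phi))
    gamma[0] = gamma1
    for t in range(1, len(phi)):
        gamma[t] = phi[t] * gamma[t - 1] + rng.normal(0.0, sigma_eta)
    return gamma

rng = np.random.default_rng(1)

# Time-varying phi_t from an illustrative quadratic in scaled time.
T = 100
x = np.linspace(-1.0, 1.0, T)
phi = 0.95 + 0.05 * x - 0.10 * x**2
gamma = simulate_gamma(phi, gamma1=2.0, sigma_eta=0.1, rng=rng)

# Special case: constant phi and no noise recovers
# gamma_t = phi^(t-1) * gamma_1.
g = simulate_gamma(np.full(10, 0.9), 2.0, 0.0, rng)
print(np.allclose(g, 2.0 * 0.9 ** np.arange(10)))  # True
```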

Finally, to accommodate extra-variability we introduce the observational-level random effect

$$\begin{aligned} \Omega _{it}&= \lambda _{it}\omega _{it}, \end{aligned}$$

where \(\lambda _{it}\sim \text{ Bernoulli }(\pi )\) and \(\omega _{it}\sim \text{ N }(0,\sigma ^2_{\omega })\). When \(\lambda _{it}=1\), observation \(y_{it}\) is considered an outlier, and the extra variability is modelled by \(\sigma ^2_{\omega }\). This can be seen as a mixture component that models the variance associated with outlying observations.
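Putting the pieces together, the observation layer can be sketched as below. Only \(\pi = 0.126\) echoes the estimate reported in the Results; \(\psi\), \(\sigma _{\omega }\) and the flat state path are illustrative:

```python
import numpy as np

def sample_observations(gamma, psi, pi, sigma_omega, rng):
    """log mu = gamma + lambda * omega, with lambda ~ Bernoulli(pi),
    omega ~ N(0, sigma_omega^2) and y ~ NB(mu, psi)."""
    lam = rng.binomial(1, pi, size=len(gamma))
    omega = rng.normal(0.0, sigma_omega, size=len(gamma))
    mu = np.exp(gamma + lam * omega)
    n = 1.0 / psi
    return rng.negative_binomial(n, n / (n + mu)), lam

rng = np.random.default_rng(7)
gamma = np.full(1000, 4.0)  # flat state path, mu = e^4 (about 55 cases)
y, lam = sample_observations(gamma, psi=0.1, pi=0.126, sigma_omega=1.0, rng=rng)
print(lam.mean())  # close to the outlier proportion 0.126
```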

To forecast future observations \(y_{i,t+1}^*\), we use the median of the posterior distribution of \(Y_{i,t+1}|Y_{it}\). This is reasonable for short-term forecasting, since errors accumulate from one time step to the next. We produce forecasts for up to seven days ahead.

We fitted models considering different values of Q. Even though the results for \(Q=3\) showed that all \(\beta _q\) parameters were different from zero when looking at the 95% credible intervals, we opted for \(Q=2\) for the final model fit, since it improved model convergence and avoided the overfitting associated with a higher polynomial degree, while still providing the extra flexibility introduced by the autoregressive function (2). This may change, however, as data becomes available for a larger number of time steps.

Model validation

We fitted the model without using the observations from the last seven days and obtained the forecasts \(y_{it}^*\) for each country and day. We then compared the forecasts with the true observations \(y_{it}\) for each day ahead by looking initially at the Pearson correlation between them. We also computed the concordance correlation coefficient32,33, an agreement index that lies between \(-1\) and 1, given by

$$\begin{aligned} \rho ^{(CCC)}_t&= 1 -\frac{\text{ E }\left[ \left( Y_{t}^* - Y_{t}\right) ^2\right] }{\sigma _{1}^{2}+\sigma _{2}^{2}+\left( \mu _{1}-\mu _{2}\right) ^2}=\frac{2\sigma _{12}}{\sigma _{1}^{2}+\sigma _{2}^{2}+\left( \mu _{1}-\mu _{2}\right) ^2} \end{aligned}$$

where \(\mu _{1}=\text{ E }\left( Y_{t}^*\right)\), \(\mu _{2}=\text{ E }\left( Y_{t}\right)\), \(\sigma _{1}^{2}= \text{ Var }\left( Y_{t}^*\right)\), \(\sigma _{2}^{2}= \text{ Var }\left( Y_{t}\right)\), and \(\sigma _{12}=\text{ Cov }\left( Y_{t}^*, Y_{t}\right)\). It can be shown that \(\rho ^{(CCC)}_t=\rho _t C_b\), where \(\rho _t\) is the Pearson correlation coefficient (a measure of precision) and \(C_b\) is the bias corrector factor (a measure of accuracy) at the \(t\)-th day ahead. \(\rho _t\) measures how far each observation deviates from the best-fit line, while \(C_b \in [0,1]\) measures how far the best-fit line deviates from the identity line through the origin. It is defined as \(C_{b}=2\left( v+v^{-1}+u^{2}\right) ^{-1}\), where \(v = \sigma _{1}/\sigma _{2}\) is a scale shift and \(u = (\mu _{1} - \mu _{2}) / \sqrt{\sigma _1\sigma _2}\) is a location shift relative to the scale. When \(C_b=1\), there is no deviation from the identity line.
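These quantities can be computed directly from the forecast-observation pairs. A sketch, using population (biased) moment estimates for simplicity:

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient, with its decomposition
    into precision (Pearson r) and accuracy (bias corrector C_b)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    sxy = ((x - mx) * (y - my)).mean()
    rho_c = 2 * sxy / (vx + vy + (mx - my) ** 2)
    r = sxy / np.sqrt(vx * vy)
    v = np.sqrt(vx / vy)                # scale shift (ratio of SDs)
    u = (mx - my) / (vx * vy) ** 0.25   # location shift
    c_b = 2 / (v + 1 / v + u ** 2)
    return rho_c, r, c_b

x = np.array([1.0, 2.0, 3.0, 4.0])
print(ccc(x, x))        # perfect agreement: (1.0, 1.0, 1.0)
print(ccc(x, x + 1.0))  # r stays 1, but C_b (and hence the CCC) drops
```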

Model implementation

The model is estimated using a Bayesian framework, and the prior distributions used are

$$\begin{aligned} \varvec{\beta }&\sim \text{ N}_Q(\mathbf {0}, \mathbf {I}_Q\times 1000) \\ \sigma ^{-2}_{b_q}&\sim \text{ Gamma }(0.001, 0.001) \\ \sigma ^{-2}_{\eta }&\sim \text{ Gamma }(0.001, 0.001) \\ \sigma ^{-2}_{\omega }&\sim \text{ Gamma }(0.001, 0.001) \\ \pi&\sim \text{ Uniform }(0, 1) \end{aligned}$$

We used 3 MCMC chains, 2,000 adaptation iterations, 50,000 as burn-in, and 50,000 iterations per chain with a thinning of 25. We assessed convergence by looking at the trace plots, autocorrelation function plots, and the Gelman-Rubin convergence diagnostic34.
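For reference, a minimal (non-split) version of the Gelman-Rubin diagnostic, which compares between-chain and within-chain variance, can be sketched as:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for an (m, n) array of m
    chains of length n; values near 1 indicate convergence."""
    chains = np.asarray(chains, float)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    var_hat = (n - 1) / n * W + B / n            # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
mixed = rng.normal(size=(3, 5000))               # well-mixed chains
stuck = mixed + np.array([[0.0], [5.0], [10.0]]) # chains stuck apart
print(gelman_rubin(mixed))  # close to 1
print(gelman_rubin(stuck))  # much larger than 1
```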

All analyses were carried out using R software35 and JAGS36. Model fitting takes approximately 14 hours using parallel computing on a Dell Inspiron 17 7000 with a 10th Generation Intel Core i7 processor (1.80GHz \(\times\) 8), 16GB RAM plus 20GB of swap space, running 64-bit Linux Mint 19.2 Cinnamon (kernel 5.2.2-050202-generic).

Clustering

We used the last 60 values of the estimated autoregressive component to perform clustering, so as to obtain sets of countries with similar recent behaviour. First, we computed the dissimilarities between the estimated time series \(\varvec{\hat{\gamma }}_i\) for each pair of countries using the dynamic time warp (DTW) distance37,38. Let M be the set of all possible sequences of m pairs preserving the order of observations, of the form \(r=((\hat{\gamma }_{i1}, \hat{\gamma }_{i^\prime 1}),\ldots ,(\hat{\gamma }_{im}, \hat{\gamma }_{i^\prime m}))\). Dynamic time warping aims to minimise the total distance between the coupled observations \((\hat{\gamma }_{it}, \hat{\gamma }_{i^\prime t})\). The DTW distance may be written as

$$\begin{aligned} d(\varvec{\hat{\gamma }}_i,\varvec{\hat{\gamma }}_{i^\prime })&=\min _{r\in M}\left( \sum _{t=1}^m|\hat{\gamma }_{it}-\hat{\gamma }_{i^\prime t}|\right) . \end{aligned}$$

By using the DTW distance, we are able to recognise similar shapes in time series, even in the presence of shifting and/or scaling38.
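The DTW distance can be computed by dynamic programming; a minimal sketch with \(|\hat{\gamma }_{it}-\hat{\gamma }_{i^\prime t}|\) as the local cost:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance with absolute-difference cost."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest admissible warping path.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

s = [0.0, 1.0, 2.0, 1.0, 0.0]
print(dtw_distance(s, s))  # 0.0: identical series
print(dtw_distance(s, [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]))  # 0.0: warping absorbs the shift
```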

Then, we performed hierarchical clustering on the matrix of DTW distances using Ward’s method, which minimises the variability within clusters39, and obtained ten clusters. Finally, we produced a dendrogram of the clustering results, with each cluster coloured differently to aid visualisation.
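Given the matrix of pairwise DTW distances, the clustering step can be sketched with scipy on a toy distance matrix for five hypothetical countries. Note that Ward's method nominally assumes Euclidean distances, so its use with DTW distances is an approximation:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix: two tight groups, {A, B, C} and {D, E}.
labels = ["A", "B", "C", "D", "E"]
D = np.array([
    [0.0, 0.1, 0.2, 4.0, 4.1],
    [0.1, 0.0, 0.1, 4.2, 4.0],
    [0.2, 0.1, 0.0, 4.1, 4.3],
    [4.0, 4.2, 4.1, 0.0, 0.2],
    [4.1, 4.0, 4.3, 0.2, 0.0],
])

# Ward linkage on the condensed distance matrix, then cut the tree
# into a fixed number of clusters (ten in the paper, two here).
Z = linkage(squareform(D), method="ward")
clusters = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(labels, clusters)))
```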