Background

The Covid-19 pandemic that is hitting the world since late 2019 has made evident that having quality data is essential in the decision-making chain, especially in epidemiology but also in many other fields. There is an enormous global concern around this disease, leading the World Health Organization (WHO) to declare public health emergency [1]. Many methodological efforts have been made to deal with misreported Covid-19 data, following ideas introduced in the literature since the late nineties [2,3,4,5,6,7]. These proposals range from the usage of multiplication factors [8] to Markov-based models [9, 10] or spatio-temporal models [11]. Additionally, a new R [12] package able to fitting endemic-epidemic models based on approximative maximum likelihood to underreported count data has been recently published [13]. However, as a large proportion of the cases run asymptomatically [14] and mild symptoms could have been easily confused with those of similar diseases at the beginning of the pandemic, it is reasonable to expect that Covid-19 incidence has been notably underreported. Very recently several approaches based on discrete time series have been proposed [15,16,17] although there is a lack of continuous time series models capable of dealing with misreporting, a characteristic of the Covid-19 data and typically present in infectious diseases modeling. In this sense, a new approach for longitudinal data not accounting for temporal correlations is introduced in [18] and a model capable of dealing with temporal structures using a different approach is presented in [19]. A typical limitation of these kinds of models is the computational effort needed in order to properly estimate the parameters.

Synthetic likelihood is a recent and very powerful alternative for parameter estimation in a simulation based schema when the likelihood is intractable and, conversely, the generation of new observations given the values of the parameters is feasible. The method was introduced in [20] and placed into a Bayesian framework in [21], showing that it could be scaled to high dimensional problems and can be adapted in an easier way than other alternatives like approximate Bayesian computation (ABC). The method takes a vector summary statistic informative about the parameters and assumes it is multivariate normal, estimating the unknown mean and covariance matrix by simulation to obtain an approximate likelihood function of the multivariate normal.

Methods

Auto Regressive Conditional Heteroskedasticity (ARCH) models are a well-known approach to fitting time series data where the variance error is believed to be serially correlated. Consider an unobservable process \({X}_{t}\) following an AutoRegressive (\(AR\left(1\right)\)) model with ARCH(1) errors structure, defined by

$${X}_{t}={\phi }_{0}+{\phi }_{1}\cdot {X}_{t-1}+{Z}_{t},$$
(1)

where \({Z}_{t}^{2}={\alpha }_{0}+{\alpha }_{1}\cdot {Z}_{t-1}^{2}+{\epsilon }_{t},\) being \({\epsilon }_{t}\sim N\left({\mu }_{\epsilon }\left(t\right),{\sigma }_{\epsilon }^{2}\right)\). The process \({X}_{t}\) represents the actual Covid-19 incidence. In our setting, this process \({X}_{t}\) cannot be directly observed, and all we can see is a part of it, expressed as

$${Y}_{t}=\left\{\begin{array}{c}{X}_{t}\text{ with probability }1-\omega \\ q\cdot {X}_{t}\text{ with probability }\omega ,\end{array}\right.$$
(2)

where \(q\) is the overall intensity of misreporting (if \(0<q<1\) the observed process \({Y}_{t}\) would be underreported while if \(q>1\) the observed process \({Y}_{t}\) would be overreported) and \(\omega\) can be interpreted as the overall frequency of misreporting (proportion of misreported observations). To model consistently the spread of the disease, the expectation of the innovations \({\epsilon }_{t}\) is linked to a simplified version of the well-known compartimental Susceptible-Infected-Recovered (SIR) model. At any time \(t\in R\) there are three kinds of individuals: Healthy individuals susceptible to be infected (\(S\left(t\right)\)), infected individuals who are transmitting the disease at a certain speed (\(I\left(t\right)\)) and individuals who have suffered the disease, recovered and cannot be infected again (\(R\left(t\right)\)). As shown in [17], the number of affected individuals at time \(t\), \(A\left(t\right)=I\left(t\right)+R\left(t\right)\) can be approximated by

$$A\left(t\right)=\frac{{M}^{*}\left({\beta }_{0},{\beta }_{1},{\beta }_{2},t\right){A}_{0}{e}^{kt}}{{M}^{*}\left({\beta }_{0},{\beta }_{1},{\beta }_{2},t\right)+{A}_{0}\left({e}^{kt}-1\right)},$$
(3)

where \({M}^{*}\left({\beta }_{0},{\beta }_{1},{\beta }_{2},t\right)={\beta }_{0}+{\beta }_{1}\cdot {C}_{1}\left(t\right)+{\beta }_{2}\cdot {C}_{2}\left(t\right)\), being \({C}_{1}\left(t\right)\) and \({C}_{2}\left(t\right)\) dummy variables indicating if time \(t\) corresponds to a period where a mandatory confinment was implemented by the government and if the number of people with at least one dose of a Covid-19 vaccine in Spain was over 50% respectively. At any time \(t\) the condition \(S\left(t\right)+I\left(t\right)+R\left(t\right)=N\) is fulfilled. The expression (3) allow us to incorporate the behavior of the epidemics in a realistic way, defining \({\mu }_{\epsilon }\left(t\right)=A\left(t\right)-A\left(t-1\right)\), the new affected cases produced at time \(t\).

The Bayesian Synthetic Likelihood (BSL) simulations are based on the described model and the chosen summary statistics are the mean, standard deviation and the three first coefficients of autocorrelation of the observed process. Parameter estimation was carried out by means of the BSL [22, 23] package for R [12]. Taking into account the posterior distribution of the estimated parameters, the most likely unobserved process is reconstructed, resulting in a probability distribution at each time point. The prior of each parameter is set to be uniform on the corresponding feasible region of the parameter space and zero elsewhere.

Results

This section presents the results of the analyses using the proposed methodology over a real data set and they are compared to the most common alternatives. The performance of the method is also studied by means of a comprehensive simulation study, with and without covariates.

The performance and an application of the proposed methodology are studied through a comprehensive simulation study and a real dataset on Covid-19 incidence in Spain on this Section.

Simulation study

Although the estimation method is already known and has been tested before, to the best of our knowledge it has never been used in the context of ARCH time series, and therefore a thorough simulation study has been conducted to ensure that the model behaves as expected, including \(ARCH\left(1\right)\), \(AR\left(1\right)\), \(MA\left(1\right)\) and \(ARMA\left(\mathrm{1,1}\right)\) structures for the hidden process \({X}_{t}\) defined as

$$\begin{array}{c}{X}_{t}={\phi }_{0}+{\phi }_{1}\cdot {X}_{t-1}+{Z}_{t},{Z}_{t}^{2}={\alpha }_{0}+{\alpha }_{1}\cdot {Z}_{t-1}^{2}+{\epsilon }_{t}\text{ (ARCH(1))}\\ {X}_{t}={\phi }_{0}+\alpha \cdot {X}_{t-1}+{\epsilon }_{t}\text{ (AR(1))}\\ {X}_{t}={\phi }_{0}+\theta \cdot {\epsilon }_{t-1}+{\epsilon }_{t}\text{ (MA(1))}\\ {X}_{t}={\phi }_{0}+\alpha \cdot {X}_{t-1}+\theta \cdot {\epsilon }_{t-1}+{\epsilon }_{t}\text{ (ARMA(1, 1))}\end{array}$$
(4)

where \({\epsilon }_{t}\sim N\left({\mu }_{\epsilon }\left(t\right),{\sigma }_{\epsilon }^{2}\right)\).

The values for the parameters \({\phi }_{1}\), \({\alpha }_{0}\), \({\alpha }_{1}\), \(\alpha\), \(\theta\), \(q\) and \(\omega\) ranged from 0.1 to 0.9 for each parameter. Average absolute bias, average interval length (AIL) and average 95% credible interval coverage are shown in Table 1. To summarize model robustness, these values are averaged over all combinations of parameters, considering their prior distribution is a Dirac’s delta with all probability concentrated in the corresponding parameter value.

Table 1 Model performance measures (average absolute bias, average interval length and average coverage) summary based on a simulation study

For each autocorrelation structure and parameters combination, a random sample of size \(n=1000\) has been generated using the R function arima.sim, and the parameters \(m=log\left({M}^{*}\right)\) and \(\beta\) have been fixed to \(0.2\) and \(0.4\) respectively. Several values for these parameters were considered but no substantial differences in the model performance were observed related to the value of these parameters or sample size, besides a poorer coverage for lower sample sizes, as could be expected.

Real incidence of Covid-19 in Spain

The betacoronavirus SARS-CoV-2 has been identified as the causative agent of an unprecedented world-wide outbreak of pneumonia starting in December 2019 in the city of Wuhan (China) [1], named as Covid-19. Considering that many cases run without developing symptoms or just with very mild symptoms, it is reasonable to assume that the incidence of this disease has been underregistered. This work focuses on the weekly Covid-19 incidence registered in Spain in the period (2020/02/23–2022/02/27). It can be seen in Fig. 1 that the registered data (turquoise) reflect only a fraction of the actual incidence (red). The grey area corresponds to 95% probability of the posterior distribution of the weekly number of new cases (the lower and upper limits of this area represent the percentile 2.5% and 97.5% respectively), and the dotted red line corresponds to its median.

Fig. 1
figure 1

Registered and predicted weekly new Covid-19 cases in each Spanish region

In the considered period, the official sources reported 11,056,797 Covid-19 cases in Spain, while the model predicts a total of 21,639,627 cases (only 51.10% of actual cases were reported). This work also revealed that while the frequency of underreporting is extremely high for all regions (values of \(\widehat{\omega }\) over 0.80 in all cases), the intensity of this underreporting is not uniform across the considered regions: Aragón and Ceuta are the CCAAs with highest underreporting intensity (\(\widehat{q}=0.28\)) while Región de Murcia and País Valencià are the regions where the predicted values are closest to the number of reported cases (\(\widehat{q}=0.46\)). Detailed underreporting parameters estimates for each region can be found in Table 2. Although the main impact of the vaccination programmes can be seen in mortality data, the results of this work also showed a significant decrease in the weekly number of cases as well in all CCAA except Euskadi, as can also be seen in Table 2 through the estimates corresponding to parameter \({\beta }_{2}\). Figure 2 represents the predicted and registered Covid-19 weekly incidence globally for Spain.

Table 2 Estimated underreporting parameters and impact of the considered covariates (\({\beta }_{1}\) and \({\beta }_{2}\) are the coefficients for confinement and vaccination respectively) for each Spanish CCAA. Reported values correspond to the median and percentiles 25% and 75% of the corresponding posterior distribution
Fig. 2
figure 2

Registered and predicted weekly new Covid-19 cases globally in Spain

Figure 2 shows the evolution of the registered (turquoise) and predicted (red) weekly number of Covid-19 cases in Spain in the period 2020/02/23–2022/02/27.

As can be seen in Figs. 1 and 2, there are 4 weeks (2021–12-26, 2022–01-02, 2022–01-09 and 2022–01-16) for which the predicted values coincide with those registered in all simulations, so no underreporting is detected these weeks. This behavior might be due to the breakout of a new variant with different characteristics (for instance producing more symptomatic cases and therefore reducing the underreporting) around these dates.

The registered values predicted by the model can also be obtained as \(\widehat{Y_t}=\left(1-\widehat\omega+\widehat\omega\cdot\widehat q\right)\cdot\widehat{X_t}\), and compared to the actual registered values \({Y}_{t}\). That allows computing standard forecasting error measures as Root Mean Squared Error (RMSE) or Mean Absolute Percentage Error (MAPE). Globally, the RMSE was 113,145.4 and MAPE was around 8%, ranging between 4 to 13% across regions. The specific RMSE and MAPE for each region are described in Table S1 in the Supplementary Material.

The global differences in underreporting magnitudes across regions can be represented by the percentage of reported cases in each CCAA (compared to model estimates), as shown in Fig. 3.

Fig. 3
figure 3

Percentage of reported cases in each CCAA

Discussion

Although it is very common in biomedical and epidemiological research to get data from disease registries, there is a concern about their reliability, and there have been some recent efforts to standardize the protocols in order to improve the accuracy of health information registries (see for instance [24, 25]). However, as the Covid-19 pandemic situation has made evident, it is not always possible to implement these recommendations in a proper way.

Another work analyzing the cumulated burden of Covid-19 in Spain [26] estimated that only around 21% of the cases were reported in the period 2020/01/01–2020/06/01, but it should be taken into account that it seems reasonable to assume that the underreporting intensity was higher at the early stages of the pandemic, and therefore a lower overall underreporting is expected in the longer period considered in this work. Additionally, the presented methodology allows for a real time monitoring and not only cumulated over a time period.

Having accurate data is key in order to provide public health decision-makers with reliable information, which can also be used to improve the accuracy of dynamic models aimed to estimate the spread of the disease [27] and to predict its behavior.

Conclusions

The proposed methodology can deal with misreported (over- or under-reported) data in a very natural and straightforward way and is able to reconstruct the most likely hidden process, providing public health decision-makers with a valuable tool in order to predict the evolution of the disease under different scenarios.

Using a flexible approach for the underlying hidden process, such as ARCH time series, are a natural extension to recent developments (see for instance [19]) proposed for fitting underreported time series but restricted to the case when the underlying process has an ARMA structure and allow us to model phenomena presenting more complex behavior like Covid-19 in the long time period considered in the present work.

The analysis of the Spanish Covid-19 data shows that in average only around 51% of the cases in the period 2020/02/23–2022/02/27 were reported, and that there are important differences in the severity of underreporting across the Spanish regions. The impact of the vaccination program can also be assessed, achieving a significant decrease in the Covid-19 incidence in almost all regions after 50% of the population had one dose at least (although these results would probably be notably different if including SARS-CoV-2 immunity-escape variants like BA.4 or BA.5, which are currently predominant in many countries), while the impact of the mandatory lockdown could only be detected by the model in 7 out of 19 regions.

The simulation study shows that the proposed methodology behaves as expected and that the parameters used in the simulations, under different autocorrelation structures, can be recovered, even with severely underreported data.