An epidemiological modelling approach for COVID-19 via data assimilation

The global pandemic of the 2019-nCov requires the evaluation of policy interventions to mitigate future social and economic costs of quarantine measures worldwide. We propose an epidemiological model for forecasting and policy evaluation which incorporates new data in real-time through variational data assimilation. We analyze and discuss infection rates in the UK, US and Italy. We furthermore develop a custom compartmental SIR model fit to variables related to the available data of the pandemic, named SITR model, which allows for more granular inference on infection numbers. We compare and discuss model results which conducts updates as new observations become available. A hybrid data assimilation approach is applied to make results robust to initial conditions and measurement errors in the data. We use the model to conduct inference on infection numbers as well as parameters such as the disease transmissibility rate or the rate of recovery. The parameterisation of the model is parsimonious and extendable, allowing for the incorporation of additional data and parameters of interest. This allows for scalability and the extension of the model to other locations or the adaption of novel data sources.


Introduction
The global outbreak of n-Cov2019 and the possibility of severe social and economic costs worldwide requires immediate action on suppresion measures. In order to evaluate the efficacy of past and future policy measures to fight and contain the spread of n-Cov2019, a robust and quantifiable analysis system is required. We propose a methodology for forecasting the spread of n-Cov2019 and show how to estimate latent infection rates, accounting for high uncertainty in observation and model specification, which is done by combining real-time Bayesian updating with epidemiological models.
To show the generalisability of our updating approach we first embed a standard SIR model in our framework and then develop a custom compartmental SIR model which is fit to data related to the spread of the coronavirus worldwide which we name SITR. The SITR model adds an additional compartment for patients under treatment T and allows for more granular inference on the underlying dynamics of the epidemic, separating confirmed cases under treatment with latent unconfirmed cases of Covid19. The models are embedded in a data assimilation framework, a form of recursive Bayesian estimation [1], which conducts model updates when new observations become available. The assimilation scheme lends itself naturally to the problem because the procedure allows the model to dynamically adjust infection rates in real time, while taking into account the uncertainty in the data via the specification of covariance matrices.
The uncertainty quantification and choice of covarainace matrices is being analyzed using a hybrid data assimilation approach, which is applied to make results robust to initial conditions. We use the model to infer the amount of infected people and both, the disease transmissibility rate, as well as the rate of recovery. The time-varying parameter structure of the model allows for the incorporation and analysis of policy action, such as if the shutdown of transportation or closure of schools affect transmissibility.
In line with other researchers, our model estimates indicate that the number of infected people is a number of magnitudes higher than the actual reported number of confirmed reported infections. We also find that compared to static models, updating the parameters in a dynamic fashion leads to an upward correction of the true number of infected people as well as reducing forecasting errors.
We estimate both short term and long term dynamics in Italy, the United States and the United Kingdom, finding a peak of infections in the middle of March and a flattened but sustained number of infected cases in the United States and the United Kingdom. We furthermore analyze the transmissibility rate and find that they decreased after imposed initial lockdown measures, but increased again after a loosening of restrictions end of May. The rest of the paper is structured as follows: Section two discusses related work. Section three and four introduce the dynamic model as well as the SITR compartmental model. Section five discusses how the uncertainty in the data is incorporated in the model. Section six states discusses empirical results and section seven concludes.

Related work
The spread of the novel type of respiratory virus as well as the dramatic economic consequences trying to contain it has led to a rapid engagement of the scientific community, with many different areas of research being explored.
Using compartmental models in epidemiology, authors such as [2,3] and [4] have done the first studies on the size of the outbreak in China. They applied standard SIR models with static parameters to estimate the basic reproductive number and analyze the exponential growth of the virus in Wuhan. The work of [4] in particular combines standard SEIR models with travel data obtained from Tencent and found that epidemic dynamics show exponential patterns in multiple major cities with a lag behind the Wuhan outbreak of about one to two weeks.
First studies using data assimilation for epidemiological modelling have been conducted by other authors such as [5] and [6], which studied the techniques on different cases such as influenza using standard SIR models, although none has considered the issue of the robust covariance estimates as discussed in [7] or [8].
Further studies such as [9] and [10] study time varying parameters in more detail, although only for standard SIR models with no relationship to the current corona virus outbreak. We are the first to conduct a study of the current spread of 2019-nCoV using data assimilation. We furthermore contribute by providing a novel framework which enables the prior computation of covariance matrices, adding robustness to epidemiological assimilation models. Although many compartmental models are available, we base our initial studies on SIR models, since it allows us to verify the dynamics of the assimilation scheme as a robust benchmark and compare it to extensions of the model later on. We will specify the exact model choice and specification in the next section.

The adaptive epidemiological model
We introduce an adaptive epidemiological modelling framework which combines a SIR model whose model parameters are time-varying with data assimilation techniques. We base our model on the most basic compartmental model, which is the SIR. Describing and implementing the assimilation scheme in the basic structure of a SIR model allows us to analyze its initial performance and derive additional modifications later on. Further complexities will be introduced when more granular data is available. The current confirmed cases are mostly symptomatic cases with many asymptomatic cases being unconfirmed due to limited testing capacities.
For Covid19, a SEIR model adding an compartment for exposure is a possible candidate for further extensions. But given the lack of more granular data with all confirmed infection cases being symptomatic and no other data on exposure levels, adding additional parameters with insufficient data can lead to a bias of parameter estimates in the assimilation procedure, and also increases model complexity without adding understanding of the underlying mechanics of the Covid19 infections [11,12]. The lack of meaningful and accurate data which fits the model assumptions for more complex models strongly affects the performance of data assimilation and thus makes a sparse and parsimonious model preferable [1].

The standard SIR model
We start our analysis with a standard SIR model [13], which is a system of three interrelated non-linear ordinary differential equations without an explicit analytical solution. The dynamics of the model are given by: where S denotes the susceptible population size, I the infected people who are not isolated from the population and R the recovered population. The total population is given by N. The parameters and denote the transmission and recover rate of the virus infection. Note that for the outbreak, the susceptible number S is observable, which we label as the population of the country under analysis. The recovered population R denotes the population not infectious anymore and being removed from the population, which for the studied examples is the number of confirmed cases, since confirmed cases are hospitalized and isolated and not infecting the general population anymore.

The adaptive DA-SIR model
Data Assimilation (DA) is a technique to incorporate observations into a theoretical model where uncertainty is quantified [1]. It allows for problems with uneven spatial and temporal data distribution and redundancy to be addressed such that models can ingest information. DA is a vital step in numerical modeling and has become a main component in the development and validation of mathematical models in meteorology, climatology, geophysics, geology and hydrology [14]. Recently, DA is also applied to numerical simulations of geophysical applications, medicine, biological science and finance [15]. Data assimilation can be applied to a variety of problems where an uncertainty quantification has to be included [16] or where latent parameters need to be computed taking into account new observations. The Adaptive DA-SIR model is a model which incorporates data assimilation with a compartmental SIR model. We use DA as an adaptive modelling approach which integrates new observations into our compartmental model to enhance the accuracy of forecasts as well as computing model parameters of interest, in our case and in the SIR model.
The SIR model in equations (1)-(3) can be discretized with respect to the time variable, giving the following equations: For a given time step t and assuming to have observations of the variable R t we denote here with R obs t , the DA problem consists in computing the minimum of the cost function where R pred is a predicted value generated by the SIR model, and where and denote the the background and the observation covariance matrices, representing an estimation of the errors in the data. To estimate the parameter, we minimize Data assimilation is very sensitive to initial conditions and the choices of the covariance matrices, since they quantify uncertainty and determine how much weight is assigned to new observations which are assimilated into the model. Thus their calibration needs to be properly chosen, which we outline in detail in Sect. 5.
The data we use representing S t , I t and R t is given by the official government numbers and is available at [17,18]. The solution of the DA problem in (7) leads to a modified extended Kalman filtering algorithm where an SIR model is used to compute the forward steps, e.g. in the time window [t, t + M] . Where I DA t are the values of I t computed after the assimilation of R obs t as in Eq. 8. To illustrate and put results into perspective, we compare results of our adaptive DA-SIR model with the common SIR model for an example case.
Both models use the same initial conditions given by the observed data. In Fig. 1 we compare model performance of the standard SIR model and show how assimilation of new observations generates updated model dynamics in the DA-SIR model that do differ from standard SIR model predictions Fig. 1 Comparing estimates during early stages of the outbreak of confirmed cases R and unobservable amount of infected people I in Italy, using static and updating parameters by a wide margin, as is illustrated in the figure. Selected values of the graph are available in Table 1 where we compare estimated confirmed cases and latent infection rates for both the static SIR, and dynamic DA-SIR model. The dynamic model given by the solid lines fit the observed values of confirmed cases R t and interpolates the number of infected people I t . The dashed lines represent the standard SIR model and show how not updating the model from the initial conditions leads to overestimation of infectious cases if interpolating according to simple exponential SIR dynamics, with a inferior model fit when comparing the observed confirmed cases denoted in red. This illustrates how without updating the parameters the number of infected people is overestimated and the assimilation of new observations helps to adjust the trajectory of likely infections in the future. Having shown the large difference between static and dynamic SIR models we next introduce a further refined extension of the dynamic assimilation model.

The SITR model
Having illustrated the benefits of embedding the SIR model in a DA framework, we aim to further exploit the available data to do more fine tuned inference. In the previous case of the simple SIR model, both recovered and isolated patients were categorized as R. We revise the SIR model by introducing an intermediate compartment T. Here, T represents the number of people being treated, given by the difference between accumulated confirmed cases and recovered or deceased patients R. Instead of just observing one variable, the number of confirmed cases, we are now observing two variables: the currently confirmed cases being treated T and removed infectious population due to recovery or being deceased R. The model is given by The parameters and denote the recovery and transition rate given by total of incubation and admission days. To extend the model and incorporate information not just of the last timestep, we introduce a model extension which bases model predictions on a sliding window of length , similar to a 4D-VAR approach [1]. For a given time window [t + 1, t + ] and assuming to have observations of the variable T t which we denote here with T obs t , the resulting assimilation scheme is given by  (18), where forecasts and parameter estimates are based on a sliding window over time. Without preconditioning, the algorithm updates the model parameter values with the noise and observation matrices and being fixed hyperparameters. In order to present results which may have major policy implications, correct and robust estimation of initial conditions and hyperparameters is of high importance, we therefore introduce a formalization and preconditioning of the covariance matrices and before applying the assimilation scheme, which is named hybrid data assimilation.

Uncertainty in infection rates
The data that is being observed is highly aggregated and suffers from uncertainty firstly due to human measurement error and secondly due to number of confirmed cases being a noisy subset of the true number of infections. Furthermore different definitions for confirmed cases or the cause of mortalities due to Covid19 in various countries adds noise and uncertainty to the data. The figures in Fig. 2 show example data of confirmed cases and mortalities.
This uncertainty in the data mandates a methodology that incorporates the uncertainty into the model. The DA-SIR The state and observation covariance matrices and determine the weight of new observations when updating the parameters of the model. As is stipulated in the cost function in Eq. 7, the inverse of the covariance matrices are used to weight the terms of the observation and model operator. Thus very noisy data yielding large error covariance matrices will result in less weight of the term containing the model operator, i.e. less certainty is put on the data. Therefore the next section studies different covariance matrix setups to take data uncertainty into account. We include detailed steps for the computation of the covariance matrix in the appendix, where we outline the ensemble variational approach applied to generate robust covariance matrix estimates.

Sensitivity analysis
As we mentioned in Sect. 4.1, the choice of the covariance matrices strongly affect the efficiency and the accuracy of the assimilation approach. As the available data is not accurate enough, in order to justify our estimations, we run a sensitivity analysis to study the impact of our estimated parameters and covariance matrices into the model predictions, using a subset of the data as illustration. To illustrate the hyperparameter sensitivity we compare the number of estimated infected people and we apply a mean root squared forecasting error (MRSFE) metric: where ŷ t,n represents the model prediction, y r t,n the real observation with forecast horizons defined by h = 1 , and 0 = 1 Fig. 2 Number of mortalities and confirmed cases for the United States, Italy and the United Kingdom the starting period of the forecast for n variables. The results are given in Table 2.
The results in the table show that a naive setup using unit covariance matrices leads to detrimental fit of the model with relatively high forecasting errors compared to other covariance combinations, with generally better performance for high values for observation covariance matrix , meaning that prediction accuracy increases when less weight is put on the model dynamics and more weight on the observations. The lowest forecasting error for T is given with a unit observation matrix and an error covariance matrix of 10, showing that the optimal combination of covariance matrix values is non-linear and requires a carefully considered estimation algorithm, as our proposed ensemble approach.
The computation of the covariance matrices is performed via a hybrid-assimilation approach where the covariance matrices are estimated using an ensemble approach. In this approach, the values of the covariance matrices are generated through sampling multiple synthetic trajectories using the SITR which is initialized with different parameters for each draw as well as calculating the empirical residual covariance matrix for the data. The details of the procedure are outlined in the appendix.  Table 2 and show that the dynamics are affected by the choice of the covariance matrices. The updated model assimilates new observations of infected patients and people recovered from the virus. The long run dynamics predict a recent spike in the number of infected people in the United Kingdom. The total number of people being treated in hospitals follows with a small lag and is still growing towards the end of the example set.
The trajectories for patients under treatment are similar but the dynamics for latent infections contain some discrepancies. The left hand side depicts the trajectory of estimated latent infections and patients over time using the ensemble approach with robust covariance matrices, whereas the right hand figure depicts results using a simple unit covariance setting. The left hand side depicts a much smoother and more well-behaved series of treated cases as well as infections, approximating a more reasonable stable growth path, whereas the unit covariance case would yield much more erratic, unrealistic fluctuations in infection numbers over time, with many sudden drops in infection numbers. Since the number of treated patients is observable, we use generated forecasts of treated persons as a forecasting metric to evaluate the model fit. The Tables 3 and 4 give excerpts from the forecasts values of infected and treated patients as well as the corresponding MRSFE for treated patients in hospitals. For the unit covariance Table 3 it is observable how the unrealistic dip in forecasted infections which is visible in the right side plot in Fig. 3 causes a large spike in forecasting errors for treated people beginning of April onwards. Comparing it with a hybrid assimilation approach in table 4 reveals an overall lower number of forecasting error and better fit. The sensitivity of results confirm the need for a more rigorous algorithm of covariance estimates given the few and noisy datapoints.
Comparing the left and right bottom figures of Fig. 4, the transmissibility rate shows less variation over time, which differs from the model without ensembles where strong variation is visible. Both estimates depict the downward trend of transmissibility. The high variability of the unit covariance matrix estimate implies that the transmissibility is affected more easily by external factors which change the dynamics of new infections. Thus the robust model estimates imply that, within the sample period, the transmission rate is more stable and unaffected by changes in observations because a uncertainty weight is given by the covariance matrix estimates. We next proceed to apply out methodology to the full dataset analysing the US, the UK and Italy.

Trend analysis of international data
To illustrate the flexibility of our approach we apply our analysis to an international comparison with additional results for the United States, the United Kingdom as well as Italy.
For the models we focus on the number of infected people extrapolated from the number of confirmed cases and recoveries. The data was obtained from the John Hopkins University Coronavirus Resource Center 1 . We compare data and show test results for different forecast horizons and parameter estimates. Given the confirmed coronavirus cases we infer the amount of infected people and do forecasts to estimate the approximate development of the epidemic. We estimate the model on a sample of daily data until the 28/05/20 and evaluate models based on their fit and infection curves.
Although long-term forecasts are of limited use in fast changing scenarios such as the current pandemic, they nevertheless can provide rough guidance on a potential peak of infection rates. Comparing cases for all three countries Fig. 5 depicts the long-run dynamics of the epidemic in Italy, where according to the model the peak of infections has already occurred in march, with a gradual decrease in infection numbers afterwards. The absolute number of patients under treatment decrease throughout April and May.
Comparing Italian results with the United States and the United Kingdom in Fig. 6 shows that the trajectories of Italy differs from both countries, with the UK and the US showing similar patterns. When infection numbers in Italy already peaked, the number of latent infections depicted in red has not decreased, but has merely stabilized at a high level, whereas the hospitalized cases under treatment are still growing, with a forecasted peak by end of April, which contrasts to the Italian peak in March. The number of infections are increasing within sample, as is the forecasted number of infections. The peak of latent infections stabilses at slightly below half a million cases in the US and 100,000 cases in the UK. For both countries the total number of infections tappers off by the end of September, whereas this already happens by July for the case of Italy.
The different levels of infections are likely due to different inception dates of the pandemic, having started earlier in Italy than the United States, with an eventual peak of the United States not visible yet given the current data sample.
The results indicate that the pandemic has reached a peak in Italy recently, the dynamics for the United Kingdom and especially the United States indicate that no plateau has been reached yet and that the number of infections is likely to increase.

Short term dynamics
The results given by the dynamic SITR model highlight the different phases of development in Italy and the United Kingdom and the United States, as is depicted in Figs. 7, 8 and 9 respectively. The figures depict the predicted latent infection numbers, observed and predicted hospital treatment numbers as well as the number of recoveries. We also provide accompanying tables where we report the MRSFE fit of observed confirmed and treated cases. We first discuss the trajectories of infections and follow up with a discussion on the parameter estimates in the next section.
Results align with the previous analysis where in Fig. 7 it is observable that in Italy the number of latent asymptomatic cases has decreased with the majority of patients being hospitalized with a decreasing trend for both latent infections and hospitalized patients. The pattern of latent infections follows a shaped curve, with the majority of infections already peaked in march and now being on a trajectory exhibiting signs that the pandemic is under control. Table 5 provides additional details for the model performance, showing the number of latent infections and treated patients in the model. We show the forecasting errors when fitting the model in order to be able to compare model fit between all three countries. The forecasting errors depict how the model fits the data, with the estimates of treated patients performing particularly well compared to the UK and US (Tables 6 and 7). At the end of the sample on the 28th of May the number of latent infected patients is slightly above 180,000 and the number of patients under treatment is below 48,000, with the values exemplifying the decrease from the previous month. Overall number of infected but not hospitalized patients increases initially and are starting to trend downward after a peak at the end of March.
The hospitalization numbers are displaying a curve like behaviour, following the infections with a lag. The maximum number of patients under treatment are reached in the middle of April, with a further flattening curve towards the end of the sample. This is in contrasts to the United States and the United Kingdom as is visible in Figs. 9 and 8 where the number of latent infections has been relatively constant and exhibits less of an downward trending behaviour. With the number of hospitalized infections increasing, this is likely due to the early relaxation and less stringent quarantine restriction in the United States compared to Italy. When inspecting both infected and hospitalized compartments, results of the SITR model aligns with policy choices, with a flattening tendency for Italy, contrasting with results for the United States and the United Kingdom which followed later or with less rigorous quarantine restrictions. Table 6 depicts selected entries and forecasting errors for the United Kingdom, where at the end of the sample in May the amount of latent infections is estimated to be around 209,000. Table 7 shows selected values for the United States. Even taking account the higher amount of infection numbers in the United States, the high amount of prediction errors is noticeable, indicating that the data is noisy ad the model fit is not performing as well as in the other two cases.
Given these results the number of infections in the United States and the United Kingdom are likely to keep on growing, especially if the government is not considering the implementation of tighter regulations and quarantine measures. The results furthermore demonstrate how the assimilation framework can be extended to multiple countries and provide robust results given the large uncertainty in infection estimates.

Lockdown effects on transmissibility
To compare the predicted dynamics of our model estimates and to evaluate policies, we extend the analysis and discuss the dynamics of the model parameters. Following the same framework, we analyze the estimated transmissibility rate  over time in all three countries for which we depict weekly averaged results. Italy has seen the earliest surge of Covid19 cases but kept its quarantine measures intact. The US and UK were followed by rising surges with a delay, but have had a looser approach to quarantine measures. As given in Fig. 10, the infection rate reached a high level of 0.3 from March 10th to to around March 20th, followed by a gradual decreasing period with an infection rate bottoming out at a value of 0.2, showing initial successes in lockdown measures, which were enforced from the 9th of March onwards. This shows how the lockdown order has first stabilized the transmissibility rate and later led to a decrease. Towards the end of the sample with increasing relaxation of the lockdown measures an increase in is observable. Variation of the parameters is very high towards the end of the sample, especially compared to examples of the UK and US, being evidence that the model parameter estimates are less clear on a strong increase in infection rates as in the US or the UK.
Overall,the Italian dynamics in Figs. 7 and 10 clearly indicate that the number of undetected latent infections is decreasing, with also the number of known and hospitalized cases having crossed their peak value as well and approaching a very low value in the sample. The infection rates indicate a decrease after the enforcement of quarantine measures, with a increase in transmissibility towards the end of the sample, although the model parameters exhibit strong variation in parameter estimates.
The United States have not reached a similar level as Italy. The trajectories in Fig. 11 illustrate that after a rapid growth and high peak value of transmissibility on the 26th of March, a strong decrease followed, showing the effectiveness of the lockdown measures that were enforced in many states from the 23th of March onwards. After an initial strong decrease of the transmissibility rate after imposing a lockdown, the infection rate has strongly increased again at the end of the sample, with the transmissibility values increasing from a trough of 0.18 on the 19th of April to 0.7 at the end of May. This, together with the trajectories in Fig. 9 show that restriction measures have only lead to initial successes with a stabilization of latent infection cases with the total number of cases still increasing. Especially towards the end of the sample at the end of May the amount of infections is increasing again.
The development of the United Kingdom is very similar to the United States, where initial lockdown measures managed to decrease the number of latent infections, with the initial growth in infection numbers in the middle of March is not as strong as in the Italy or the US. This is explainable when taking into account the parameter estimates given in Fig. 12. The parameter plot visualises this development clearly, where transmissibility has initially decreased from the 4th of April onward and stabilized at a level of 0.2 and steadily increased from the end of April onwards at the end of the sample. The parameter estimates unequivocally indicate a strong increase in infection rates, with the last estimated peaking at around 0.6, although the relative increase from it's lowest value throughout April is not as pronounced as in the US indicating that the increase in transmissibility has not been affected as much by changing policies or behaviour in the population as in the US.
Overall the results indicate that Italy is progressing well in containing the virus, the UK and especially the US are struggling to contain the virus in the medium-term. Especially the parameters indicate worsening developments, where in all three countries transmissibility has decrease initially due to quarantine measures but increased towards the end of the sample, although evidence is less clear and more uncertain for the Italian transmissibility rates.

Conclusion and future work
We introduced a novel epidemiological assimilation scheme to forecast and evaluate the current corona pandemic worldwide with a specific focus on the United States, the United Kingdom and Italy. We combined compartmental models in epidemiology with data assimilation schemes showing the advantage of real-time forecasting and parameter estimation in the current crisis. We discussed the benefits and differences in infection numbers when models are updated on a daily basis compared to static modelling. We then introduced a model extension allowing us to observe patients being treated, and patients being removed from the infectious population, which we labelled SITR. Since models are sensitive to estimates of the covariance matrices, we add a hybrid ensemble approach which allows for robust covariance matrix estimates. We find that in Italy the peak of infections has been reached already, with the number of patients being treated peaking middle of April. The trajectories of the US and UK are less clear, with a likely increase in the medium term, with both countries showing a strong increase in transmissibility rates after an initial decrease due to lockdown measures.
The generalisability of our model allows the addition of different compartments to the model, and also allows for the implementation in a variety of cases and countries, where in our experiments the model gives forecasts and parameter estimates for three different countries. Since this work focused mainly on the methodology of providing a robust recursive Bayesian estimation for the current nCov-2019 outbreak, we propose a further in depth-study of the parameter estimates and an extended comparative study across countries. Future work can add further complexities to the model, such as taking into account different mortality rates due to population age, cultural norms or quality of the healthcare system, providing applicability and robustness of the model for different datasets and scenarios.
We encourage both researchers and policymakers to run similar test results with data from other countries or on a more local level to estimate potential infection rates of outbreaks and the rate of transmission to implement the correct policy measures to contain and mitigate adverse effects of the pandemic. then estimated and used to sample the ensemble members from.
Algorithm 1 describes in detail how ens is computed and ensembles are formed. The full background state matrix, b is split into N groups each of size n × n N . Both, the means as well as the standard deviations of the n rows are estimated and used to generate draws from a multivariate Gaussian distribution to form the ensemble. In order to form ens , for each ensemble member the corresponding mean is estimated and then subtracted, computing the standard deviation. To put results into perspective we discuss the difference between standard assimilation and hybrid approaches by conducting a sensitivity analysis next.