In parallel with the global SARS-CoV-2 pandemic, several data-assimilation practitioners have implemented systems to predict the pandemic’s development. Common to several of these studies is the recursive updating of the model’s state variables using EnKFs or even particle filters. However, the problem is closer to parameter estimation than to state estimation. Evensen et al. (2020) presented a new assimilation process for predicting the SARS-CoV-2 pandemic, using combined parameter and control-variable estimation. The reason is that the effective reproductive number R(t) drives the model, and R(t) is a function of time because public behavior determines its value. Another important aspect is that today’s number of observed hospitalizations and deaths results from people’s behavior about two weeks earlier. In their paper, Evensen et al. (2020) showed that the assimilation system, which used ESMDA to estimate parameters and the time-dependent control R(t), was capable of tracking the pandemic over multiple “waves” and making predictions with realistic uncertainty estimates.

1 An Extended SEIR Model

Evensen et al. (2020) developed an extended SEIR (susceptible, exposed, infectious, and recovered) model (Blackwood & Childs, 2018) for predicting the SARS-CoV-2 pandemic. The model has multiple age classes (since the COVID-19 disease affects different age groups differently) and includes compartments for quarantined, hospitalized, and dead, with additional separation into those with mild, severe, and fatal symptoms.

Figure 22.1 gives an overview of the model where we have stratified the susceptible, exposed, and infectious populations into age groups \({\mathbf {S}}_i\), \({\mathbf {E}}_i\), and \({\mathbf {I}}_i\) following Cao and Zhou (2012). As in the standard SEIR model, the infectious and susceptible interaction leads to the newly exposed. The effective reproductive numbers between different age groups \(R_{ij}\) together with the infection time scale \(\tau _\text {inf}\) determine the rate of new infections. We will discuss the formulation used for \(R_{ij}\) in detail below. Note that the susceptible and infectious interaction constitutes the only source of nonlinearity in the model.
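To make the nonlinear term concrete, the following Python sketch evaluates a rate of newly exposed per age group from \(R_{ij}\), \(\tau _\text {inf}\), and the susceptible and infectious populations. The function name `new_exposures`, the normalization by the total population `n_pop`, and all numerical values are assumptions made for illustration; the text does not state the exact expression used by Evensen et al. (2020).

```python
import numpy as np

def new_exposures(S, I, R, tau_inf, n_pop):
    """Rate of newly exposed per age group (people per day).

    S, I    : arrays of length n_age (susceptible and infectious per group)
    R       : (n_age, n_age) matrix of effective reproductive numbers R_ij
    tau_inf : infection time scale in days
    n_pop   : total population used for normalization (assumed form)
    """
    S = np.asarray(S, dtype=float)
    I = np.asarray(I, dtype=float)
    R = np.asarray(R, dtype=float)
    # Bilinear S-I interaction: the only nonlinear term in the model.
    return S * (R @ I) / (tau_inf * n_pop)

# Illustrative numbers only (not taken from the paper).
S = np.array([4.0e6, 1.0e6])
I = np.array([100.0, 20.0])
R = np.array([[1.2, 0.4],
              [0.4, 0.8]])
rate = new_exposures(S, I, R, tau_inf=3.8, n_pop=5.0e6)
```

In this form the rate is linear in the infectious population for fixed `S`, so doubling `I` doubles the number of newly exposed, while the product of `S` and `I` makes the full system nonlinear.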

The different age groups of infectious \({\mathbf {I}}_i\) transition into the various groups of quarantined sick, \({\mathbf {Q}}_\mathrm {m}\), \({\mathbf {Q}}_\mathrm {s}\), and \({\mathbf {Q}}_\mathrm {f}\), based on the fractions \(p_\mathrm {m}^i\), \(p_\mathrm {s}^i\), \(p_\mathrm {f}^i\), and the infection time scale \(\tau _\text {inf}\). The subscripts m, s, and f refer to mild, severe, and fatal symptoms. The fractions thus refer to patients with mild symptoms, hospitalized patients with severe symptoms, and fatally ill patients, and they specify how the virus affects people in different age groups. In this way, the model includes different probabilities of dying or being hospitalized depending on the age group and accounts for how the SARS-CoV-2 virus affects older people more severely. We have assumed that a patient will not infect anyone while in a quarantined group.

Fig. 22.1 Flow diagram of the SEIR model

The patients with mild symptoms in \({\mathbf {Q}}_\mathrm {m}\) will recover and transition into the group of recovered with mild symptoms \({\mathbf {R}}_\mathrm {m}\), on a time scale \(\tau _\text {recm}\), without going to the hospital. Severely sick patients in \({\mathbf {Q}}_\mathrm {s}\) transfer to the hospital compartment \({\mathbf {H}}_\mathrm {s}\), on a time scale \(\tau _\text {hosp}\). After that, they recover on a time scale \(\tau _\text {recs}\) into the compartment of patients recovered from severe disease \({\mathbf {R}}_\mathrm {s}\).

The model admits the fatally-ill patients in \({\mathbf {Q}}_\mathrm {f}\) to a hospital \({\mathbf {H}}_\mathrm {f}\) on the time scale \(\tau _\text {hosp}\). However, we also allow for a fraction of fatally-ill patients not admitted to a hospital, and we model them in \({\mathbf {C}}_\mathrm {f}\). The purpose of the \({\mathbf {C}}_\mathrm {f}\) variable is to include the fatally-ill patients not measured as hospitalized. Introducing \({\mathbf {C}}_\mathrm {f}\) allows us to use realistic fractions \(p_\mathrm {f}\) of fatally-ill patients and still condition on the measured hospitalization numbers \({\mathbf {H}}_\mathrm {s}+{\mathbf {H}}_\mathrm {f}\). This partition of the fatally-ill patients was essential for most cases discussed in the paper. The fatally ill patients in \({\mathbf {H}}_\mathrm {f}\) and \({\mathbf {C}}_\mathrm {f}\) end up in the group of dead \({\mathbf {D}}\) on a time scale \(\tau _\text {death}\). Later, we added compartments of vaccinated \({\mathbf {V}}_i\), which was essential to match the data after the vaccinations started in 2021.
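The compartment flow described above can be sketched as a small system of ordinary differential equations. The Python code below is a simplified illustration under stated assumptions, not the implementation of Evensen et al. (2020): it uses a single age group, omits the vaccinated compartments, introduces an incubation time scale `tau_inc` and a hospitalized fraction `p_h` of the fatally ill (neither is named in the text), routes \({\mathbf {Q}}_\mathrm {f}\) to \({\mathbf {C}}_\mathrm {f}\) on the time scale \(\tau _\text {hosp}\), integrates with forward Euler, and uses made-up parameter values.

```python
import numpy as np

NAMES = ["S", "E", "I", "Qm", "Qs", "Qf", "Hs", "Hf", "Cf", "Rm", "Rs", "D"]

def rhs(x, R_t, p, n_pop):
    """Time derivative of all compartments for one age group."""
    S, E, I, Qm, Qs, Qf, Hs, Hf, Cf, Rm, Rs, D = x
    new_exp = R_t * S * I / (p["tau_inf"] * n_pop)  # nonlinear S-I term
    out_I = I / p["tau_inf"]                        # infectious -> quarantined
    return np.array([
        -new_exp,                                                   # S
        new_exp - E / p["tau_inc"],                                 # E
        E / p["tau_inc"] - out_I,                                   # I
        p["p_m"] * out_I - Qm / p["tau_recm"],                      # Qm
        p["p_s"] * out_I - Qs / p["tau_hosp"],                      # Qs
        p["p_f"] * out_I - Qf / p["tau_hosp"],                      # Qf
        Qs / p["tau_hosp"] - Hs / p["tau_recs"],                    # Hs
        p["p_h"] * Qf / p["tau_hosp"] - Hf / p["tau_death"],        # Hf
        (1 - p["p_h"]) * Qf / p["tau_hosp"] - Cf / p["tau_death"],  # Cf
        Qm / p["tau_recm"],                                         # Rm
        Hs / p["tau_recs"],                                         # Rs
        (Hf + Cf) / p["tau_death"],                                 # D
    ])

def simulate(x0, R_of_t, p, n_days, dt=0.1):
    """Forward-Euler integration; returns one state per day."""
    x = np.array(x0, dtype=float)
    n_pop = x.sum()
    steps_per_day = int(round(1.0 / dt))
    traj = [x.copy()]
    for day in range(n_days):
        for k in range(steps_per_day):
            x = x + dt * rhs(x, R_of_t(day + k * dt), p, n_pop)
        traj.append(x.copy())
    return np.array(traj)

# Hypothetical parameter values (time scales in days), for illustration only.
params = dict(tau_inc=5.5, tau_inf=3.8, tau_recm=14.0, tau_recs=5.0,
              tau_hosp=6.0, tau_death=16.0,
              p_m=0.96, p_s=0.03, p_f=0.01, p_h=0.5)

x0 = np.zeros(len(NAMES))
x0[0], x0[1], x0[2] = 5.0e6 - 100.0, 50.0, 50.0   # S, E, I

# Piecewise-constant control: R(t) drops after a "lockdown" at day 30.
traj = simulate(x0, lambda t: 3.0 if t < 30 else 0.6, params, n_days=100)
hospitalized = traj[:, 6] + traj[:, 7]   # H_s + H_f, the measured quantity
deaths = traj[:, 11]
```

Because the fractions `p_m`, `p_s`, and `p_f` sum to one, the right-hand side sums to zero and the total population is conserved. Only `hospitalized` and `deaths` correspond to the observations used for conditioning; the \({\mathbf {C}}_\mathrm {f}\) compartment contributes to deaths without appearing in the hospitalization count.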

Fig. 22.2 Ensemble predictions for Norway including the impact of vaccinations

A challenging property of the model is that the time-dependent effective reproductive number R(t) is the primary driver of its evolution. Furthermore, R(t)’s value about two weeks ago determines today’s number of hospitalizations and deaths. Hence, standard sequential state estimation, Sect. 2.4.2, is not appropriate for this problem, as the model will be strongly biased unless we correct R(t). It turns out that the best approach to solve this problem is to consider it as a combined initial-value-, parameter-, and model-control estimation problem. This application corresponds to the problem definition in Sect. 2.4.6. Thus, we define the assimilation window to include the simulation period from the pandemic’s onset until today. We estimate the initial number of exposed and infectious, i.e., the initial conditions, and all the model’s static time scales. Still, the dominating parameter in the model is the control variable R(t). In the model example below, we also included the impact of vaccinations. Unless we explicitly model the vaccination, it is impossible to fit the model prediction to the observations.

2 Example

In this application, we used ESMDA with \(M=32\) steps and a large ensemble of \(N=5000\) realizations (see the sensitivity experiments below). In Fig. 22.2, we show results from one such experiment for Norway. The upper plot shows the ensemble predictions of current hospitalizations in red and accumulated deaths in green, and the black dots are the observations. The ensemble of effective R(t) controls, shown in blue, drives the time evolution of the pandemic. We note the immediate sharp drop in R(t) in mid-March when Norway shut down, and we estimate R(t) values around 0.4 at the end of March. This reduction of R(t) to below one leads to an efficient decline in hospitalizations and deaths. The pandemic was essentially over at the end of July. Then, during July and August, the government allowed vacation travel outside Norway, and during August and September, many foreign workers came back to Norway. Norway had, at this time, no effective quarantine system, and the virus started spreading again throughout society with several prominent local outbreaks. We see that the estimated R(t) is above one from the second half of July until the end of the year. This increase in R(t) accounts for both the relatively weak restrictions on social contact and the additional imported cases, which we do not model explicitly. The increase in R(t) led to the second wave during November and December, followed by a strict lockdown in January. This new lockdown initially seemed to control the pandemic and reduced the number of cases. However, during March 2021, infections again increased steeply because of higher values of R(t), partly due to the introduction of mutated, more infectious viruses. The vaccinations of the adult population, which started in January, helped control the further pandemic growth.

Fig. 22.3 The plots show the estimated solution of current hospitalized in red and accumulated deaths in green for increasing ensemble sizes of \(N=100\), \(N=1000\), and \(N=5000\). The black dots are the measurements. The left and right columns show results from two random seeds. All the cases used 32 MDA steps

Fig. 22.4 The plots show the estimated effective reproductive number R(t) for increasing ensemble sizes of \(N=100\), \(N=1000\), and \(N=5000\). The left and right columns show results from two random seeds corresponding to the plots in Fig. 22.3

3 Sensitivity to Ensemble Size

We have examined the convergence of ESMDA as a function of the ensemble size. Because ESMDA is a Monte Carlo algorithm, we can continually improve the solution by increasing the ensemble size. However, limited computing power forces a tradeoff between the ensemble size and the number of ESMDA steps. While the number of ESMDA steps impacts the actual convergence of the algorithm towards the right solution, the ensemble size impacts the precision of the statistical estimate of the final solution. We find it most important to first converge to the correct physical solution and then use as large an ensemble as possible to reduce the sampling errors.

Fig. 22.5 The left plots show the estimated solution of hospitalized in red and accumulated deaths in green. The right plots show the corresponding effective reproductive number R(t). From top to bottom, the results are from ESMDA with \(N_\text {MDA}=1\), 2, 4, 8, and 16

Figures 22.3 and 22.4 show the results using various ensemble sizes and two different random seeds. For all cases, we used \(N_\text {MDA}=32\) steps, which is sufficient to ensure a converged ESMDA solution. These plots demonstrate the robustness of ensemble-based assimilation methods. Even using 100 realizations, the posterior predictions are consistent with the data and very similar to the cases with larger ensemble sizes. The random seed has a visible effect on the update, which we see most clearly in the estimated R(t). When using 1000 realizations, there is still a significant difference in the estimated R(t) between the two seeds. For the predictions, however, it is hard to note any dependency on the seed or any difference from the case with 5000 realizations. When we increase the ensemble size to 10,000 realizations, we do not see any difference in the parameter estimates or forecasts. For this reason, Evensen et al. (2020) used 5000 realizations for all the simulations in their paper.
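The dependence of sampling errors on the ensemble size can be illustrated with a toy Monte Carlo experiment, independent of the SEIR model. The sketch below estimates a covariance between two scalar variables from ensembles of size 100, 1000, and 5000 and measures the root-mean-square error over repeated draws. The linear relation `y = 2x + noise`, the function name `cov_rms_error`, and the number of repetitions are assumptions chosen for illustration only.

```python
import numpy as np

def cov_rms_error(n_ens, n_repeat=200, seed=0):
    """RMS error of the sample covariance cov(x, y) for ensemble size n_ens."""
    rng = np.random.default_rng(seed)
    true_cov = 2.0  # cov(x, 2x + e) with var(x) = 1 and e independent of x
    errors = np.empty(n_repeat)
    for r in range(n_repeat):
        x = rng.standard_normal(n_ens)
        y = 2.0 * x + rng.standard_normal(n_ens)
        errors[r] = np.cov(x, y)[0, 1] - true_cov
    return np.sqrt(np.mean(errors**2))

# Ensemble sizes matching those compared in the sensitivity experiment.
rms = {n: cov_rms_error(n) for n in (100, 1000, 5000)}
for n, err in rms.items():
    print(f"N = {n:5d}: RMS error of sample covariance = {err:.4f}")
```

For independent samples, the error of such estimates is expected to decay roughly as \(1/\sqrt{N}\), so moving from 100 to 5000 realizations should reduce it by a factor of about seven.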

4 Sensitivity to MDA Steps

We include a sensitivity experiment to examine the convergence properties of the ESMDA algorithm with respect to the number of MDA steps. We expect that the accuracy of the solution will improve with the number of steps until a certain level where there is nothing more to gain. The required number of steps is, of course, dependent on the model’s nonlinearity. Evensen (2018) examined the convergence of ESMDA for a simple nonlinear scalar case and obtained minimal improvement after 16–32 steps. See also the ESMDA example in Chap. 18. Figure 22.5 presents the posterior solution for deaths and hospitalized, and corresponding estimates of R(t) when using ESMDA with 1, 2, 4, 8, and 16 steps. From visual inspection, it is hard to justify more than 16 steps. In Evensen et al. (2020), the authors decided to use 32 MDA steps and 5000 realizations in all simulations to ensure convergence and reduce sampling errors as far as practical. With this simple model, each experiment required a few minutes of computation on a powerful laptop.
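A minimal sketch of such a step-count experiment for a scalar problem is given below. It uses the standard ESMDA update with uniform inflation coefficients \(\alpha = N_\text {MDA}\), so that the inverse coefficients sum to one. The cubic forward model `g_cubic`, the prior, the observation value, and its error are hypothetical choices for illustration; they are not the scalar case of Evensen (2018) or the SEIR setup of Evensen et al. (2020).

```python
import numpy as np

def esmda(x, g, d, sigma_d, n_steps, rng):
    """ESMDA for a scalar parameter and a scalar observation.

    x       : prior ensemble, shape (n_ens,)
    g       : forward model mapping the ensemble to predicted observations
    d       : observed value
    sigma_d : observation error standard deviation
    n_steps : number of MDA steps
    """
    x = np.array(x, dtype=float)
    alpha = float(n_steps)  # uniform inflation; sum over steps of 1/alpha is 1
    for _ in range(n_steps):
        y = g(x)
        c = np.cov(x, y)    # 2x2 sample covariance of (x, y)
        c_xy, c_yy = c[0, 1], c[1, 1]
        gain = c_xy / (c_yy + alpha * sigma_d**2)
        # Perturb the observation with inflated measurement noise.
        d_pert = d + np.sqrt(alpha) * sigma_d * rng.standard_normal(x.size)
        x = x + gain * (d_pert - y)
    return x

def g_cubic(x):
    """Hypothetical nonlinear forward model used only for this example."""
    return x + 0.2 * x**3

rng = np.random.default_rng(1)
prior = 1.0 + rng.standard_normal(5000)

results = {}
for m in (1, 2, 4, 8, 16):
    post = esmda(prior, g_cubic, d=3.6, sigma_d=0.5, n_steps=m, rng=rng)
    results[m] = (post.mean(), post.std())
    print(f"N_MDA = {m:2d}: mean = {post.mean():.3f}, std = {post.std():.3f}")
```

For a linear forward model, the result is independent of the number of steps up to sampling errors, so any systematic change with `n_steps` in this example reflects the nonlinearity of `g_cubic`.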

5 Summary

From the ESMDA implementation with the SEIR model, we have seen that we can accurately estimate the time-dependent R(t) that ensures an excellent fit to the data up to about two weeks before the final data point. From there on, we need to specify R(t) based on what we know about ongoing and planned lockdowns, the versions of mutated viruses, and the population’s general behavior. There is an apparent predictive skill in the system since every action we take or intervention we implement will only influence the data a couple of weeks into the future. Thus, we only see the impact of a lockdown two weeks later. The sensitivity experiments shed light on the effects of the ensemble size on sampling errors in ensemble methods and provide insight into the convergence properties of ESMDA. This example also illustrates how we can estimate both model parameters and the forcing (controls) to constrain the model to follow the observations. Thus, we have great flexibility in formulating the problem and ensuring physically reliable solutions using data assimilation.