Background

The novel coronavirus (severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)) that erupted from China is a deadly respiratory pathogen that belongs to the same family as the coronavirus responsible for the 2002–2003 severe acute respiratory syndrome (SARS) outbreaks [1]. Since the first cluster of cases was identified in Wuhan City, China, in December 2019, the novel coronavirus disease 2019 (COVID-19) continues its relentless march around the world as of May 12, 2020 [2]. Nevertheless, China was hit hard by this emerging infectious disease, especially the city of Wuhan in Hubei Province, where the first cluster of severe pneumonia caused by the novel virus was identified. Meanwhile, the cumulative number of laboratory and clinically confirmed cases and deaths in mainland China has reached 82,918 and 4633, respectively, as of May 10, 2020 [3].

Because the morbidity and mortality burden associated with the novel coronavirus has disproportionally affected the city of Wuhan, the center of the epidemic in China, the central government of the People’s Republic of China imposed a lockdown and social distancing measures in this city and surrounding areas starting on January 23, 2020. Indeed, out of the 82,918 COVID-19 cases reported in China, 50,339 cases (60.7%) are from Wuhan City. In terms of the death count, a total of 3869 deaths (83.5%) have been recorded in Wuhan City out of the 4633 deaths reported throughout China. To guide the effectiveness of interventions, it is crucial to gauge the uncertainty relating to key epidemiological parameters characterizing the transmissibility and the severity of the disease. Despite the scarcity of publicly available data, scientists around the world have made strides in estimating the magnitude of the epidemic, the basic reproduction number, and transmission patterns [4, 5]. Moreover, accumulating evidence suggests that a substantial fraction of the infected individuals with the novel coronavirus show little if any symptoms, which suggest the need to reassess the transmission potential of this emerging disease [6]. For this purpose, in this study, we employ statistical methods and publicly available epidemiological datasets to jointly derive estimates of transmissibility and severity associated with the novel coronavirus.

Methods

Epidemiological data

We linked our model to two different datasets. First, the daily series of laboratory-confirmed COVID-19 cases and deaths in Wuhan City were extracted according to the date of symptom onset or reporting date from several sources [3, 7, 8]. Our analysis relies on epidemiological data reported prior to February 11, 2020, because of the change in case definition that was announced on February 12, 2020 [9]. As of February 11, 2020, a total of 19,559 confirmed cases including 820 deaths were reported in Wuhan City. Second, epidemiological data of Japanese evacuees from Wuhan City on board government-chartered flights were obtained from the Japanese government. After arriving in Japan, all of the Japanese evacuees were kept in isolation for about 14 days and examined for infection using polymerase chain reaction (PCR) tests [10]. As of February 11, a total of four flights with the Japanese evacuees left Wuhan City. We collected information on the timing of the evacuee fights that left Wuhan City as well as the number of passengers that tested positive for COVID-19 in order to calibrate our model (Table S1).

Statistical analysis

Using the following integral equation model, we estimate the reproduction number of COVID-19. Here, infected and reported cases are denoted by i and c, respectively.

We connected the daily incidence series with a discrete–time integral equation to describe the epidemic dynamics. Let gs denote the probability mass function of the serial interval, e.g., the time from illness onset in a primary case to illness onset in the secondary case, of length s days, which is given by

$$ {g}_{\mathrm{s}}=G\left(\mathrm{s}\right)-G\left(\mathrm{s}-1\right), $$

For s > 0 where G(.) represents the cumulative distribution function of the gamma distribution. Mathematically, we describe the expected number of new cases with day t, E[c(t)] as follows:

$$ E\left[c(t)\right]=\sum \limits_{s=1}^{\infty}\mathrm{E}\left[c\left(t-s\right)\right]R, $$

where E[c(t)] represents the expected number of new cases with onset day t and where R represents the average number of secondary cases per case.

Subsequently, we also employed the time-dependent variation in R to estimate the impact of enhanced interventions on the reproduction number. This time dependence was modeled by introducing a parameter δ1, which is given by

$$ {\delta}_1=\left\{\begin{array}{ccc}1& & \mathrm{otherwise}\\ {}{\beta}_1& & \mathrm{if}\ t={\mathrm{period}}_1\\ {}{\beta}_2& & \mathrm{if}\ t={\mathrm{period}}_1\end{array}\right. $$

where period1 and period2 represent the corresponding period from January 23 to February 2, 2020, and from February 3 to February 11, 2020, respectively. January 23, 2020, is the date when the central government of the People’s Republic of China imposed a lockdown in Wuhan and other cities in Hubei in an effort to quarantine the epicenter of the coronavirus (COVID-19) to mitigate transmission [11]. Furthermore, we evenly divide the interval into two periods to incorporate the time-dependent effects on R using the parameters β1 and β2 which scale the effects of the intervention, taking values smaller than 1.

To account for the probability of occurrence, θ [12], we assume that the number of observed cases on day t, h(t), occurred according to a Bernoulli sampling process, with the expected values E(ct;Ht–1), where E(ct; Ht–1) denotes the conditional expected incidence on day t, given the history of observed data from day 1 to day (t − 1), denoted by Ht−1. Thus, the number of expected newly observed cases is written as follows:

$$ \mathrm{E}\left[h(t);{H}_{t-1}\right]=\Big\{{\displaystyle \begin{array}{ccc}\left(1-\theta \right)+\theta \mathrm{E}\left[c(t);{H}_{t-1}\right],& & \mathrm{if}\ h=0,\\ {}\theta \mathrm{E}\left[c;{H}_{t-1}\right],& & \mathrm{otherwise},\end{array}} $$

Further, we model the time-dependent variation in the reporting probability. This time dependence was modeled by introducing a parameter δ2, which is given by

$$ {\delta}_2=\left\{\begin{array}{ccc}{\alpha}_1,& & \mathrm{if}\ t={\mathrm{period}}_3,\\ {}{\alpha}_2,& & \mathrm{if}\ t={\mathrm{period}}_4,\\ {}1,& & \mathrm{otherwise},\end{array}\right. $$

where period3 and period4 represent the corresponding periods from the start of our study period to January 16 and from January 17 to January 22, respectively, while α1 and α2 scale the extent of the reporting probability (where α1 and α2 are expected to be smaller than 1). We evenly divide the time interval before the lockdown was put in place into two periods in order to incorporate the time dependency of the reporting probability. The number of expected newly observed cases should be updated as

$$ \mathrm{E}\left[h(t);{H}_{t-1}\right]=\Big\{{\displaystyle \begin{array}{ccc}\left(1-\theta \right)+ q\delta \theta \mathrm{E}\left[c(t);{H}_{t-1}\right],& & \mathrm{if}\ {h}_a=0,\\ {} q\delta \theta \mathrm{E}\left[c(t);{H}_{t-1}\right],& & \mathrm{otherwise},\end{array}} $$

We assume the incidence, h(t), is the result of the binomial sampling process with the expectation E[h]. The likelihood function for the time series of observed cases that we employ to estimate the effective reproduction number and other relevant parameters is given by:

$$ {L}_1\left(U;c\right)=\prod \limits_{t=1}^T\left(\begin{array}{c}\mathrm{E}\left(h(t);H\left(t-1\right)\right)\\ {}c(t)\end{array}\right){q}^{c(t)}{\left(1-q\right)}^{\mathrm{E}\left(h(t);H\Big(t-1\right)\Big)-c(t)}, $$

where U indicates parameter sets that are estimated from this likelihood.

Subsequently, the conditional probability of non-infection given residents in Wuhan City at the time point of ti, pti, was assumed to follow a binomial distribution, and the likelihood function is given by:

$$ {L}_2\left({p}_{t_i};{M}_{t_i},{m}_{t_i}\right)=\left(\begin{array}{c}{M}_{t_i}\\ {}{m}_{t_i}\end{array}\right){p_{t_i}}^{m_{t_i}}{\left(1-{p}_{t_i}\right)}^{M_{t_i}-{m}_{t_i}}, $$

where Mti and mti are the number of government-charted flight passengers and non-infected passengers at the date of ti, respectively, and pti is the proportion of the estimated non-infected population in Wuhan at the date of ti, calculated from the h(t) and catchment population in Wuhan City [3, 13].

Serial interval estimates of COVID-19 were derived from previous studies of COVID-19, indicating that it follows a gamma distribution with the mean and SD at 6.0 and 2.9 days, respectively, based on ref. [14, 15]. The maximum value of the serial interval was fixed at 28 days as the cumulative probability distribution of the gamma distribution up to 28 days reaches 1.000.

Infection fatality ratio

Crude CFR and crude IFR are defined as the number of cumulative deaths divided by the number of cumulative cases or infections at a specific point in time without adjusting the time delay from illness onset or hospitalization to death. Next, we employed an integral equation model in order to estimate the real-time IFR. First, we estimated the real-time CFR as described elsewhere [16,17,18]. For the estimation, we employ the delay from hospitalization to death, fs, which is assumed to be given by fs = F(s) − F(s − 1) for s > 0 where H(s) follows a gamma distribution with mean 10.1 days and SD 5.4 days, obtained from the available observed data [19].

$$ {L}_3\left(\pi; {c}_t,{D}_{t_i}\right)=\prod \limits_{t_i}\left(\begin{array}{c}\sum \limits_{t=1}^{t_i}{c}_t\\ {}{D}_{t_i}\end{array}\right),{\left(\pi \frac{\sum \limits_{t=2}^{t_i}\sum \limits_{s=1}^{t-1}{c}_{t-s}{f}_s}{\sum \limits_{t=1}^{t_i}{c}_t}\right)}^{D_{t_i}}{\left(1-\pi \frac{\sum \limits_{t=2}^{t_i}\sum \limits_{s=1}^{t-1}{c}_{t-s}{f}_s}{\sum \limits_{t=1}^{t_i}{c}_t}\right)}^{\sum \limits_{t=1}^{t_i}{c}_t-{D}_{t_i}} $$

where ct represents the number of new cases with reported day t and Dti is the number of new deaths with reported day ti [16,17,18]. We assume that the cumulative number of observed deaths, Dt, is the result of the binomial sampling process with probability π. Subsequently, crude IFR and time–delay adjusted IFR are calculated using the estimated π and ht.

The total likelihood is calculated as L = L1L2L3, and model parameters were estimated using a Monte Carlo Markov Chain (MCMC) method in a Bayesian framework. Posterior distributions of the model parameters were estimated based on sampling from the three Markov chains. For each chain, we drew 100,000 samples from the posterior distribution after a burn-in of 20,000 iterations. Convergence of MCMC chains was evaluated using the potential scale reduction statistic [20, 21]. Estimates and 95% credibility intervals for these estimates are based on the posterior probability distribution of each parameter and based on the samples drawn from the posterior distributions. All statistical analyses were conducted in R version 3.5.2 (R Foundation for Statistical Computing, Vienna, Austria) using the “rstan” package.

Results

The daily series of COVID-19 laboratory-confirmed incidence and cumulative incidence in Wuhan in 2019–2020 are displayed in Fig. 1. Overall, our dynamical models yield a good fit to the temporal dynamics (i.e., incidence, cumulative incidence) including an early exponential growth pattern in Wuhan. In incidence data, a few fluctuations are evident, probably indicating that the surveillance system likely missed many cases during the early transmission phase (Fig. 1).

Fig. 1
figure 1

Observed and posterior estimates of the daily new cases and cumulative cases of the COVID-19 cases in Wuhan, China, 2019–2020. Observed and posterior estimates of laboratory-confirmed reported cases (a) and cumulative reported cases (b) are presented. Observed data are presented in the dot, while the dashed line indicates 50 percentile, and areas surrounded by light gray and deep gray indicate 95% and 50% credible intervals (CrI) for posterior estimates, respectively. Epidemic day 1 corresponds to the day that starts at January 1, 2020

Our posterior estimates of basic reproduction number (R) in Wuhan City, China, in 2019–2020 were estimated to be 3.49 (95% CrI 3.39–3.62). The time-dependent scaling factors quantifying the extent of enhanced public health intervention on R are 0.99 (95% CrI 0.95–1.00), declining R to 3.44 (95% CrI 3.36–3.52) from January 23 to February 1, and 0.24 (95% CrI 0.23–0.26), declining R to 0.84 (95% CrI 0.81–0.88) from February 2 to February 11, 2020. Other parameter estimates for the probability of occurrence and reporting rate are 0.97 (95% CrI 0.84–1.00) and 0.010 (95% CrI 0.007–0.014), respectively. Moreover, the time-dependent scaling factor quantifying the extent of reporting rate, α, is estimated to be 0.07 (95% CrI 0.03–0.18) before January 16 and to be 0.99 (95% CrI 0.96–1.00) from January 17 to January 22.

We conducted sensitivity analyses to examine how varying the mean serial interval between 5.0 and 7.0 days affects our R estimates. R estimates are sensitive to changes in the serial interval, ranging from 2.86 (95% CrI 2.79–2.96) to 4.10 (95% CrI 3.96–4.38).

The total number of estimated laboratory-confirmed cases (i.e., cumulative cases) is 18,967 (95% CrI 16,428–19,680), while the actual number of reported laboratory-confirmed cases during our study period is 19,559 as of February 11, 2020. Moreover, we inferred the total number of COVID-19 infections (Figure S1). Our results indicate that the total number of infections (i.e., cumulative infections) is 1,906,634 (95% CrI 1,373,500–2,651,124).

The observed and posterior estimates of the cumulative number of deaths from COVID-19 in Wuhan are displayed in Fig. 2, and model-based posterior estimates of the cumulative number of deaths are 821 (95% CrI 751–892), while the actual number of reported deaths is 820. The estimated temporal variation in the death risk caused by COVID-19 in Wuhan, China, 2019–2020 is shown in Fig. 3 and Figure S2. The observed and posterior estimates of the crude CFR in Wuhan City is presented in Fig. 2a, while the observed and posterior estimates of time–delay-adjusted CFR is shown in Fig. 2b. Furthermore, Fig. 3a and b illustrates time–delay not-adjusted IFR and time–delay adjusted IFR, respectively.

Fig. 2
figure 2

Observed and posterior estimates of the cumulative deaths of the COVID-19 in Wuhan, China, 2019–2020. Observed and posterior estimates of the cumulative deaths of the COVID-19 in Wuhan are presented. Observed data are presented in the dot, while the dashed line indicates 50 percentile, and areas surrounded by light gray and deep gray indicate 95% and 50% credible intervals (CrI) for posterior estimates, respectively. Epidemic day 1 corresponds to the day that starts at January 1, 2020

Fig. 3
figure 3

Temporal variation of the infection fatality risks caused by COVID-19 in Wuhan, China, 2019–2020. a Posterior estimates of crude infection fatality ratio in Wuhan City. b Posterior estimates of time–delay-adjusted infection fatality ratio in Wuhan City. Black dots show observed data, and light and dark indicate 95% and 50% credible intervals for posterior estimates, respectively. Epidemic day 1 corresponds to the day that starts at January 1, 2020

The latest estimate of the crude CFR and time–delay-adjusted CFR in Wuhan appeared to be 4.3% (95% CrI 3.9–5.0%) and 12.2% (95% CrI 11.4–13.1%), respectively, whereas the latest model-based posterior estimates of time–delay not-adjusted IFR and adjusted IFR, presented in Fig. 3c and d, are 0.04% (95% CrI 0.03–0.06%) and 0.12% (95% CrI 0.08–0.17%), respectively, while the observed crude CFR is calculated to be 4.06% (Table 1).

Table 1 Death risk by COVID-19 in Wuhan City, China, 2020 (as of February 12, 2020)

Discussion

In this study, we derived estimates of the transmissibility and virulence of COVID-19 in Wuhan City, China, by reconstructing the underlying transmission dynamics using multiple data sources. Applying dynamic modeling, the reproduction number, death risks, and probabilities of occurrence and reporting rate were estimated.

Our posterior estimates of basic reproduction number (R) in Wuhan City, China, in 2019–2020 are calculated to be 3.49 (95% CrI 3.39–3.62). The time-dependent scaling factor quantifying the extent of enhanced public health intervention on R is 0.99 (95% CrI 0.95–1.00), declining R to 3.44 (95% CrI 3.36–3.52) from January 23 to February 1, and a scaling factor at 0.24 (95% CrI 0.23–0.26), declining R to 0.84 (95% CrI 0.81–0.88) for February 2 to February 11, 2020. These R estimates capturing the underlying transmission dynamics modify the impact of COVID-19, with the total number of infections (i.e., cumulative infections) estimated at 1,906,634 (95% CrI 1,373,500–2,651,124) in Wuhan City, raising the proportion of infected individuals to 19.1% (95% CrI 13.7–26.5%) with a catchment population in Wuhan City of 10 million people. Our estimates of mean reproduction number reached values of 3.44, an estimate consistent with previous mean estimates in the range 2.2–3.8 derived by fitting epidemic models to the initial growth phase of the observed case incidence [14, 22, 23]. By comparison, the R estimate for the Diamond Princess cruise ship in Japan reached values as high as ~ 11 [24]. Further, these estimates are higher than recent mean R estimates derived from the growth rates of the COVID-19 outbreaks in Singapore (R ~ 1.1) [25] and Korea (R ~ 1.5) [26].

The sustained high R values in Wuhan City even after the lockdown and mobility restrictions suggest that transmission continues inside the household or amplified in healthcare settings [19], which is a landmark of past SARS and MERS outbreaks [27, 28]. Considering the potent transmissibility of COVID-19 in confined settings, as illustrated by COVID-19 outbreaks aboard cruise ships, including the Diamond Princess cruise ship, where the total number of secondary or tertiary infections reached 705 among more than 3700 passengers as of February 28, 2020, and also by the COVID-19 outbreak tied to the Shincheonji religious sect in South Korea where church members appear to have infected from seven to 10 people [29,30,31], it is crucial to prevent transmission in confined settings including hospital-based transmission by strengthening infection control measures as well as transmission stemming from large social gatherings.

Our most recent estimates of the crude CFR and time–delay-adjusted CFR for Wuhan City are at 4.3% (95% CrI 3.9–5.0%) and 12.2% (95% CrI 11.4–13.1%), respectively. In contrast, our most recent crude IFR and time–delay adjusted IFR are estimated to be 0.04% (95% CrI 0.03–0.06%) and 0.12% (95% CrI 0.08–0.17%), respectively, which are substantially smaller than the crude CFR estimated at 4.06%. Our estimates are also lower compared to other recent estimates of the infection fatality ratio at 0.66% (95% CrI 0.39–1.33) and 0.6% (95% CI 0.2–1.3) based on data from China and the Diamond Princess ship, respectively [32, 33]. Several data and methodological differences can explain these differences, which we list in Table S2. For instance, Verity et al. conduct an age adjustment based on the data of age-stratified COVID-19 deaths from mainland China, assume a constant attack rate by age, and adjust for demographic structure. Our IFR estimates will be compared with estimates emerging from ongoing several mass serological studies. Our estimate of time–delay adjusted IFR of 0.12% (95% CrI 0.08–0.17%) falls in the range of mean IFR estimates based on 23 serological studies (0.02 to 0.78% (median 0.25%)) [34] and is consistent with an IFR estimate at 0.16% based on the 452 residents evacuated from Hubei [35], whereas it is smaller than an estimate of 0.35% based on a serological study conducted in Wuhan [36], which is probably slightly overestimated since its target population includes hospitalized patients. Yet, it is important that the external validity of these serological studies, e.g., whether the results can offer a representative picture of the population in the region where they are performed, is closely examined when the results are employed to derive estimates of the infection fatality ratio as discussed elsewhere [34, 36, 37]. Also, these findings indicate that the death risk in Wuhan is estimated to be much higher than those in other areas, which is likely explained by hospital-based transmission [38]. Indeed, past nosocomial outbreaks have been reported to elevate the CFR associated with MERS and SARS outbreaks, where inpatients that tend to be older and affected by underlying diseases have raised the CFR to values as high as 20% for a MERS outbreak [39, 40].

Public health authorities are interested in quantifying both R and CFR to measure the transmission potential and virulence of an infectious disease, especially when emerging/re-emerging epidemics occur in order to decide the intensity of the public health response. In the context of a substantial fraction of unobserved infections due to COVID-19, R estimates derived from the trajectory of infections and the IFR are more realistic indicators compared to estimates derived from observed cases alone [17, 41, 42].

Our analysis also revealed a high probability of occurrence and quite low reporting probabilities in Wuhan City. High probability of occurrence in the above equation suggests that zero observed cases at some point are not due to the absence of those infected, but more likely due to a low reporting rate. A very low reporting probability suggests that it is difficult to diagnose COVID-19 cases or a breakdown in medical care delivery. Moreover, we also identified a remarkable change in the reporting rate, estimated to be 14-fold lower in the 1st period (−Jan 16, 2020) and about the same during the 2nd period (January 17–22, 2020), relative to that estimated after January 23, 2020.

Our results are not free from the limitations. First, our methodology aims to capture the underlying transmission dynamics using multiple data sources. Implementing mass screening in certain populations is a useful approach to ascertain the real proportion of those infected and a way of adding credibility to the estimated values. Second, it is worth noting that the data of Japanese evacuees from Wuhan employed in our analysis is not a random sample from the Wuhan catchment population. Indeed, it is also plausible that their risk of infection in this sample is not as high as that of local residents in Wuhan, underestimating the fatality risk. Third, given the likely under-ascertainment of cases, there may also exist unreported deaths, and this might underestimate the death risk. Fourth, case fatality ratio (CFR) varies with age, gender, presence or absence of comorbidities, race, whether the healthcare system is overloaded or not, and other factors such as poverty risk, infant mortality risk, and the cumulative morbidity ratio [12, 43,44,45,46,47,48]. As CFR is influenced by reporting rate and ascertainment bias, subgroup analysis of IFR based on individual-level data is essential to capture the overall disease burden of COVID-19.

Conclusion

In summary, we have estimated key epidemiological parameters of the transmissibility and virulence of COVID-19 in Wuhan, China, January–February 2020 using an ecological modeling approach and several epidemiological datasets. The power of our approach lies in the ability to infer epidemiological parameters with quantified uncertainty from partial observations collected by surveillance systems.