Background

Emerging and re-emerging infectious diseases pose a continuing threat to human health. In the past decade we have faced global epidemics including SARS-coronavirus, pandemic influenza A(H1N1)pdm09 virus and avian influenza A(H5N1) virus. Most recently, we have witnessed the emergence of influenza A(H7N9) virus in China and of the Middle East respiratory syndrome (MERS) coronavirus in the Middle East and Europe. Appropriate public health responses to infectious disease threats should be based on the best-available evidence, which in turn requires reliable data and appropriate analysis. In particular, risk assessments for A(H7N9) and MERS-coronavirus involve estimation and characterization of transmissibility and clinical severity [1-3].

Provided that the incidence of laboratory-confirmed cases is low, health authorities can collect detailed data on each confirmed case in a ‘line list’. Analysis of this information can provide important insights into the epidemiology of a specific disease. A notable aspect of the recent epidemics of A(H7N9) and MERS-coronavirus is the amount of information about individual cases provided online, through official press releases and various media sources, to a much greater extent than, for example, during the A(H1N1) pandemic in 2009 to 2010 and the Severe Acute Respiratory Syndrome epidemic in 2003.

The influenza A(H7N9) virus emerged in early 2013 in China, and 143 laboratory-confirmed cases had been reported in mainland China by the end of 2013, with the majority of confirmed cases having illness onset during March and April 2013 [4]. The Chinese National Health and Family Planning Commission notified the World Health Organization in late March and joined forces for the prevention and control of the disease, along with other international animal health organizations [5]. Initiatives such as The Global Initiative on Sharing All Influenza Data (GISAID) have provided a framework for the sharing of full sequence data on virus genomes [6]. In the A(H7N9) epidemic, GISAID fostered several studies in early April, such as comparison of the A(H7N9) virus against Eurasian avian influenza viruses [7] and avian influenza A(H7N7) in the Netherlands [8]. There is no similar framework for the sharing of epidemiological data, although a number of unofficial line lists and repositories of epidemiologic data have been created based on publicly available data by automated digital surveillance algorithms or epidemiologists [9]. The objective of the present study was to investigate the extent to which reliable epidemiologic inferences could be made based on publicly available epidemiologic data, compared to the official data collected by Chinese health authorities on laboratory-confirmed cases of influenza A(H7N9) virus infection.

Methods

Ethical approval

It was determined by the Chinese National Health and Family Planning Commission that the collection of data from influenza A(H7N9) cases was part of a continuing public health investigation of an emerging outbreak and was exempt from institutional review board assessment.

Sources of data

A line list with detailed epidemiologic information on each laboratory-confirmed case of influenza A(H7N9) virus infection was constructed by the Chinese Center for Disease Control and Prevention (China CDC). Case definitions, surveillance for identification of A(H7N9) cases and A(H7N9) laboratory assays are described in a previous report [10]. Relevant epidemiological data on A(H7N9) cases were collected through interviews by trained staff. Data used in the present analyses include age, sex, geographic location (city and province), health status on admission, and dates of illness onset, hospital admission, death or discharge, for cases which were officially announced as of 31 May 2013, when the epidemic had stabilized.

In addition to the ‘official’ China CDC line list, we collated five other line lists that were constructed based on publicly available data. The five line lists were created by Harvard Medical School/Boston Children’s Hospital (‘HealthMap’), Virginia Polytechnic Institute and State University (‘Virginia Tech’), Bloomberg News (‘Bloomberg’), the University of Hong Kong School of Public Health (‘HKUSPH’), and FluTrackers [see Additional file 1: data file]. HealthMap is an automated disease surveillance system specializing in real-time geospatial visualization of disease outbreaks [11]. FluTrackers is an online forum which tracks and hosts discussions of a wide range of infectious diseases [12]. The Virginia Bioinformatics Institute at Virginia Tech and HKUSPH were each staffed with a group of epidemiologists with an interest in the modeling of infectious disease epidemics. The Bloomberg news agency collated basic epidemiological data to assist with monitoring of the outbreak. Each line list was compiled based on reports of laboratory-confirmed influenza A(H7N9) cases released by, in order of importance, the national and provincial Ministry of Health websites or microblogs, the World Health Organization, international online disease reporting systems and online Chinese news or blogs [see Additional file 2: Table S1].

Statistical methods

We first conducted descriptive comparisons of the accuracy of individual variables in each line list compared to the China CDC version on various dates. Then we used the line lists available at specific dates to estimate key epidemiologic parameters, including the distributions of time from illness onset to hospitalization, time from illness onset to death, and time from illness onset to discharge. We did not adjust for right-censoring, which would require regular updates on patient status. Finally, we used the line lists available at specific dates to replicate real-time inferences on the hospitalization fatality risk (HFR) and the impact of closure of live poultry markets. We analyzed the line lists starting from 10 April 2013, when the number of confirmed A(H7N9) cases surpassed 30, until 31 May 2013. As the line lists were updated independently on different dates, for comparison purposes the dates of analysis were chosen to match the time of updates for most line lists.

To study inferences on clinical severity, we estimated the HFR [3] at specific calendar dates using two approaches. First, we divided the cumulative number of deaths by the cumulative number of hospitalized cases (HFR1), an approach which is certain to underestimate the hospitalization fatality risk because unresolved cases destined to die are included in the denominator but not the numerator [13, 14]. Second, we divided the cumulative number of deaths by the cumulative number of cases who had either died or been discharged (recovered). This approach (HFR2) should give an accurate real-time estimate of the HFR if the distribution of times from onset to death is similar to the distribution of times from onset to discharge, and the HFR does not change over calendar time [14].
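The two estimators can be illustrated with a minimal Python sketch. The `Case` record, its outcome coding and the example counts below are assumptions introduced for illustration only; they are not taken from any of the line lists, and confidence intervals are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Case:
    """One hospitalized case; `outcome` is None while the case is unresolved."""
    outcome: Optional[str]  # "died", "discharged", or None (hypothetical coding)


def hfr1(cases: List[Case]) -> float:
    """Deaths divided by all hospitalized cases, resolved or not."""
    if not cases:
        raise ValueError("no hospitalized cases")
    deaths = sum(1 for c in cases if c.outcome == "died")
    return deaths / len(cases)


def hfr2(cases: List[Case]) -> float:
    """Deaths divided by cases with a known outcome (died or discharged)."""
    resolved = [c for c in cases if c.outcome in ("died", "discharged")]
    if not resolved:
        raise ValueError("no resolved cases")
    deaths = sum(1 for c in resolved if c.outcome == "died")
    return deaths / len(resolved)


# Invented example: 10 hospitalized cases, 2 deaths, 3 discharges, 5 unresolved.
example = [Case("died")] * 2 + [Case("discharged")] * 3 + [Case(None)] * 5
print(hfr1(example), hfr2(example))
```

In this invented example HFR1 is 2/10 = 0.2 and HFR2 is 2/5 = 0.4. The gap comes entirely from the five unresolved cases, which HFR1 counts in the denominator only.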

To study inferences on transmissibility, we estimated the impact of closure of live poultry markets in Shanghai, Nanjing and Hangzhou using Poisson regression models that compared the incidence rate of confirmed A(H7N9) cases between the first case in each city and the market closure with the incidence rate after closure [15, 16]. We allowed for incubating infections by excluding a two-day ‘washout’ period immediately after market closures, with other washout periods considered in sensitivity analyses. We used multiple imputation with 20 replications for missing dates of illness onset in each dataset, based on the empirical onset-to-reporting distribution [17, 18]. All statistical analyses were conducted using R version 3.0.1 (R Foundation for Statistical Computing, Vienna, Austria).
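The before-and-after comparison can be sketched in Python as follows. For a Poisson model with a single before/after indicator and a log person-time offset, the maximum likelihood rate ratio equals the ratio of the crude rates, so the sketch computes that ratio directly. It is not the R code used in this study: the function name, the convention that the washout starts on the closure date, the end-of-observation date and all onset dates are assumptions made for illustration, and multiple imputation and confidence intervals are omitted.

```python
from datetime import date, timedelta
from typing import List, Tuple


def closure_rate_ratio(onsets: List[date], closure: date, end: date,
                       washout_days: int = 2) -> Tuple[float, float, float]:
    """Return (pre-closure rate, post-closure rate, rate ratio) in cases per day."""
    first = min(onsets)
    # Assumed convention: the washout covers the closure date and the days after it.
    post_start = closure + timedelta(days=washout_days)
    pre_days = (closure - first).days        # first onset up to the day before closure
    post_days = (end - post_start).days + 1  # post-washout period, inclusive of `end`
    if pre_days <= 0 or post_days <= 0:
        raise ValueError("empty observation window")
    pre_cases = sum(1 for d in onsets if first <= d < closure)
    post_cases = sum(1 for d in onsets if post_start <= d <= end)
    if pre_cases == 0:
        raise ValueError("no cases before closure")
    rate_pre = pre_cases / pre_days
    rate_post = post_cases / post_days
    return rate_pre, rate_post, rate_post / rate_pre


# Invented onset dates around a closure on 6 April 2013.
onsets = [date(2013, 3, 27), date(2013, 3, 29), date(2013, 3, 31),
          date(2013, 4, 1), date(2013, 4, 2), date(2013, 4, 3),
          date(2013, 4, 4), date(2013, 4, 5),
          date(2013, 4, 6),    # falls in the washout and is excluded
          date(2013, 4, 10)]
rate_pre, rate_post, ratio = closure_rate_ratio(
    onsets, closure=date(2013, 4, 6), end=date(2013, 4, 27))
print(rate_pre, rate_post, 1 - ratio)
```

With these invented dates there are 8 cases in 10 pre-closure days and 1 case in 20 post-washout days, so the rate ratio is 0.0625, a reduction of about 94%. Changing `washout_days` reproduces the kind of sensitivity analysis described above.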

Results

Age, sex, province and dates of illness onset and death were collected for each influenza A(H7N9) case in all six line lists (Table 1). Current health status was also collected, but only the China CDC, Virginia Tech and FluTrackers line lists had more detailed information on severity. Information was updated daily for China CDC and HealthMap, while other line lists had more frequent updates at the beginning of the epidemic and less frequent updates when the epidemic tapered in early May. FluTrackers also updated their line list daily but was able to retrieve historical archives for the specific dates listed in Table 1. More than 90% of the cases could be matched to the China CDC line list by age, sex, province and date of illness onset [see Additional file 3: Figure S1]. While information on age, sex and province was mostly complete in the different line lists, there were substantial proportions of missing data on dates of hospitalization, discharge and health status. Death and discharge dates, which only became available weeks after illness onset, had a greater proportion of missing information [see Additional file 3: Figure S2]. For matched cases, we found discrepancies in dates of hospitalization, death and discharge when compared with the China CDC line list [see Additional file 3: Figure S3].
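Matching on age, sex, province and date of illness onset can be illustrated with a simple exact-match sketch in Python. The record layout, the key names and the example records are hypothetical, and exact matching is an assumption: the sketch does not handle missing or slightly discrepant values, which an actual matching procedure would need to address.

```python
from collections import Counter
from typing import Dict, List, Tuple

Record = Dict[str, object]


def match_key(rec: Record) -> Tuple[object, object, object, object]:
    """Key used for exact matching (hypothetical field names)."""
    return (rec["age"], rec["sex"], rec["province"], rec["onset"])


def matched_fraction(public: List[Record], official: List[Record]) -> float:
    """Fraction of public records with an exact counterpart in the official list.

    Each official record can be used at most once, so two public records
    cannot both match the same official case.
    """
    if not public:
        raise ValueError("empty public line list")
    available = Counter(match_key(r) for r in official)
    matched = 0
    for rec in public:
        key = match_key(rec)
        if available[key] > 0:
            available[key] -= 1
            matched += 1
    return matched / len(public)


# Invented records for illustration only.
official = [
    {"age": 87, "sex": "M", "province": "Shanghai", "onset": "2013-02-19"},
    {"age": 27, "sex": "M", "province": "Shanghai", "onset": "2013-02-27"},
    {"age": 35, "sex": "F", "province": "Anhui", "onset": "2013-03-15"},
]
public = [
    {"age": 87, "sex": "M", "province": "Shanghai", "onset": "2013-02-19"},
    {"age": 27, "sex": "M", "province": "Shanghai", "onset": "2013-02-27"},
    {"age": 35, "sex": "F", "province": "Anhui", "onset": "2013-03-15"},
    {"age": 45, "sex": "F", "province": "Jiangsu", "onset": "2013-03-19"},
]
print(matched_fraction(public, official))
```

Here three of the four invented public records have an exact counterpart, giving a matched fraction of 0.75.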

Table 1 Summary of epidemiological information collected in each line list

We compared different epidemiological characteristics inferred from the different line lists over time, for all cases irrespective of matching. The numbers of cases reported in the five line lists closely followed the number reported in the China CDC line list, with a time lag of less than one day (Figure 1). The epidemic curves from the HealthMap, HKUSPH, Virginia Tech and FluTrackers line lists also resembled that from the China CDC line list at different time points [see Additional file 3: Figure S4], although some of the onset dates were missing or inaccurate. We fitted a gamma distribution to the times from onset to hospitalization, and Weibull distributions to the times from onset to death and from onset to discharge [4]. The estimated onset-to-hospitalization distributions on 1 May 2013 were generally similar (median ranged from 4.6 to 5.6 days) for all line lists (Figure 1). The HealthMap, HKUSPH and Virginia Tech line lists were able to reflect the longer onset-to-death period for patients staying longer in hospital [see Additional file 3: Figure S5]. Information on discharge dates was only available in the Bloomberg and HKUSPH line lists, and in those datasets the estimated onset-to-discharge distributions were much shorter than the distribution based on the China CDC line list, with more missing discharge dates at the end of April [see Additional file 3: Figures S2 and S5]. We were able to obtain robust estimates for the onset-to-hospitalization distribution from each of the line lists early in the epidemic, but robust estimates of the onset-to-death distribution were not available until early May [see Additional file 2: Table S2].
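As a rough illustration of fitting a gamma distribution to onset-to-hospitalization delays, the Python sketch below uses the method of moments. This is a simplification chosen to keep the example dependency-free; it is not necessarily the fitting method used in this study, and the function name and the delays are invented.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def gamma_method_of_moments(delays: Sequence[float]) -> Tuple[float, float]:
    """Return (shape, scale) of a gamma distribution matched to the sample moments.

    For a gamma distribution, mean = shape * scale and
    variance = shape * scale ** 2, which gives the two formulas below.
    """
    m = mean(delays)
    v = variance(delays)  # sample variance (n - 1 denominator)
    if m <= 0 or v <= 0:
        raise ValueError("delays must be positive and not all identical")
    shape = m * m / v
    scale = v / m
    return shape, scale


# Invented onset-to-hospitalization delays in days.
delays = [2, 3, 4, 5, 6, 7, 8]
shape, scale = gamma_method_of_moments(delays)
print(shape, scale, shape * scale)
```

By construction the fitted mean, `shape * scale`, equals the sample mean of 5 days. A maximum likelihood fit would generally give somewhat different parameter values.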

Figure 1

Epidemiological distributions based on analysis of line lists on 1 May 2013. (A) Number of laboratory-confirmed cases of influenza A(H7N9) virus infection, 10 April to 31 May, 2013. (B) onset-to-hospitalization distribution. (C) onset-to-death distribution. (D) onset-to-discharge distribution. Date of analysis refers to US local time for HealthMap, Virginia Tech and FluTrackers line lists, and China local time for China CDC, Bloomberg and HKUSPH line lists. China CDC, Chinese Center for Disease Control and Prevention; HKUSPH, the University of Hong Kong School of Public Health.

Figure 2 shows the estimated hospitalization fatality risk under the two different approaches. HFR1 estimates were consistently around 20% before May for all line lists and approached 35% afterwards. The five line lists consistently underestimated HFR1, although the 95% confidence intervals covered the true estimate. As of 31 May, there were 18 patients with unresolved outcomes, including 16 patients in severe condition. The estimation of HFR2 required more detailed information (discharge status) and was only available for the China CDC and Bloomberg line lists. HFR2 decreased over time and stabilized at around 30% to 40% in early May. The Bloomberg estimates tended to be higher than the China CDC HFR2, with increasingly large discrepancies over time. Only the HealthMap and FluTrackers line lists were able to provide more robust estimates of the fatality risk for hospitalized cases near the end of the study [see Additional file 2: Table S2].

Figure 2

Estimated hospitalization fatality risks for laboratory-confirmed influenza A(H7N9) cases, 10 April to 31 May, 2013. (A) HFR1 based on the number of deaths divided by the number of confirmed cases. (B) HFR2 based on the number of deaths divided by the number of confirmed cases with known outcome (death or discharge). HealthMap, Virginia Tech and HKUSPH did not routinely collect data on the number of discharged patients. The most recent estimate of the HFR [19] is shown by the gray lines. Vertical lines indicate the 95% confidence intervals. Date of analysis refers to US local time for HealthMap, Virginia Tech and FluTrackers line lists, and China local time for China CDC, Bloomberg and HKUSPH line lists. China CDC, Chinese Center for Disease Control and Prevention; HFR, hospitalization fatality risk; HKUSPH, the University of Hong Kong School of Public Health.

The epidemic curves in Shanghai and Hangzhou were very similar based on the China CDC, HealthMap, Virginia Tech and FluTrackers line lists, in which information on geographic location was available to the city level (Figure 3), although there were some missing onset dates [see Additional file 3: Figure S2]. Live poultry market closures were implemented on 6 April, 8 April and 15 April in Shanghai, Nanjing and Hangzhou, respectively. Except for the FluTrackers line list, in which no onset dates after April were available for Nanjing, market closures in all three cities were consistently estimated to be extremely effective in reducing A(H7N9) incidence rates (Table 2).

Figure 3

Dates of illness onset of influenza A(H7N9) cases in Shanghai, Nanjing and Hangzhou. Dotted lines show the dates of live poultry market closure in each city. Patients with missing onset dates were excluded.

Table 2 Estimated effect of live poultry market closure in Shanghai, Nanjing and Hangzhou

Discussion

We examined which important epidemiological inferences could be drawn from publicly available information compared to official data from China CDC. We demonstrated that analyses mainly based on the reporting of A(H7N9) cases, deaths or their demographics, such as epidemic curves in different regions, estimated onset-to-admission distributions, onset-to-death distributions and impact of poultry market closure can very closely match the results from official data sources with little time-lag. However, estimates of the fatality risk for hospitalized cases were less reliable based on public information, where the estimation requires follow-up of patient status after hospitalization. For example, there was a tendency for online news to highlight the first discharged case in each province but there were fewer reports on subsequent discharged cases. This is the first study to rigorously test the reliability of publicly available data for epidemiological purposes and, although the assessment of clinical severity may be limited, it shows the assessment of transmissibility and geographical dispersion to be reliable. Our results concur with a recent study of information on confirmed cases reported to the World Health Organization in the 2009 influenza pandemic, which also identified difficulties in estimating severity from such datasets [20, 21].

The volume of online information about an epidemic is mostly driven by public interest and concern [22]. For an epidemic of a newly emerging or re-emerging disease, the spread and severity of the disease are of major public concern and, hence, information on case counts and on severe or fatal cases is usually reported in more detail, especially when the cases are associated with a new location. In our study we also found that death dates were more frequently and accurately reported than discharge dates [see Additional file 3: Figure S2]. Information saturation also came into effect as the epidemic progressed [9], which may have resulted in decreasing accuracy and completeness of some variables. This is similar to the second wave of the influenza A(H1N1) pandemic, during which there was disproportionately less media coverage even with a higher number of hospitalizations and deaths in some locations compared to the first wave [23].

In this study we did not attempt to estimate the incubation period, a potentially important epidemiological parameter for the control of disease transmission and for models of disease spread. The Virginia Tech line list did collect information on occupational exposure, but more detailed individual information on poultry exposure was only available in the official line list. There was only limited information on poultry exposure for more severe cases in online news reports. Greater and more consistent detail on the exposure history of individual cases, such as the mode and times of contact, is needed to allow robust analyses of the incubation period [4]. However, in a separate modeling study of the impact of live poultry market closures, we were able to obtain a reasonable estimate of the incubation period for A(H7N9) [15], and similar inference could be possible based on the publicly available line lists.

There are several limitations in this study. First, the human influenza A(H7N9) epidemic in 2013 was mostly confined to the eastern part of China. Public data are likely to be less consistent, in terms of timeliness and accuracy, for diseases spreading across countries with different levels of healthcare resources, culture or local political environments. Secondly, duplicate reports of the same case from different data sources may contain inconsistent epidemiological information. National or international health organizations were regarded as most reliable but there were no well-defined rules for resolving inconsistencies. Thirdly, since current evidence shows that avian-to-human is the major transmission mode for influenza A(H7N9) [15, 24], our analyses may not be directly generalizable to diseases with human-to-human transmission, especially those with such relatively high transmissibility that the scale may overwhelm official health authorities, as in the A(H1N1) pandemic in 2009 to 2010. Monitoring the evolving transmissibility of emerging influenza viruses is crucial, but requires fairly accurate information about the onset of symptoms of the cases in addition to reliable exposure history information, and an understanding of the transmission dynamics among poultry and from poultry to humans. For the line lists using publicly available data this information is very limited, thus hindering quantification of transmissibility in terms of the basic reproduction number. Finally, there are diverse purposes for compiling different line lists. For example, the main purpose of HealthMap is to generate early outbreak notifications and map disease occurrences. Hence, by design that line list placed less emphasis on health status after hospitalization. The goal and methods of data collection can influence the ultimate utility of a line list.

For the specific purpose of epidemiological inference, a minimal dataset with standardized format and definitions [25], along with regular follow-up of patient status, may improve data accuracy, completeness and timeliness over the course of an epidemic. Such an essential dataset may avoid overly demanding requirements on data completeness at the expense of sustainability or accuracy, and help in reaching a consensus on the amount of detail to be disclosed while maintaining appropriate patient confidentiality even in a public health emergency. For the MERS epidemic, the national health authorities of the affected countries have released information at different times and sometimes with very limited resolution [26, 27], which makes it challenging for any epidemiologist to unify all of the information into a single consistent database. In future emerging infectious disease outbreaks, depositing a line list into a database with agreed fields, hosted by a public platform similar to the GISAID approach, and attaching corresponding time stamps and sources to each updated variable may also avoid confusion and improve accuracy.
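One possible way to attach time stamps and sources to each updated variable is sketched below in Python. The class names, field names and example values are hypothetical and do not reproduce the minimal dataset proposed in [25]; the sketch only shows that keeping an update history per variable makes it possible to reconstruct what was known about a case on any given analysis date.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class FieldUpdate:
    """A single reported value of one variable, with its provenance."""
    value: str
    source: str            # e.g. a press release or news report (free text)
    recorded_at: datetime  # when the value was entered into the line list


@dataclass
class CaseRecord:
    """One case, storing the full update history of every variable."""
    case_id: str
    history: Dict[str, List[FieldUpdate]] = field(default_factory=dict)

    def update(self, name: str, value: str, source: str,
               recorded_at: datetime) -> None:
        self.history.setdefault(name, []).append(
            FieldUpdate(value, source, recorded_at))

    def as_of(self, name: str, when: datetime) -> Optional[str]:
        """Latest value of `name` recorded on or before `when`, else None."""
        known = [u for u in self.history.get(name, [])
                 if u.recorded_at <= when]
        if not known:
            return None
        return max(known, key=lambda u: u.recorded_at).value


# Invented case with two status updates from different sources.
case = CaseRecord("case-001")
case.update("status", "hospitalized", "provincial health website",
            datetime(2013, 4, 10, 9, 0))
case.update("status", "discharged", "online news report",
            datetime(2013, 5, 1, 15, 30))
print(case.as_of("status", datetime(2013, 4, 20)))
print(case.as_of("status", datetime(2013, 5, 2)))
```

Storing the history rather than overwriting values also preserves conflicting reports from different sources, so that inconsistencies can be resolved explicitly at a later time.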

Conclusions

In conclusion, we have reported the types of epidemiological inferences that can be reliably drawn from public information, and the major limitations for assessment of the clinical severity of the disease. For the ongoing MERS epidemic and the return of influenza A(H7N9) in winter 2013 to 2014 (more than 200 new cases have been confirmed since October 2013) [28], a well-constructed line list will foster joint efforts for more timely analyses with broader perspectives. Our findings illustrate the increasing potential value of digital epidemiology or infoepidemiology, based on novel sources of information, such as social media, microblogs and mobile phone applications [9, 29]. If publicly available information is sufficient to allow assessment of transmissibility and severity of emerging or re-emerging infections [21, 30], it may even be possible to crowdsource the analytical processes and obtain essential inferences more rapidly [31].