# Assessing the impact of a health intervention via user-generated Internet content

## Abstract

Assessing the effect of a health-oriented intervention by traditional epidemiological methods is commonly based only on population segments that use healthcare services. Here we introduce a complementary framework for evaluating the impact of a targeted intervention, such as a vaccination campaign against an infectious disease, through a statistical analysis of user-generated content submitted on web platforms. Using supervised learning, we derive a nonlinear regression model for estimating the prevalence of a health event in a population from Internet data. This model is applied to identify control location groups that correlate historically with the areas where a specific intervention campaign has taken place. We then determine the impact of the intervention by inferring a projection of the disease rates that could have emerged in the absence of a campaign. Our case study focuses on the influenza vaccination program that was launched in England during the 2013/14 season, and our observations consist of millions of geo-located search queries to the Bing search engine and posts on Twitter. The impact estimates derived from the application of the proposed statistical framework support conventional assessments of the campaign.

## Keywords

Gaussian Process, Infectious diseases, Intervention, Search query logs, Social media, Supervised learning, User-generated content

## 1 Introduction

Infectious diseases are a major concern for public health and a significant cause of death worldwide (Binder et al. 1999; Morens et al. 2004; Jones et al. 2008). Various health interventions, such as improved sanitation, clean water and immunization programs, assist in reducing the risk of infection (Cohen 2000). To monitor infectious diseases as well as evaluate the impact of control and prevention programs, health organizations have established a number of surveillance systems. Typically, these schemes, apart from requiring an established health system, only cover cases that result in healthcare service utilization. Therefore, they are not always able to capture the prevalence of a disease in the general population, where it is likely to be more common (Reed et al. 2009; Briand et al. 2011).

Recent research efforts have proposed various ways of taking advantage of online information to gain a better understanding of offline, real-world situations. Particular attention has been paid to the modeling of user-generated web content, either in the form of social media text snippets or search engine query logs. Numerous works have provided statistical evidence for the predictive capabilities of these resources, with applications spreading across the domains of finance (Bollen et al. 2011), politics (O’Connor et al. 2010; Lampos et al. 2013) and healthcare (Ginsberg et al. 2009; Lampos and Cristianini 2010; Culotta 2010). Focusing on the domain of health, the development of models for nowcasting infectious diseases, such as influenza-like illness (ILI),^{1} has been a central theme (Milinovich et al. 2014). Initial indications that content from Yahoo’s (Polgreen et al. 2008) or Google’s (Ginsberg et al. 2009) search engine is a good ILI indicator were followed by a series of approaches using the microblogging platform of Twitter as an alternative, publicly available source (Lampos et al. 2010; Signorini et al. 2011; Lamb et al. 2013).

Tracking the prevalence of an infectious disease from Internet activities establishes a complementary and perhaps more sensitive sensor than doctor visits or hospitalizations because it provides access to the bottom of the disease pyramid, i.e., potential cases of infection many of whom may not use the healthcare system. Online data sources do have disadvantages, including noise and ambiguity, and respond not just to changes in disease prevalence, but also to other factors, especially media coverage (Cook et al. 2011; Lazer et al. 2014). Nevertheless, the learning approaches that convert this content to numeric indications about the rate of a disease aim to eliminate most of the aforementioned biases.

Table 1. Areas participating in the LAIV program (*v*) and control areas (*c*) with their respective identifiers, population figures and geographical bounding box coordinates

Areas | id | Population | SW\({}^{\mathrm{a}}\) | NE\({}^{\mathrm{b}}\) |
---|---|---|---|---|
Bury | \(v_1\) | 186,527 | \(-\)2.352, 53.550 | \(-\)2.243, 53.645 |
Cumbria | \(v_2\) | 498,070 | \(-\)3.640, 54.042 | \(-\)2.159, 55.189 |
Gateshead | \(v_3\) | 199,998 | \(-\)1.662, 54.914 | \(-\)1.516, 54.971 |
Leicester City | \(v_{4a}\) | NA\({}^{\mathrm{c}}\) | \(-\)1.216, 52.581 | \(-\)1.046, 52.692 |
East Leicestershire | \(v_{4b}\) | 661,575\({}^{\mathrm{d}}\) | \(-\)0.891, 52.392 | \(-\)0.664, 52.978 |
Rutland | \(v_{4c}\) | 37,606 | \(-\)0.822, 52.525 | \(-\)0.428, 52.760 |
London, Havering | \(v_5\) | 242,080 | 0.138, 51.487 | 0.334, 51.632 |
London, Newham | \(v_6\) | 318,227 | \(-\)0.021, 51.498 | 0.098, 51.564 |
South East Essex | \(v_7\) | 175,798\({}^{\mathrm{e}}\) | 0.487, 51.494 | 1.032, 51.760 |
Brighton | \(c_1\) | 278,112\({}^{\mathrm{f}}\) | \(-\)0.174, 50.807 | \(-\)0.087, 50.870 |
Bristol | \(c_2\) | 437,492 | \(-\)3.118, 51.342 | \(-\)2.510, 51.544 |
Cambridge | \(c_3\) | 126,480 | 0.0774, 52.159 | 0.191, 52.238 |
Exeter | \(c_4\) | 121,800 | \(-\)3.687, 50.566 | \(-\)3.367, 50.886 |
Leeds | \(c_5\) | 761,481 | \(-\)1.800, 53.698 | \(-\)1.290, 53.946 |
Liverpool | \(c_6\) | 470,780 | \(-\)3.019, 53.312 | \(-\)2.818, 53.475 |
Norwich | \(c_7\) | 135,893 | 1.204, 52.555 | 1.541, 52.685 |
Nottingham | \(c_8\) | 310,837 | \(-\)1.247, 52.889 | \(-\)1.086, 53.019 |
Plymouth | \(c_9\) | 259,175 | \(-\)4.303, 50.211 | \(-\)3.983, 50.531 |
Sheffield | \(c_{10}\) | 560,085 | \(-\)1.801, 53.305 | \(-\)1.325, 53.503 |
Southampton | \(c_{11}\) | 242,141 | \(-\)1.564, 50.743 | \(-\)1.244, 51.063 |
York | \(c_{12}\) | 202,433 | \(-\)1.242, 53.799 | \(-\)0.922, 54.119 |

In this work, we extend previous ILI modeling approaches from Internet content and propose a statistical framework for assessing the impact of a health intervention. To validate our methodology, we used the UK’s 2013/14 pilot live attenuated influenza vaccine (LAIV) campaign as a case study. Our experimental setup involved the processing of millions of Twitter postings and Bing search queries geo-located in the target vaccinated locations, as well as in a broader set of control locations across England. Firstly, we assessed the predictive capacity of various text regression models for inferring ILI rates, proposing a nonlinear method for this task based on the framework of Gaussian Processes (Rasmussen and Nickisch 2010). This method reduced the Mean Absolute Error (MAE) of predictions on our data set by more than 22 % compared to linear regularized regression methods such as the elastic-net (Zou and Hastie 2005). Then, we performed a statistical analysis to evaluate the impact of the pilot LAIV program. The extracted impact estimates were in line with Public Health England’s (PHE)^{2} findings (Pebody et al. 2014), providing both supplementary support for the success of the intervention and validatory evidence for our methodology.

## 2 Data sources

We used two user-generated data sources, namely search query logs from Microsoft’s Bing search engine and Twitter data. In the following paragraphs, we describe the process for extracting textual features from queries or tweets, as well as the additional components of the applied experimental process.

### 2.1 Feature extraction

We manually crafted a list of 36 textual markers (or *n*-grams) related to or expressing symptoms of ILI by browsing through related web pages (on Wikipedia or health-oriented websites). Then, using these markers as seeds, we extracted a set of frequent, co-occurring *n*-grams with \(n \le 4\), in a Twitter corpus of approx. 30 million tweets published between February and March 2014 and geo-located in the UK. This expanded the list of markers to a set of \(M = 205\) *n*-grams (see Supplementary Material, Table S1), which formed the feature space in our experimental process. Overall the number of *n*-grams does not reach the quantity explored in previous studies (Ginsberg et al. 2009; Lampos and Cristianini 2012), although this choice was motivated by the fact that a small set of keywords is adequate for achieving a good predictive performance when modeling ILI from user-generated content published online (Culotta 2013).

### 2.2 Geographic areas of interest

We analyzed data that was either geo-located in England as a whole or in specific areas within England. Table 1 lists all the specific locations of interest, dividing them into two categories: the 7 vaccinated areas (\(v_i\)) where the LAIV program was applied, and the selected 12 control areas (\(c_i\)) which represent urban centers in England, with considerable population figures, that were distant from all vaccinated areas, and were spread across the geography of the country, to the extent possible. Each area is specified by a geographical bounding box defined by the longitude and latitude of its South-West and North-East edge points.
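As a rough illustration of how a geo-located post or query could be assigned to one of the areas in Table 1, the sketch below tests a longitude/latitude pair against a bounding box. The coordinates are copied from Table 1; the names `BoundingBox`, `AREAS` and `assign_area` are hypothetical, and the inclusive boundary handling is an assumption rather than a documented detail of the study.

```python
from typing import Dict, NamedTuple, Optional


class BoundingBox(NamedTuple):
    """Bounding box given by its South-West and North-East corner points."""
    sw_lon: float
    sw_lat: float
    ne_lon: float
    ne_lat: float

    def contains(self, lon: float, lat: float) -> bool:
        # Inclusive on all edges (an assumption made for this sketch).
        return (self.sw_lon <= lon <= self.ne_lon
                and self.sw_lat <= lat <= self.ne_lat)


# Two example areas with coordinates taken from Table 1.
AREAS: Dict[str, BoundingBox] = {
    "v1": BoundingBox(-2.352, 53.550, -2.243, 53.645),  # Bury
    "c3": BoundingBox(0.0774, 52.159, 0.191, 52.238),   # Cambridge
}


def assign_area(lon: float, lat: float,
                areas: Dict[str, BoundingBox] = AREAS) -> Optional[str]:
    """Return the id of the first area whose box contains the point, else None."""
    for area_id, box in areas.items():
        if box.contains(lon, lat):
            return area_id
    return None
```

Bounding boxes are only a coarse approximation of administrative borders, so overlapping boxes would need an explicit tie-breaking rule; this sketch simply returns the first match.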

### 2.3 User-generated web content

To perform a more rigorous experimental approach, distinct data sets from two different web sources have been compiled. The first (\(\mathcal {T}\)) consists of all Twitter posts (tweets) with geo-location enabled and pointing to the region of England from 02/05/2011 to 13/04/2014, i.e., 154 weeks in total. The total number of tweets involved is approx. 308 million, whereas the cumulative number of appearances of ILI-related *n*-grams is approx. 2.2 million. The vaccinated and control areas account for 5.8 and 12.6 % of the entire content respectively. The second data set (\(\mathcal {B}\)) consists of search queries on Microsoft’s web search engine, Bing, from 31/12/2012 to 13/04/2014 (67 weeks in total), geo-located in England. This data set has a smaller temporal coverage than the Twitter data due to limitations in acquiring past search query logs. The number of queries in \(\mathcal {B}\) is significantly larger than the number of tweets in \(\mathcal {T}\);^{3} 3.75 % of the queries were geo-located in vaccinated areas, 12.53 % in control areas, and flu-related *n*-grams appeared in approx. 7.7 million queries. For all the considered *n*-grams (Supplementary Material, Table S1) we extracted their weekly frequency in England as well as in the designated areas of interest. We performed a relaxed search, looking for content (tweets or search queries) that contains all the 1-gram blocks of an *n*-gram.
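The relaxed search described above can be sketched as follows: a document matches an *n*-gram when every 1-gram of that *n*-gram appears somewhere in the text, in any order. The tokenizer, the function names and the choice to count matching documents (rather than individual occurrences) are assumptions made for illustration only.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def relaxed_match(ngram: str, text: str) -> bool:
    """True if every 1-gram of `ngram` occurs in `text`, regardless of order."""
    tokens = set(tokenize(text))
    return all(word in tokens for word in tokenize(ngram))


def weekly_counts(docs: Iterable[Tuple[str, str]],
                  ngrams: List[str]) -> Dict[str, Counter]:
    """Count, per week label, the documents matching each n-gram.

    `docs` yields (week_label, text) pairs; the week labels are hypothetical.
    """
    counts: Dict[str, Counter] = {}
    for week, text in docs:
        week_counter = counts.setdefault(week, Counter())
        for ngram in ngrams:
            if relaxed_match(ngram, text):
                week_counter[ngram] += 1
    return counts
```

For instance, `relaxed_match("flu symptoms", "My symptoms suggest the flu")` holds even though the two words are neither adjacent nor in the original order.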

### 2.4 Official health reports

As official health reports, we use the ILI rate estimates published by the RCGP/PHE^{4} in the UK. The estimates represent the number of GP consultations identified as ILI per 100 people for the geographical region of England and their temporal resolution is weekly (Fig. 1).

## 3 Estimating the impact of a healthcare intervention

The proposed methodology consists of two main steps: (a) the modeling and prediction of a disease rate proxy from user-generated content as a regression problem, and (b) the assessment of the health campaign using a statistical scheme that incorporates the regression models for the disease. Alongside well-studied linear functions for text regression, we propose a nonlinear technique as a better performing alternative, in which each *n*-gram category (the set of keywords of a given size *n*) is captured by its own kernel function (see Sects. 3.1 and 3.2). The statistical framework for computing the impact of the intervention program is based on a method for evaluating the impact of printed advertisements (Lambert and Pregibon 2008); the method is described in detail in Sect. 3.3.

### 3.1 Linear regression models for disease rate prediction

In this supervised learning setting, our observations \(\mathbf {X}\) consist of *n*-gram frequencies across time and the responses \(\mathbf {y}\) are formed by official health reports, both focused on a particular geographical region. Using *N* weekly time intervals and the *M* *n*-gram features, \(\mathbf {X} \in \mathbb {R}^{N \times M}\) and \(\mathbf {y} \in \mathbb {R}^{N}\). Each row of \(\mathbf {X}\) holds the normalized *n*-gram frequencies for a week in our data set. Normalization is performed by dividing the number of *n*-gram occurrences by the total number of tweets or search queries in the corpus for that week. Previous work performing text regression on social media content suggested the use of regularized linear regression schemes (Lampos and Cristianini 2010; Lampos et al. 2010). Here, we employ two well-studied regularization techniques, namely ridge regression (Hoerl and Kennard 1970) and the elastic-net (Zou and Hastie 2005), to obtain baseline performance rates.
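The normalization step can be illustrated with a short sketch that turns weekly *n*-gram counts into the rows of \(\mathbf {X}\). The function name, the dictionary layout and the week labels are hypothetical; the sketch also assumes that the week labels sort chronologically.

```python
from typing import Dict, List, Mapping


def build_feature_matrix(weekly_counts: Mapping[str, Mapping[str, int]],
                         weekly_totals: Mapping[str, int],
                         ngrams: List[str]) -> List[List[float]]:
    """Return one row per week holding the normalized n-gram frequencies.

    weekly_counts: week label -> {n-gram: number of occurrences}
    weekly_totals: week label -> total number of tweets or queries that week
    """
    rows: List[List[float]] = []
    for week in sorted(weekly_totals):
        total = weekly_totals[week]
        counts = weekly_counts.get(week, {})
        # Divide each count by the corpus size of the same week.
        rows.append([counts.get(ngram, 0) / total for ngram in ngrams])
    return rows
```

Dividing by the weekly corpus size keeps the features comparable across weeks in which the overall volume of tweets or queries differs.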

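Regularized regression schemes build on ordinary least squares (OLS), which minimizes the sum of squared errors between a linear transformation of the observations and the respective responses:

\[ \mathop{\mathrm{argmin}}_{\mathbf{w},\beta} \sum_{i=1}^{N} \left( \mathbf{x}_i \mathbf{w} + \beta - y_i \right)^2 \]

where \(\mathbf{x}_i\) and \(y_i\) denote the observations and the response for time interval *i*, \(\mathbf{w}\) holds the feature weights and \(\beta\) is the bias term. The regularization of \(\mathbf {w}\) assists in resolving singularities which lead to ill-posed solutions when applying OLS. Broadly applied solutions suggest the penalization of either the L2 norm (ridge regression) or the L1 norm (lasso) of \(\mathbf {w}\). Ridge regression (Hoerl and Kennard 1970) is formulated as

\[ \mathop{\mathrm{argmin}}_{\mathbf{w},\beta} \left\{ \sum_{i=1}^{N} \left( \mathbf{x}_i \mathbf{w} + \beta - y_i \right)^2 + \lambda_2 \sum_{j=1}^{M} w_j^2 \right\} \]

where \(\lambda_2\) is the regularization parameter. The lasso yields sparse solutions, but its variable selection becomes unstable when features are strongly correlated (Zou and Hastie 2005). This matters for our task, as we deal with *n*-gram frequencies and semantically related *n*-grams will exhibit a degree of correlation. This is resolved by the elastic-net (Zou and Hastie 2005), an optimization function which merges L1 and L2 norm regularization, maintaining both positive properties of lasso and ridge regression. It is formulated as

\[ \mathop{\mathrm{argmin}}_{\mathbf{w},\beta} \left\{ \sum_{i=1}^{N} \left( \mathbf{x}_i \mathbf{w} + \beta - y_i \right)^2 + \lambda_1 \sum_{j=1}^{M} |w_j| + \lambda_2 \sum_{j=1}^{M} w_j^2 \right\} \]

where \(\lambda_1\) and \(\lambda_2\) control the strength of the L1 and L2 penalties respectively.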
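To make the effect of the two penalties concrete, here is a minimal sketch of an elastic-net solver. It assumes the objective is the sum of squared errors plus an L1 penalty weighted by `lam1` and an L2 penalty weighted by `lam2` (both names are placeholders), and it uses plain coordinate descent, which is a standard choice but not necessarily the solver used in the study. Setting `lam1 = 0` gives ridge regression and `lam2 = 0` gives the lasso.

```python
from typing import List, Sequence, Tuple


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink `value` towards zero by `threshold` (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X: Sequence[Sequence[float]], y: Sequence[float],
                lam1: float, lam2: float,
                sweeps: int = 200) -> Tuple[List[float], float]:
    """Coordinate descent for squared error + lam1 * L1 + lam2 * L2."""
    n, m = len(X), len(X[0])
    w = [0.0] * m
    beta = 0.0
    for _ in range(sweeps):
        for j in range(m):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = beta + sum(
                    X[i][k] * w[k] for k in range(m) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            denom = z + lam2
            w[j] = soft_threshold(rho, lam1 / 2.0) / denom if denom > 0 else 0.0
        # The bias term is not penalized: set it to the mean residual.
        beta = sum(y[i] - sum(X[i][k] * w[k] for k in range(m))
                   for i in range(n)) / n
    return w, beta
```

The L1 term can set weights exactly to zero through the soft-threshold, while the L2 term only shrinks them, which is why the combination retains sparsity while coping better with correlated features.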

### 3.2 Disease rate prediction using Gaussian processes

While the majority of methods for modeling infectious diseases are based on linear solvers (Ginsberg et al. 2009; Lampos et al. 2010; Culotta 2010), there is some evidence that nonlinear methods may be more suitable, especially when features are based on different *n*-gram lengths (Lampos 2012). Furthermore, recent studies in natural language processing (NLP) indicate that the usage of nonlinear methods, such as Gaussian Processes (GPs), in machine translation or text regression tasks improves performance, especially in cases where the feature space is not large (Lampos et al. 2014; Cohn et al. 2014). Motivated by these findings, we also considered a nonlinear model for disease prediction formed by a composite GP.

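In GP regression, for inputs \(\mathbf{x}, \mathbf{x}' \in \mathbb{R}^M\) (rows of the observation matrix \(\mathbf{X}\)), we want to learn a function *f*: \(\mathbb {R}^M \rightarrow \mathbb {R}\) that is drawn from a \(\mathcal {GP}\) prior

\[ f(\mathbf{x}) \sim \mathcal{GP}\left( \mu(\mathbf{x}), k(\mathbf{x},\mathbf{x}') \right) \]

where \(\mu(\mathbf{x})\) and \(k(\mathbf{x},\mathbf{x}')\) denote the mean and covariance (or kernel) functions respectively. The covariance functions used here are Rational Quadratic (RQ) kernels (Rasmussen and Williams 2006), defined as

\[ k_{\mathrm{RQ}}(\mathbf{x},\mathbf{x}') = \sigma^2 \left( 1 + \frac{\Vert \mathbf{x} - \mathbf{x}' \Vert_2^2}{2 \alpha \ell^2} \right)^{-\alpha} \]

where \(\sigma^2\) describes the overall level of variance, \(\ell\) is the characteristic length-scale and \(\alpha\) determines the relative weighting between small-scale and large-scale variations. In the GP framework, predictions are made via Bayesian^{5} integration, i.e.,

\[ p(y_* | \mathbf{x}_*, \mathcal{O}) = \int p(y_* | \mathbf{x}_*, f) \, p(f | \mathcal{O}) \, \mathrm{d}f \]

where \(y_*\) denotes the response variable, \(\mathcal{O}\) the training set and \(\mathbf{x}_*\) the current observation.

Since a sum of covariance functions is itself a valid covariance function, we model each of the *n*-gram categories (1-grams, 2-grams, etc.) with a different RQ kernel. The reasoning behind this is the assumption that different *n*-gram categories may have varied usage patterns, requiring different parametrization for a proper modeling. Also, as *n* increases, the *n*-gram categories are expected to have an increasing semantic value. The final covariance function, therefore, becomes

\[ k(\mathbf{x},\mathbf{x}') = \left( \sum_{n=1}^{C} k_{\mathrm{RQ}}(\mathbf{g}_n,\mathbf{g}_n') \right) + k_{\mathrm{N}}(\mathbf{x},\mathbf{x}') \]

where \(\mathbf{g}_n\) expresses the features of each *n*-gram category, i.e., \(\mathbf {x} =\) {\(\mathbf {g}_1\), \(\mathbf {g}_2\), \(\mathbf {g}_3\), \(\mathbf {g}_4\)}, *C* is equal to the number of *n*-gram categories (in our experiments, \(C = 4\)) and \(k_{\hbox {N}}(\mathbf {x},\mathbf {x}') = \sigma _{\hbox {N}}^2 \times \delta (\mathbf {x},\mathbf {x}')\) models noise (\(\delta \) being a Kronecker delta function). The summation of RQ kernels which are based on different sets of features can be seen as an exploration of the first order interactions of these feature families; more elaborate combinations of features could be studied by applying different types of covariance functions (e.g., Matérn 1986) or an additive kernel (Duvenaud et al. 2011). An extended examination of these and other models is beyond the scope of this work.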
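A small sketch may help to see how such a composite covariance is evaluated for one pair of inputs. It assumes the standard Rational Quadratic form with variance `sigma2`, length-scale `length_scale` and shape parameter `alpha`, one such kernel per *n*-gram category, plus a noise term that is added only when both inputs are the same observation. The parameter names and the `same_point` flag (standing in for the Kronecker delta) are illustrative choices, not taken from the original implementation.

```python
from typing import Dict, Sequence


def rq_kernel(g: Sequence[float], g_prime: Sequence[float],
              sigma2: float, length_scale: float, alpha: float) -> float:
    """Rational Quadratic kernel between two feature vectors of one category."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(g, g_prime))
    return sigma2 * (1.0 + sq_dist / (2.0 * alpha * length_scale ** 2)) ** (-alpha)


def composite_kernel(x: Sequence[Sequence[float]],
                     x_prime: Sequence[Sequence[float]],
                     params: Sequence[Dict[str, float]],
                     noise_var: float,
                     same_point: bool) -> float:
    """Sum of one RQ kernel per n-gram category plus a noise term.

    x and x_prime are lists of per-category feature vectors [g_1, ..., g_C];
    params holds one dict of RQ hyperparameters per category.
    """
    total = sum(rq_kernel(g, g_p, **p)
                for g, g_p, p in zip(x, x_prime, params))
    return total + (noise_var if same_point else 0.0)
```

Because each category has its own length-scale, a category whose frequencies carry little signal can be given a large length-scale and thus contribute an almost constant term, without affecting how the other categories are weighted.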

### 3.3 Intervention impact assessment

Conventional epidemiology typically assesses the impact of a healthcare intervention, such as a vaccination program, by comparing population disease rates in the affected (target) areas to the ones in non participating (control) areas (Pebody et al. 2014). However, a direct comparison of target and control areas may not always be applicable as comparable locations would need to be represented by very similar properties, such as geography, demographics and healthcare coverage. Identifying and quantifying such underlying characteristics is not something that is always possible or can be resolved in a straightforward manner. We, therefore, determine the control areas empirically, but in an automatic manner, as discussed below.

Firstly, we compute disease estimates (\(\mathbf {q}\)) for all areas using our input observations (social media and search query data) and a text regression model. Ideally, for a target area *v* we wish to compare the disease rates during (and slightly after) the intervention program (\(\mathbf {q}_v\)) with disease rates that would have occurred, had the program not taken place (\(\mathbf {q}_{v}^{*}\)). Of course, the latter information, \(\mathbf {q}_{v}^{*}\), cannot be observed, only estimated. To do so, we adopt a methodology proposed for addressing a related task, i.e., measuring the effectiveness of offline (printed) advertisements using online information (Lambert and Pregibon 2008).

Consider a situation where, prior to the commencement of the intervention program, there exists a strong linear correlation between the estimated disease rates of areas that participate in the program (*v*) and of areas that do not (*c*). Then, we can learn a linear model that estimates the disease rates in *v* based on the disease rates in *c*. Hypothesizing that the geographical heterogeneity encapsulated in this relationship does not change during and after the campaign, we can subsequently use this model to estimate disease rates in the affected areas in the absence of an intervention (\(\mathbf {q}_{v}^{*}\)).

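More formally, we first test whether the inferred disease rates in a control location *c* for a period of \(\tau = \{t_1,..,t_N\}\) days before the beginning of the intervention (\(\mathbf {q}^\tau _{c}\)) have a strong Pearson correlation, \(r(\mathbf {q}^\tau _{v},\mathbf {q}^\tau _{c})\), with the respective inferred rates in a target area *v* (\(\mathbf {q}^\tau _{v}\)). If this is true, then we can learn a linear function \(f(w,\beta ): \mathbb {R} \rightarrow \mathbb {R}\) that will map \(\mathbf {q}^\tau _{c}\) to \(\mathbf {q}^\tau _{v}\):

\[ \mathop{\mathrm{argmin}}_{w,\beta} \sum_{i=1}^{N} \left( q_c^{t_i} w + \beta - q_v^{t_i} \right)^2 \]

where \(q_v^{t_i}\) and \(q_c^{t_i}\) denote the values of \(\mathbf {q}^\tau _{v}\) and \(\mathbf {q}^\tau _{c}\) at time \(t_i\). By applying the learned function to the disease rates of the control areas during the intervention program, \(\mathbf {q}_{c}\), we obtain the projection

\[ \mathbf{q}_{v}^{*} = \mathbf{q}_{c} w + \mathbf{b} \]

where \(\mathbf{b}\) is a column vector with *N* replications of the bias term (\(\beta \)).

Two metrics quantify the difference between the estimated disease rates (\(\mathbf {q}_{v}\)) and the projected ones (\(\mathbf {q}_{v}^{*}\)). The first, \(\delta _{v}\), is the absolute difference of their mean values,

\[ \delta_v = \bar{q}_v - \bar{q}_v^{*} \]

and the second, \(\theta _{v}\), is their relative difference,

\[ \theta_v = \frac{\bar{q}_v - \bar{q}_v^{*}}{\bar{q}_v^{*}} \]

which we refer to as the impact percentage of the intervention. Negative values of \(\delta _{v}\) and \(\theta _{v}\) indicate a reduction of disease rates in the target area.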
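The projection step can be sketched in a few lines: fit a line on the pre-intervention rates, apply it to the control rates observed during the campaign, and compare the result with the rates estimated for the target area. The sketch assumes that \(\delta _{v}\) is the difference between the mean estimated and mean projected rates and that \(\theta _{v}\) is this difference relative to the projected mean; the function names and the toy numbers in the usage note are made up.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def fit_line(q_c_pre: Sequence[float],
             q_v_pre: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of q_v ~ w * q_c + beta on pre-intervention data."""
    mc, mv = mean(q_c_pre), mean(q_v_pre)
    sxx = sum((c - mc) ** 2 for c in q_c_pre)
    sxy = sum((c - mc) * (v - mv) for c, v in zip(q_c_pre, q_v_pre))
    w = sxy / sxx
    return w, mv - w * mc


def impact(q_v: Sequence[float], q_c: Sequence[float],
           w: float, beta: float) -> Tuple[float, float]:
    """Return (delta_v, theta_v) for the intervention period."""
    # Counterfactual rates for the target area, projected from the controls.
    q_v_star: List[float] = [w * c + beta for c in q_c]
    delta = mean(q_v) - mean(q_v_star)
    theta = delta / mean(q_v_star)
    return delta, theta
```

With pre-intervention rates `q_c_pre = [1, 2, 3, 4]` and `q_v_pre = [3, 5, 7, 9]`, the fitted mapping is `w = 2`, `beta = 1`; if the campaign-period rates are `q_c = [2, 4]` and `q_v = [4, 6]`, the projected mean is 7 against an estimated mean of 5, giving a relative reduction of 2/7.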

Confidence intervals (CIs) for these metrics can be derived via bootstrap sampling (Efron and Tibshirani 1994). By sampling with replacement the regression’s residuals \(\mathbf {q}^\tau _c - \hat{\mathbf {q}}^\tau _c\) in Eq. 10 (where \(\hat{\mathbf {q}}^\tau _c\) is the fit of the training data \(\mathbf {q}^\tau _v\)) and then adding them back to \(\hat{\mathbf {q}}^\tau _c\), we create bootstrapped estimates for the mapping function \(f(\dot{w},\dot{\beta })\). We additionally sample with replacement \(\mathbf {q}_{v}\) and \(\mathbf {q}_{c}\), before applying the bootstrapped function on them. This process is repeated 100,000 times and an equivalent number of estimates for \(\delta _{v}\) and \(\theta _{v}\) is computed. The CIs are derived by the .025 and .975 quantiles in the distribution of those estimates. Provided that the distribution of the bootstrap estimates is unimodal and symmetric, we assess an outcome as statistically significant, if its absolute value is higher than two standard deviations of the bootstrap estimates (similarly to Lambert and Pregibon 2008).
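The bootstrap procedure could look roughly like the sketch below. Several details are assumptions: the residuals are taken as the pre-intervention target rates minus their fitted values, the evaluation weeks of \(\mathbf {q}_{v}\) and \(\mathbf {q}_{c}\) are resampled jointly, the quantiles use a simple rank-based rule, and the default number of replications is far smaller than the 100,000 used in the text. All function names are hypothetical.

```python
import random
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for y ~ w * x + beta."""
    mx, my = mean(x), mean(y)
    w = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
    return w, my - w * mx


def bootstrap_theta(q_c_pre: Sequence[float], q_v_pre: Sequence[float],
                    q_c: Sequence[float], q_v: Sequence[float],
                    n_boot: int = 2000, seed: int = 0) -> List[float]:
    """Bootstrap estimates of the relative impact theta_v."""
    rng = random.Random(seed)
    w, beta = fit_line(q_c_pre, q_v_pre)
    fitted = [w * c + beta for c in q_c_pre]
    residuals = [v - f for v, f in zip(q_v_pre, fitted)]
    weeks = range(len(q_v))
    thetas: List[float] = []
    for _ in range(n_boot):
        # Resample residuals, add them back to the fit, refit the mapping.
        resampled = [f + rng.choice(residuals) for f in fitted]
        w_b, beta_b = fit_line(q_c_pre, resampled)
        # Resample the evaluation weeks with replacement (paired sampling).
        idx = [rng.choice(weeks) for _ in weeks]
        observed = mean(q_v[i] for i in idx)
        projected = mean(w_b * q_c[i] + beta_b for i in idx)
        thetas.append((observed - projected) / projected)
    return thetas


def percentile_interval(samples: Sequence[float], lower: float = 0.025,
                        upper: float = 0.975) -> Tuple[float, float]:
    """Rank-based approximation of the lower and upper quantiles."""
    ordered = sorted(samples)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


def is_significant(point_estimate: float, samples: Sequence[float]) -> bool:
    """Two-standard-deviation rule applied to the bootstrap estimates."""
    return abs(point_estimate) > 2.0 * stdev(samples)
```

The same loop can collect \(\delta _{v}\) alongside \(\theta _{v}\); only the last line inside the loop changes. The two-standard-deviation rule is meaningful only when the bootstrap distribution is unimodal and symmetric, as noted above.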

## 4 Results

In the following sections, we apply the previously described framework to assess the UK’s pilot school children LAIV campaign based on user-generated Internet data. First, we evaluate the aforementioned regression methods that provide a proxy for ILI via the modeling of Bing and Twitter content geo-located in England. As ‘ground truth’ in these experiments, we use ILI rates (see Fig. 1) published by the RCGP/PHE. We then use the best performing regression model in the framework for estimating the impact of the vaccination campaign.

### 4.1 Predictive performance for ILI inference methods

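Each model is evaluated with a 10-fold cross validation. Performance is measured with the Pearson correlation (*r*), which is not always indicative of the prediction accuracy, and the MAE between predictions (\(\hat{\mathbf {y}}\)) and ‘ground truth’ (\(\mathbf {y}\)). For *N* predictions of a single fold, MAE is defined as

\[ \mathrm{MAE}(\hat{\mathbf{y}}, \mathbf{y}) = \frac{1}{N} \sum_{i=1}^{N} |\hat{y}_i - y_i| \]

The average *r* and MAE on the 10 folds are computed together with their corresponding standard deviations.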
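The two performance measures are straightforward to compute, and a tiny example shows why the correlation alone can be misleading. The function names and the toy numbers are illustrative only.

```python
from math import sqrt
from statistics import mean
from typing import Sequence


def mae(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Mean Absolute Error between predictions and ground truth."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def pearson_r(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Predictions that track the truth perfectly but with a constant offset.
truth = [1.0, 2.0, 3.0, 4.0]
pred = [3.0, 4.0, 5.0, 6.0]
print(pearson_r(pred, truth), mae(pred, truth))
```

Here the predictions are perfectly correlated with the ground truth, yet every one of them is off by 2, which is why both measures are reported.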

Given that the extracted tweets had a more extended temporal coverage compared to the search queries, we have performed experiments on the following data sets: (a) Twitter data for the period \(\varDelta \hbox {t}_1 = 154\) weeks, from 02/05/2011 to 13/04/2014, a time period that encompasses three influenza seasons, (b) search query log data from Bing for the period \(\varDelta \hbox {t}_3 = 67\) weeks, from 31/12/2012 to 13/04/2014, and (c) Twitter data for the same period \(\varDelta \hbox {t}_3\). All data sets consist of content geo-located in England, and the respective time periods are depicted in Fig. 1. The last data set (c) permits a better comparison between Twitter and Bing data.

*t* test (\(p = .0471\)); this statistically significant difference is replicated in all experiments (\(p < .005\)) indicating that the GP model handles the ILI inference task better. Bing data provide a better inference performance as compared to Twitter data from the same time period (\(\mu (r) = .952\), \(\mu (\hbox {MAE}) = 1.598\times 10^{-3}\)), but in that case the difference in performance between the two sources is not statistically significant at the 5 % level (\(p = .1876\)). The usefulness of incorporating different *n*-gram categories and not just 1-grams has also been empirically verified (see Appendix 2, Table 5). Experiments where Bing and Twitter data were combined (by feature aggregation or different kernels) indicated a small performance drop. However, this cannot form a generalized conclusion as it may be a side effect of the data properties (format, time-span) we were able to work with. We leave the exploration of more advanced data combinations for future work.

Table 2. Performance of ILI estimators for England under all investigated models and data sets (\(\mathcal {T}\): Twitter, \(\mathcal {B}\): Bing) based on a 10-fold cross validation

Data set | Ridge regression \(\mu (r)\) | Ridge regression \(\mu \)(MAE)\( \, \times \,10^3\) | Elastic-Net \(\mu (r)\) | Elastic-Net \(\mu \)(MAE)\( \, \times \,10^3\) | GP-kernel \(\mu (r)\) | GP-kernel \(\mu \)(MAE)\( \,\times \,10^3\) |
---|---|---|---|---|---|---|
\(\mathcal {T}\), \(\varDelta \hbox {t}_1\) | .640 (.112) | 3.074 (.497) | .718 (.206) | 2.828 (.809)\(^{*}\) | .845 (.062) | 2.196 (.477)\(^{*}\) |
\(\mathcal {T}\), \(\varDelta \hbox {t}_3\) | .698 (.181) | 4.084 (.879) | .744 (.137) | 3.198 (.137)\(^{*}\) | .924 (.053) | 1.999 (.763)\(^{*,\dagger }\) |
\(\mathcal {B}\), \(\varDelta \hbox {t}_3\) | .814 (.103) | 2.963 (.638) | .867 (.067) | 2.564 (.677)\(^{*}\) | .952 | 1.598 |

### 4.2 Assessing the impact of the LAIV campaign

Taking into account the results presented in the previous section, we rely on the best performing GP-kernel model for estimating an ILI proxy. For both Twitter and Bing, we have used ILI models trained on all data geo-located in England (time frames \(\varDelta \hbox {t}_1\) and \(\varDelta \hbox {t}_3\) apply respectively). After learning a generic model for England, we then use it to infer ILI rates in specific locations.^{6}

To assess the impact of the LAIV campaign, we first need to identify control areas with estimated ILI rates that are strongly correlated to rates in the target vaccinated locations before the start of the LAIV program (Table 1 lists all the considered areas). As the strains of influenza virus may vary between distant time periods (Smith et al. 2004), invalidating our hypothesis for geographical homogeneity across the considered flu seasons, we look for correlated areas in a pre-vaccination period that includes the previous flu season only (2012/2013). For Twitter data, this is from June, 2012 to August, 2013 (all inclusive), whereas for Bing data, given their smaller temporal coverage, the period was from January to August, 2013 (all inclusive). To determine the best control areas, an exhaustive search is performed comparing the correlation between vaccinated and control areas, for all individual areas and supersets of them.
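The exhaustive search over control areas can be sketched as a loop over all non-empty subsets of candidates, keeping the subset whose aggregated pre-vaccination rates correlate best with the target. Aggregating a subset by the unweighted mean of its areas' rates is an assumption made here for simplicity, and the function names are hypothetical. With 12 candidate areas there are 4095 subsets, so brute force is cheap.

```python
from itertools import combinations
from math import sqrt
from statistics import mean
from typing import Dict, Sequence, Tuple


def pearson_r(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    denom = sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return cov / denom if denom > 0 else 0.0


def best_controls(target: Sequence[float],
                  controls: Dict[str, Sequence[float]]
                  ) -> Tuple[Tuple[str, ...], float]:
    """Return the control subset with the highest correlation to the target."""
    best_subset: Tuple[str, ...] = ()
    best_r = -1.0
    for size in range(1, len(controls) + 1):
        for subset in combinations(sorted(controls), size):
            # Aggregate the subset by averaging the areas' weekly rates.
            aggregate = [mean(controls[c][t] for c in subset)
                         for t in range(len(target))]
            r = pearson_r(target, aggregate)
            if r > best_r:
                best_subset, best_r = subset, r
    return best_subset, best_r
```

Selecting the best of many subsets on the same pre-vaccination data can overstate the correlation, which is one reason to check how the impact estimates behave under other, similarly correlated control sets.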

Table 3. Statistically significant estimates of the LAIV program’s impact on the vaccinated areas using Twitter (\(\mathcal {T}\)) or Bing (\(\mathcal {B}\)) data

Data | Targets (*v*) | Controls (*c*) | \(r(v,c)\) | \(\delta _v \times 10^3\) | \(\theta _v\) (%) |
---|---|---|---|---|---|
\(\mathcal {T}\) | all | \(c_1-c_3\), \(c_5-c_8\), \(c_{10}\) | . | \(-\) | \(-\) |
\(\mathcal {T}\) | \(v_5,v_6\) | \(c_1-c_4\), \(c_6\), \(c_7\), \(c_{12}\) | .738 | \(-\)1.727 (\(-\)2.523, \(-\)0.942) | \(-\)30.453 (\(-\)41.751, \(-\)17.516) |
\(\mathcal {T}\) | \(v_2\) | \(c_1\), \(c_3\), \(c_4\), \(c_7-c_9\), \(c_{11}\) | .769 | \(-\)1.181 (\(-\)2.274, \(-\)0.094) | \(-\)21.060 (\(-\)37.136, \(-\)1.821) |
\(\mathcal {T}\) | \(v_6\) | \(c_1\), \(c_3\), \(c_4\), \(c_6\) | .738 | \(-\)1.633 (\(-\)2.782, \(-\)0.521) | \(-\)30.436 (\(-\)46.742, \(-\)10.627) |
\(\mathcal {B}\) | all | \(c_1\), \(c_2\), \(c_4-c_7\), \(c_{11}\) | . | \(-\) | \(-\) |
\(\mathcal {B}\) | \(v_5,v_6\) | \(c_4-c_7\), \(c_{11}\) | .848 | \(-\)2.811 (\(-\)4.073, \(-\)1.566) | \(-\)28.372 (\(-\)36.717, \(-\)17.943) |
\(\mathcal {B}\) | \(v_3\) | \(c_7\) | .618 | \(-\)3.737 (\(-\)6.908, \(-\)0.878) | \(-\)30.246 (\(-\)44.624, \(-\)9.174) |

The time period used for evaluating the LAIV program includes the weeks starting from 30/09/2013 and ending on 13/04/2014 (28 weeks in total), i.e., the time frame covering the actual campaign (up to January, 2014) plus the weeks up until the end of the flu season (see Fig. 1). The bootstrap estimates for both impact metrics (\(\delta _v\) and \(\theta _v\)) provide confidence intervals as well as a measure for testing the statistical significance of an outcome. Given that the distribution of the bootstrap estimates appears to be unimodal and symmetric (see Appendix 2, Fig. 4), an outcome is considered statistically significant if its absolute value is higher than two standard deviations of the bootstrap estimates. The statistically significant impact estimates (Table 3) indicate a reduction of ILI rates, with impact percentages ranging from \(-21.06\) % to \(-32.77\) %. Interestingly, the estimated impact for the London areas is in a similar range for both Bing and Twitter data (\(-28.37\) % to \(-30.45\) %).

### 4.3 Sensitivity of impact estimates

Table 4. Sensitivity assessment of LAIV campaign’s impact estimates (cases are aligned with Table 3)

Data set | Targets | # Controls | \(\mu \left( r (v,c)\right) \) | \(\mu \left( \delta _v\right) \times 10^3\) | \(\mu \left( \theta _v\right) \) (%) | \(\varDelta \theta _v\) (%) |
---|---|---|---|---|---|---|
\(\mathcal {T}\) | all | 100 | .841 (0.007) | \(-\)2.506 (0.234) | \(-\)32.740 (2.066) | |
\(\mathcal {T}\) | \(v_5,v_6\) | 79 | .703 (0.011) | \(-\)1.532 (0.148) | \(-\)27.918 (1.955) | 8.32 |
\(\mathcal {T}\) | \(v_2\) | 8 | .744 (0.015) | \(-\)1.236 (0.111) | \(-\)21.793 (1.516) | |
\(\mathcal {T}\) | \(v_6\) | 32 | .705 (0.013) | \(-\)1.340 (0.218) | \(-\)26.277 (3.149) | 13.66 |
\(\mathcal {B}\) | all | 46 | .854 (0.003) | \(-\)1.382 (0.369) | \(-\)16.417 (3.590) | 24.36 |
\(\mathcal {B}\) | \(v_5,v_6\) | 100 | .841 (0.002) | \(-\)1.448 (0.212) | \(-\)16.899 (1.827) | 40.44 |
\(\mathcal {B}\) | \(v_3\) | 2 | .607 (0.016) | \(-\)3.229 (0.719) | \(-\)27.120 (4.421) | 10.34 |

## 5 Related work

User-generated web content has been used to model infectious diseases, such as influenza-like illness (Milinovich et al. 2014). Coined as “infodemiology” (Eysenbach 2006), this research paradigm was first applied to queries submitted to the Yahoo engine (Polgreen et al. 2008). It became broadly known after the launch of the Google Flu Trends (GFT) platform (Ginsberg et al. 2009). Both modeling attempts used simple variations of linear regression between the frequency of specific keywords (e.g., ‘flu’) or complete search queries (e.g., ‘how to reduce fever’) and ILI rates reported by syndromic surveillance. In the latter case, the feature selection process, i.e., deciding which queries to include in the predictive model, was based on a correlation analysis between query frequency and published ILI rates (Ginsberg et al. 2009). However, GFT has been criticized because on several occasions its publicly available outputs deviated significantly from the official ILI rate reports (Cook et al. 2011; Olson et al. 2013; Lazer et al. 2014).

Research has also considered content coming from the social platform of Twitter as a publicly available alternative for accessing user-generated information. Regression models, either regularized (Lampos and Cristianini 2010; Lampos et al. 2010) or based on a smaller set of features (Culotta 2010), were used to infer ILI rates. Qualitative properties of the H1N1 pandemic in 2009 have been investigated through an analysis of tweets containing specific keywords (Chew and Eysenbach 2010) as well as a more generic modeling (Signorini et al. 2011); in the latter work, support vector regression (Cristianini and Shawe-Taylor 2000) was used to estimate ILI rates. Bootstrapped regularized regression (Bach 2008) has been applied to make the feature selection process more robust (Lampos and Cristianini 2012); the same method has been applied to infer rainfall rates from tweets, indicating some generalization capabilities of those techniques. Furthermore, evidence has been provided that for Twitter content a small set of keywords can provide an adequate prediction performance (Culotta 2013). Other studies focused on unsupervised models that applied NLP methods in order to identify disease-oriented tweets (Lamb et al. 2013) or automatically extract health concepts (Paul and Dredze 2014).

In this paper, we base our ILI modeling on previous findings, but apart from relying on a linear model, we also investigate the performance of a nonlinear multi-kernel GP (Rasmussen and Williams 2006). GPs have been applied in a number of fields, ranging from geography (Oliver and Webster 1990) to sports analytics (Miller et al. 2014). Recently, they were also used—as a better performing alternative—in NLP tasks such as the annotation modeling for machine translation (Cohn and Specia 2013), text regression (Lampos et al. 2014), and text classification (Preoţiuc-Pietro et al. 2015), where various multi-modal features were combined in one learning function. To the best of our knowledge, there has been no previous work aiming to model the impact of a health intervention through user-generated online content. This evaluation is usually conducted by an analysis of the various epidemiological surveillance outputs, if they are available (Pebody et al. 2014; Matsubara et al. 2014). The core methodology (and its statistical properties) on which we based our impact analysis has been proposed by Lambert and Pregibon (2008).

## 6 Discussion

We presented a statistical framework for transforming user-generated content published on web platforms to an assessment of the impact of a health-oriented intervention. As an intermediate step, we proposed a kernelized nonlinear GP regression model for learning disease rates from *n*-gram features. Assuming that an ILI model trained on a national level sufficiently represents smaller parts of the country, we used it as our ILI scoring tool throughout our experiments. Focusing on the theme of influenza vaccinations (Osterholm et al. 2012; Baguelin et al. 2012), especially after the H1N1 epidemic in 2009 (Smith et al. 2009), we measured the impact of a pilot primary school LAIV program introduced in England during the 2013/14 flu season. Our experimental results are in concordance with independent findings from traditional influenza surveillance measurements (Pebody et al. 2014). The derived vaccination impact assessments resulted in percentages (per vaccinated area or cumulatively) ranging from \(-21.06\) to \(-32.77\) % based on the two data sources available.

The results from Twitter data, however, demonstrated less sensitivity across similar controls as compared to Bing data, suggesting a greater reliability. To that end, the most reliable impact estimate from the processed tweets concerned the aggregation of all vaccinated locations and was equal to \(-32.77\) %. PHE’s own impact estimates looked at various end-points, comparing vaccinated to all non-vaccinated areas, and ranged from \(-66\) % based on sentinel surveillance ILI data to \(-24\) % using laboratory confirmed influenza hospitalizations. However, these numbers represent different levels of severity or sensitivity, and notably none of these computations yielded statistical significance (Pebody et al. 2014). Thus, we cannot use them as a directly comparable metric, but mostly as a qualitative indication that the vaccination campaign is likely to have been effective.
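To make concrete how a relative impact percentage of this kind can be obtained, the following is a minimal Python sketch rather than the implementation used in our experiments. It assumes a single aggregated control series, an ordinary least squares mapping from control to target rates fitted on the pre-intervention period, and a relative impact defined as the difference between the mean observed and mean projected rates, divided by the mean projected rate. The function names and the weekly rates are hypothetical and chosen only for illustration.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x for one control series."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("control series is constant; slope is undefined")
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    return a, b


def relative_impact(control_pre, target_pre, control_post, target_post):
    """Relative difference between observed and projected target rates.

    The projection estimates the rates that the target area might have
    shown without the intervention, based on its historical relationship
    with the control area.
    """
    a, b = fit_line(control_pre, target_pre)
    projected = [a + b * c for c in control_post]
    delta = mean(target_post) - mean(projected)
    return delta / mean(projected)


# Hypothetical weekly disease rate estimates (not data from this study).
control_pre = [1.0, 2.0, 3.0, 4.0]   # control area, before the campaign
target_pre = [2.0, 4.0, 6.0, 8.0]    # target area, before the campaign
control_post = [5.0, 6.0]            # control area, during the campaign
target_post = [8.0, 9.6]             # target area, during the campaign

theta = relative_impact(control_pre, target_pre, control_post, target_post)
print(f"{100 * theta:.1f} %")  # prints "-20.0 %"
```

A negative value indicates that the observed rates in the target area were lower than the projection. In practice, the uncertainty of such an estimate also needs to be assessed before it is interpreted, which this sketch does not attempt.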

A legitimate question is whether our analysis can yield one number that quantifies the intervention’s impact. This is a difficult undertaking given that no definite ground truth exists to allow for a proper verification. In addition, our estimations are based on models trained on syndromic surveillance data, which themselves may lack some specificity, hence not forming a solid gold standard. Interestingly, for the three distinct areas, where our method delivered statistically significant impact estimates based on Twitter data, i.e., Havering (\(-41.21\) %; see Appendix 2, Table 6), Newham (\(-30.44\) %) and Cumbria (\(-21.06\) %), there exists a clear analogy with the reported level of vaccine uptake—63.8, 45.6 and 35.8 % respectively—as published by PHE (Pebody et al. 2014); a similar pattern is evident in the Bing data. This observation provides further support for the applied methodology.
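The correspondence between vaccine uptake and estimated impact in these three areas can be illustrated with a short Python sketch. It computes a Pearson correlation over the three pairs of values quoted above; the helper function `pearson` is our own illustrative name. With only three points this is a descriptive check, not a statistical test, and it should not be read as evidence of significance.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Values quoted in the text for Havering, Newham and Cumbria.
uptake = [63.8, 45.6, 35.8]        # vaccine uptake (%)
impact = [-41.21, -30.44, -21.06]  # Twitter-based impact estimate (%)

r = pearson(uptake, impact)
print(f"r = {r:.2f}")  # strongly negative: higher uptake, larger reduction
```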

Understanding the properties of the underlying population behind each disease surveillance metric is instrumental. First of all, the demographics (age, social class) of people who use a social media tool, a web search engine, or visit healthcare facilities may vary. For example, we know that 51 % of the UK-based Twitter users are relatively young (15–34 years old), whereas only 11 % of them are 55 years or older (Ipsos MORI 2014). On the other hand, non-adults or the elderly are often responsible for the majority of doctor visits or hospital admissions (O’Hara and Caswell 2012). In addition, the relative volume of the aforementioned inputs also varies. We estimate that Twitter users in our experiments represent at most 0.24 % of the UK population, whereas Bing has a larger penetration (approx. 4.2 %; see Appendix 1 for details). For comparison, a 5-year study (2006–2011) on a household-level community cohort in England indicated that only 17 % of the people with confirmed influenza are medically attended (Hayward et al. 2014). Another study estimated that 7500 (0.01 %) hospitalizations occurred due to the second and strongest wave of the 2009 H1N1 pandemic in England, when the percentage of the population being symptomatic was approx. 2.7 % (Presanis et al. 2011). It is, therefore, worthwhile to seek complementary ways, sensors or population samples for quantifying infectious diseases or the success of a healthcare intervention campaign.

Our method accesses a different segment of the population compared to traditional surveillance schemes, given that Internet users provide a potentially larger sample compared to the people seeking medical attention. The caveat is that user-generated content will be noisier, and thus less reliable, than doctor reports, and that it will entail certain biases. However, it can be advantageous when data from traditional epidemiological sources are sparse, e.g., due to a mild influenza season, and also useful in other settings, where either traditional surveillance schemes are not well established or a more geographically focused signal is required. Despite the fact that our case study focuses on influenza, the proposed framework can potentially be adapted for estimating the impact of different health intervention scenarios. Future work should focus on improving the various components of such frameworks as well as on designing experimental settings that allow for a more rigorous evaluation.

## Footnotes

- 1.
- 2.
PHE is an executive agency for the Department of Health in England.

- 3.
The exact number cannot be disclosed as this is sensitive product information.

- 4.
RCGP has an established sentinel network of general practitioners in England and together with PHE publishes ILI rates on a weekly basis. Summaries of surveillance reports can be found at http://www.gov.uk/sources-of-uk-flu-data-influenza-surveillance-in-the-uk (accessed May 31, 2015).

- 5.
Note that it is not strictly Bayesian in the sense that no prior is assumed for each one of the hyper-parameters in the \(\mathcal {GP}\) function.

- 6.
This decision is also enforced by the lack of ground truth for specific locations.

## Notes

### Acknowledgments

This work has been supported by the EPSRC Grant EP/K031953/1 (“Early-Warning Sensing Systems for Infectious Diseases”). The authors would also like to acknowledge the Royal College of General Practitioners in the UK (in particular Simon de Lusignan) and Public Health England for providing ILI surveillance data.

## References

- Bach FR (2008) Bolasso: model consistent lasso estimation through the bootstrap. In: Proceedings of the 25th International Conference on Machine Learning, pp 33–40
- Baguelin M, Jit M, Miller E, Edmunds WJ (2012) Health and economic impact of the seasonal influenza vaccination programme in England. Vaccine 30(23):3459–3462
- Binder S, Levitt AM, Sacks JJ, Hughes JM (1999) Emerging infectious diseases: public health issues for the 21st Century. Science 284(5418):1311–1313
- Boivin G, Hardy I, Tellier G, Maziade J (2000) Predicting influenza infections during epidemics with use of a clinical case definition. Clin Infect Dis 31(5):1166–1169
- Bollen J, Mao H, Zeng X (2011) Twitter mood predicts the stock market. J Comput Sci 2(1):1–8
- Briand S, Mounts A, Chamberland M (2011) Challenges of global surveillance during an influenza pandemic. Public Health 125(5):247–256
- Chew C, Eysenbach G (2010) Pandemics in the age of Twitter: content analysis of tweets during the 2009 H1N1 outbreak. PLoS ONE 5(11):e14118
- Cohen ML (2000) Changing patterns of infectious disease. Nature 406(6797):762–767
- Cohn T, Specia L (2013) Modelling annotator bias with multi-task Gaussian processes: an application to machine translation quality estimation. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pp 32–42
- Cohn T, Preoţiuc-Pietro D, Lawrence N (2014) Gaussian processes for natural language processing. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: Tutorials, pp 1–3
- Cook S, Conrad C, Fowlkes AL, Mohebbi MH (2011) Assessing Google flu trends performance in the United States during the 2009 influenza virus A (H1N1) pandemic. PLoS ONE 6(8):e23610
- Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other Kernel-based learning methods. Cambridge University Press, Cambridge
- Culotta A (2010) Towards detecting influenza epidemics by analyzing Twitter messages. In: Proceedings of the 1st Workshop on Social Media Analytics, pp 115–122
- Culotta A (2013) Lightweight methods to estimate influenza rates and alcohol sales volume from Twitter messages. Lang Resour Eval 47(1):217–238
- Duvenaud DK, Nickisch H, Rasmussen CE (2011) Additive Gaussian processes. Adv Neural Inf Process Syst 24:226–234
- Efron B, Tibshirani RJ (1994) An introduction to the bootstrap. CRC Press, Boca Raton
- Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
- Eysenbach G (2006) Infodemiology: tracking flu-related searches on the web for syndromic surveillance. In: AMIA Annual Symposium Proceedings, pp 244–248
- Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2009) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014
- Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning. Springer, New York
- Hayward AC, Fragaszy EB, Bermingham A, Wang L, Copas A, Edmunds WJ et al (2014) Comparative community burden and severity of seasonal and pandemic influenza: results of the Flu Watch cohort study. Lancet Respir Med 2(6):445–454
- Hoerl AE, Kennard RW (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12:55–67
- Ipsos MORI (2014) MediaCT Tech Tracker Q1. Technical Report
- Jones KE, Patel NG, Levy MA, Storeygard A, Balk D et al (2008) Global trends in emerging infectious diseases. Nature 451(7181):990–993
- Lamb A, Paul MJ, Dredze M (2013) Separating fact from fear: tracking flu infections on Twitter. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp 789–795
- Lambert D, Pregibon D (2008) Online effects of offline ads. In: Proceedings of the 2nd International Workshop on Data Mining and Audience Intelligence for Advertising, pp 10–17
- Lampos V (2012) Detecting events and patterns in large-scale user generated textual streams with statistical learning methods. Ph.D. Thesis, University of Bristol, Bristol
- Lampos V, Cristianini N (2010) Tracking the flu pandemic by monitoring the Social Web. In: Proceedings of the 2nd International Workshop on Cognitive Information Processing, pp 411–416
- Lampos V, Cristianini N (2012) Nowcasting events from the social web with statistical learning. ACM Trans Intell Syst Technol 3(4):72:1–72:22
- Lampos V, De Bie T, Cristianini N (2010) Flu detector: tracking epidemics on Twitter. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pp 599–602
- Lampos V, Preoţiuc-Pietro D, Cohn T (2013) A user-centric model of voting intention from Social Media. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pp 993–1003
- Lampos V, Aletras N, Preoţiuc-Pietro D, Cohn T (2014) Predicting and Characterising User Impact on Twitter. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp 405–413
- Lazer D, Kennedy R, King G, Vespignani A (2014) The parable of Google flu: traps in big data analysis. Science 343(6176):1203–1205
- Leetaru K, Wang S, Cao G, Padmanabhan A, Shook E (2013) Mapping the global Twitter heartbeat: the geography of Twitter. First Monday 18(5). doi: 10.5210/fm.v18i5.4366
- Matérn B (1986) Spatial variation. Springer, Berlin
- Matsubara Y, Sakurai Y, van Panhuis WG, Faloutsos C (2014) FUNNEL: Automatic Mining of Spatially Coevolving Epidemics. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 105–114
- Milinovich GJ, Williams GM, Clements ACA, Hu W (2014) Internet-based surveillance systems for monitoring emerging infectious diseases. Lancet Infect Dis 14(2):160–168
- Miller A, Bornn L, Adams R, Goldsberry K (2014) Factorized point process intensities: a spatial analysis of professional basketball. In: Proceedings of the 31st International Conference on Machine Learning, pp 235–243
- Monto A, Gravenstein S, Elliott M, Colopy M, Schweinle J (2000) Clinical signs and symptoms predicting influenza infection. Arch Intern Med 160(21):3243–3247
- Morens DM, Folkers GK, Fauci AS (2004) The challenge of emerging and re-emerging infectious diseases. Nature 430(6996):242–249
- O’Connor B, Balasubramanyan R, Routledge BR, Smith NA (2010) From Tweets to polls: linking text sentiment to public opinion time series. In: Proceedings of the 4th International AAAI Conference on Weblogs and Social Media, pp 122–129
- Office for National Statistics, Great Britain (2013) Internet Access - Households and Individuals 2013. Technical Report
- Office for National Statistics, Great Britain (2014a) Annual Mid-year Population Estimates. Technical Report
- Office for National Statistics, Great Britain (2014) Internet Access - Households and Individuals 2014. Technical Report
- O’Hara B, Caswell K (2012) Health status, health insurance, and medical services utilization: 2010. Curr Popul Rep 2012:70–133
- Oliver MA, Webster R (1990) Kriging: a method of interpolation for geographical information systems. Int J Geogr Inf Syst 4(3):313–332
- Olson DR, Konty KJ, Paladini M, Viboud C, Simonsen L (2013) Reassessing Google flu trends data for detection of seasonal and pandemic influenza: a comparative epidemiological study at three geographic scales. PLoS Comput Biol 9(10):e1003256
- Osterholm MT, Kelley NS, Sommer A, Belongia EA (2012) Efficacy and effectiveness of influenza vaccines: a systematic review and meta-analysis. Lancet Infect Dis 12(1):36–44
- Paul MJ, Dredze M (2014) Discovering health topics in social media using topic models. PLoS ONE 9(8):e103408
- Pebody RG, Green HK, Andrews N, Zhao H, Boddington N et al (2014) Uptake and impact of a new live attenuated influenza vaccine programme in England: early results of a pilot in primary school-age children, 2013/14 influenza season. Euro Surveill 19(22):20823
- Petrie JG, Ohmit SE, Cowling BJ, Johnson E, Cross RT et al (2013) Influenza transmission in a Cohort of households with children: 2010–2011. PLoS ONE 8(9):e75339
- Polgreen PM, Chen Y, Pennock DM, Nelson FD, Weinstein RA (2008) Using internet searches for influenza surveillance. Clin Infect Dis 47(11):1443–1448
- Preoţiuc-Pietro D, Lampos V, Aletras N (2015) An analysis of the user occupational class through Twitter content. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics
- Presanis AM, Pebody RG, Paterson BJ, Tom BDM, Birrell PJ et al (2011) Changes in severity of 2009 pandemic A/H1N1 influenza in England: a Bayesian evidence synthesis. BMJ 343:d5408
- Rasmussen CE, Nickisch H (2010) Gaussian processes for machine learning (GPML) toolbox. J Mach Learn Res 11:3011–3015
- Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. MIT Press, Cambridge
- Reed C, Angulo FJ, Swerdlow DL, Lipsitch M, Meltzer MI, Jernigan D, Finelli L (2009) Estimates of the prevalence of pandemic (H1N1) 2009. Emerg Infect Dis. doi: 10.3201/eid1512.091413
- Signorini A, Segre AM, Polgreen PM (2011) The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic. PLoS ONE 6(5):e19467
- Smith DJ, Lapedes AS, de Jong JC, Bestebroer TM, Rimmelzwaan GF et al (2004) Mapping the antigenic and genetic evolution of influenza virus. Science 305(5682):371–376
- Smith GJD, Vijaykrishna D, Bahl J, Lycett SJ, Worobey M et al (2009) Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature 459:1122–1125
- Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B 58(1):267–288
- Zhao P, Yu B (2006) On model selection consistency of lasso. J Mach Learn Res 7:2541–2563
- Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc B 67(2):301–320

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.