Background

Many countries rely on surveys to estimate the prevalence and incidence of health measures and health-related exposures that are not routinely collected for the whole population. These estimates are important for purposes including resource planning and monitoring the effectiveness of national strategies. However, survey samples are rarely representative of the target population. In particular, individuals who are older, sick, disadvantaged, or display poor health behaviours are often under-represented in surveys [1], an important omission as these are often the groups of most interest when identifying priorities for health and economic policies. In addition, many surveys are too small to allow sub-population analyses. The lack of alignment between survey and population data is growing as a result of declining response levels over time [2]. Common solutions to this problem include using inverse probability weighting and/or post-stratification weights to correct survey-based results for departures from representativeness of the target population arising from the sampling design and/or response level. However, this approach is often limited as weights are usually based on a restricted combination of population characteristics (typically age and sex). More elaborate bespoke techniques have been used to some effect to obtain superior population and sub-population level estimates from surveys [3,4,5], for example using multilevel models to estimate variables of interest in terms of respondents’ characteristics and those of the area in which respondents live, with results from these models then weighted by the frequency of the modelled characteristics in the target population [4, 6]. While this type of approach offers an improvement on traditional weighting by including more variables common to the survey and target population, its use is still limited, in particular because it produces group-level estimates rather than the individual-level data required for further statistical analysis.

In contrast to specific surveys, administrative data, for example population censuses, are often highly representative of the target population. However, in order to maintain high response levels, reduce participant burden, and keep costs down, only a limited number of questions are asked [7]. For example, many population censuses do not include smoking in spite of its recognised importance in determining health outcomes [8]. Estimates of the effect of smoking, and of the impact of interventions or policies aimed at its reduction, are therefore based on survey data, with the accompanying problems of smaller samples and non-response. As a result, findings may be unrepresentative of target populations, and time trends and estimates for certain subgroups and small geographic areas, often those of most interest in terms of improving health and reducing inequalities, are imprecise [9]. Indeed, government departments and others have advocated for smoking to be included in the UK census, but this has not been adopted and is unlikely to happen in the future [10]. There is therefore a need for a simple, transparent and robust approach that allows analysis of representative data, such as population censuses, to be carried out as if smoking, or any other variable of interest, had been included.

One such method is to impute the variable of interest into the census using data from a survey in which the variable is included, applying standard multiple imputation (MI) techniques [11]. This idea is not new [12], but we are not aware of any instances of its practical application. We have therefore tested the validity of the approach using data from the English population census and a corresponding English survey, taking general self-rated health as the example because it is available in both datasets, allowing direct comparison of data imputed into the census from the survey with the actual census data. These census data are widely used by government, policy-makers and researchers to explore socio-economic and geographical patterning of health [13], and their possible enhancement with survey data has great potential to improve health, economic and social policy.

Methods

Datasets

The census dataset is the Census Microdata Individual Safeguarded Sample (Regional): England and Wales [14], a 5% stratified sample of the April 2011 census that is proportionally representative of the population of those countries. This dataset provides anonymised individual-level data on a wider range of variables than the standard census tables, which, although based on the whole population, are typically limited to cross-tabulations of three or four variables. Self-rated health in the census was based on the question “How is your health in general? (Very good, Good, Fair, Bad, Very bad)”. Survey data are from the Integrated Household Survey [15], which combines core questions asked in the General Lifestyle Survey, the Living Cost and Food Survey, and the Labour Force/Annual Population Survey. We chose the April 2011 to March 2012 dataset as this was contemporaneous with the census and the question wording was similar to that in the 2011 census. Self-rated health in the Integrated Household Survey was based on the question “How is your health in general; would you say it was … (Very good, Good, Fair, Bad, or Very bad)?”. Both the survey and census data were obtained from the UK Data Archive.

The population frame for both datasets was:

  a. Those living in England, as each country’s census varied slightly and there are national differences in how self-rated health is conceptualised in relation to objective health [16]; the choice of the largest UK country therefore simplified the analysis.

  b. Those aged 25–64, as highest qualification was included in our MI and fewer older people had formal qualifications.

  c. Those living in non-communal households and classed as usual residents (so excluding students usually living away from home and visitors), as the surveys do not comprehensively cover communal establishments.

Variables forming the basis of the MI were chosen to be predictors of self-rated health that were available and able to be harmonised across the two datasets (Table 1). Key variables (those included in the MI and also forming the basis of stratification in the examination of health patterning) were age, sex, housing tenure, and English region. Auxiliary variables (those included in the MI but not in the subsequent stratification) were education, marital status, country of birth, and ethnicity. The ethnicity variable presented in Table 1 is a five-category summary for convenience; the variable used in the imputation model split respondents into 18 distinct groups. A small number of survey records with missing values for any of the harmonised variables were dropped to maintain the simplicity of the subsequent imputation; there were no missing values in the census records, as these are imputed by the census offices before release. Before imputation the harmonised datasets were combined into one file and the self-rated health variable was deleted from the census records, leaving a dataset with “missing” self-rated health to be imputed from the remaining (survey) data.
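
As a concrete illustration, the combination step might be carried out along the following lines in Stata; the file and variable names (survey_harmonised, census_harmonised, srh5, source) are assumptions made for this sketch and are not the actual names in the released datasets.

```stata
* Hedged sketch: pool the harmonised survey and census extracts and blank
* out self-rated health for the census records so that it can be imputed.
use survey_harmonised, clear            // harmonised survey records
generate byte source = 1                // 1 = survey
append using census_harmonised          // harmonised census records
replace source = 0 if missing(source)   // 0 = census
replace srh5 = . if source == 0         // delete census self-rated health
save pooled_for_imputation, replace
```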

Table 1 Distribution of harmonised key and auxiliary imputation variables in census and survey datasets

Multiple imputation

MI is a flexible, simulation-based statistical technique for handling missing data that allows fully for the uncertainty arising from the missingness [17]. It is appealing compared with other methods as it makes estimation of variances and confidence intervals relatively straightforward. MI generates multiple sets of plausible values under the missing at random assumption [18]. Under this assumption, we consider that the missing self-rated health values from the census data are sufficiently informed by the key and auxiliary variables from both the census and survey data, along with the self-rated health observations for the survey participants. The missing self-rated health data in the census were imputed from MI models that included the main effects of all variables and all possible interactions between the key variables (age, sex, housing tenure, and English region). Four different models were explored: (a) standard logistic regression comparing very bad/bad versus fair/good/very good self-reported health; (b) Poisson regression, also based on dichotomous self-rated health, giving estimates of risk rather than odds of very bad/bad self-reported health; (c) ordinal logistic regression of the five ordered categories of self-rated health under the proportional odds assumption, which constrains the effect of each covariate to be the same across the cumulative splits between categories; and (d) multinomial logistic regression of the five-category self-rated health variable with no assumptions regarding the ordering of categories. Census records in the pooled dataset accounted for about 90% of the observations; 90 imputations were therefore carried out, in line with recommendations that the number of imputations should reflect the percentage of missing data [19]. The imputed datasets were analysed using standard techniques, with the resulting estimates and standard errors combined across all 90 imputations according to Rubin’s rules [11], details of which are presented in Additional file 1. Imputations (and analyses) were performed in Stata v14 using the mi impute command [20].
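
For illustration, the logistic variant (model a) might be set up as in the following Stata sketch; the variable names (badhealth, agegrp, sex, tenure, region, educ, marital, cob, ethnic) and the random seed are assumptions made for the example rather than the exact specification used.

```stata
* Hedged sketch of imputation model (a): logistic regression for the
* dichotomous bad/very bad indicator, with main effects of all variables
* and all possible interactions among the four key variables.
mi set wide
mi register imputed badhealth                       // missing for census records
mi register regular agegrp sex tenure region educ marital cob ethnic

mi impute logit badhealth i.agegrp##i.sex##i.tenure##i.region ///
    i.educ i.marital i.cob i.ethnic, add(90) rseed(2011)

* Analyses of the 90 completed datasets are then pooled automatically
* according to Rubin's rules, for example:
mi estimate: proportion badhealth
```

In brief, Rubin’s rules take the pooled point estimate as the average of the 90 imputation-specific estimates and the pooled variance as the average within-imputation variance plus (1 + 1/90) times the between-imputation variance.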

Assessment

A priori (see initial analysis plan [21]), the performance of the imputation approach was assessed by comparing proportions of bad or very bad self-rated health from imputed versus actual census data across all 576 combinations of the key variables (8 age groups × 2 sex categories × 4 housing tenure categories × 9 English regions), as census results are typically released as similar three- and four-way tables. For comparison, the actual census proportions were also compared with proportions derived directly from the survey (without imputation). Associations between the actual census proportions of bad health and those from the MI models (or directly from the survey) were explored graphically and using lines of best fit to these data (weighted according to the size of each age-sex-tenure-region category). Specifically, we considered the extent to which the lines of best fit deviated from the line of equality (y = x), with slope = 1 and intercept = 0, representing (theoretical) perfect agreement. We also considered the strength of the linear association between actual and imputed proportions of bad health using correlation coefficients.
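
A minimal sketch of this assessment, assuming a cell-level Stata file with one row per age-sex-tenure-region combination and hypothetical variables p_census (actual census proportion), p_imputed (imputed or survey proportion) and n_cell (cell size), might be:

```stata
* Hedged sketch: weighted line of best fit and correlation comparing imputed
* (or raw survey) proportions of bad/very bad health with the actual census
* proportions across the 576 age-sex-tenure-region cells.
regress p_imputed p_census [aweight = n_cell]     // intercept and slope vs the line of equality
correlate p_imputed p_census [aweight = n_cell]   // strength of linear association
```

The orientation of the regression (imputed or survey proportions regressed on census proportions) is assumed here to follow the “versus census” framing of the figures.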

Results

The original census and survey datasets comprised 2,848,155 and 374,218 records respectively, of which 1,390,094 (49%) and 134,717 (36%) were respondents aged 25–64, living in England, and classed as usual residents in non-communal households. In total, 634 (0.5%) respondents in the survey dataset had missing values for at least one key or auxiliary variable and were omitted from analyses, leaving a total of 134,083 survey respondents in the analytical dataset. Distributions of the imputation variables in the census and survey are presented in Table 1. Distributions were broadly similar in the two datasets, with survey respondents slightly older, more educated, and more likely to be female, to own their home, and to be married than those from the census.

Census and survey responses to questions on self-rated health are presented in Table 2. Overall, survey respondents were less positive about their health, with 78% rating it as good or very good compared with 83% of census respondents. Figure 1 presents a scatter plot of the proportion of survey versus census respondents in each of the 576 combinations of age, sex, tenure and region who rated their health as bad or very bad. The dashed line is the line of equality, representing perfect agreement between census and survey measures, while the solid line (with shaded band) shows the regression line of best fit (95% confidence interval) describing the association between the two. The intercept and slope of this line of best fit are presented in Table 3. There was a strong linear relationship between the proportions in the two datasets (correlation = 0.93; Table 3). However, the survey overestimated the proportion of respondents with bad or very bad self-rated health, as evidenced by the lack of correspondence between the regression line (intercept (95% confidence interval): 0.01 (0.00, 0.01); slope (95% confidence interval): 0.82 (0.79, 0.84)) and the line of equality.

Table 2 Overall distribution of self-rated health in original census data versus data from or imputed from survey data
Fig. 1 Comparison of proportion of bad or very bad self-rated health in original census data versus survey data

Table 3 Linear associations between proportion of bad or very bad self-rated health across 576 groups comparing original census data with data from or imputed from survey data

Similar results for the multiply imputed data are presented in Fig. 2 (Tables 2 and 3). The overall distribution of bad or very bad self-rated health imputed into the census from survey data using standard logistic or Poisson regression was very similar to that for the raw survey data (6.3% and 6.2% of imputed census data versus 6.2% of raw survey data rated as bad or very bad) and therefore differed from the original census values (5.1%). Results for the 576 combinations of age, sex, tenure and region (Fig. 2, top left and right) were also very similar to those for the raw survey data, with a strong linear relationship but general overestimation relative to the original census values (logistic intercept: 0.00 (− 0.00, 0.00); slope: 0.82 (0.79, 0.84); correlation: 0.95; Poisson: 0.00 (− 0.00, 0.00); 0.80 (0.78, 0.82); 0.95). The overall distribution of self-rated health imputed into the census using ordinal logistic regression was again more similar to the original survey data than to the original census data (6.2% bad or very bad self-rated health). Initially, it seemed that the association for the 576 categories was a better fit to the original census data (intercept: − 0.01 (− 0.01, − 0.00); slope: 1.00 (0.98, 1.03)) than that from the raw survey data. However, while there was reasonable linear agreement between values in the middle of the range, the imputed data substantially overestimated the proportion of bad or very bad self-rated health at the lower and upper ends of the distribution, and in practice a quadratic model provided a better description of the association between imputed and original census values (Fig. 2, bottom left). Results for data imputed into the census using multinomial logistic regression were again very similar to those for the raw survey data (6.5% bad or very bad self-rated health; intercept: − 0.00 (− 0.01, − 0.00); slope: 0.83 (0.81, 0.85); correlation: 0.95; Fig. 2, bottom right).

Fig. 2 Comparison of proportion of bad or very bad self-rated health in original census data versus data imputed from survey data

Discussion

Our aim was to assess whether applying MI to data from the Integrated Household Survey would provide a simple, accessible and robust means of predicting the (known) prevalence of bad or very bad self-rated health in the UK census. The major strength of our analysis is the ability to test the performance of MI against a known result; such cross-validation is often missing from assessments of methods to estimate population parameters. Our results suggest that distributions of imputed self-rated health were more similar to the original survey data than to the census data, and that standard MI offered little additional benefit over using the raw survey data directly. This highlights the importance of comparison with known data in the development of tools for enhancing routine datasets.

Although we included a wide range of predictors of self-reported health and all possible interactions among our four key variables, the imputation process was not sufficient to completely account for differences between the survey and census populations. As multiple imputation tends to perform less well with higher levels of missing observations [22], the large percentage (90%) of records in the pooled dataset that were census ones may partially explain the marginal benefit of our application. Including additional interactions with auxiliary variables might have improved the model, but this was not possible in our standard software (Stata v14) and may require more specialist software, making it unsuitable for general application. The accuracy of the current model might be improved by incorporating machine learning into the process, enabling the identification of an optimal prediction model with a rich array of higher-order terms beyond selected two-way interactions. Additionally, the adoption of a generalised linear mixed model based approach allowing for cluster-specific weighting may also enhance accuracy. Likewise, the incorporation of survey design weights in multiple imputation models could be used to improve survey-based estimates [23, 24]. However, these more complex approaches would limit accessibility. Perhaps the most important limitation, though, is that the method relies on having harmonised variables common to both datasets. In the current analysis, variables were chosen for their known associations with self-rated health and on the basis that they could be harmonised across the census and the survey, so it is possible that important variables were omitted from the current models. The strength of administrative data such as censuses lies in their representativeness, but this is often tempered by the need to restrict the number of questions asked. The range of variables available for MI will therefore be limited by availability and comparability across datasets, and this may restrict the practical applications of the approach in some circumstances.

There is concern that survey-based estimates of population parameters are not sufficiently robust to inform resource planning and policy development and assessment, especially as the sub-populations that are most sick, most disadvantaged, and have the least healthy lifestyles are increasingly under-represented [2]. In addition to limiting the generalisability of findings from analyses of survey data, the groups that are most often missing are those of greatest potential importance in determining economic and public health policy. Many surveys derive and provide general weights in order to make results from (weighted) analyses representative of the population from which the data are drawn. However, these weights are frequently based on just a few population characteristics (often simply age and sex) and may be limited in their capacity to adequately correct estimates [25, 26]. Increased access to administrative databases and other sources of “big data”, with their vastly more extensive population coverage, creates a potential opportunity to overcome these shortcomings [27, 28]. However, these datasets often do not include the range or quality of variables that surveys do; for example, the UK census does not include questions on smoking, generating a need for alternative analytical methods to bridge the gap.

Conclusion

We have explored an easily applied and accessible method that uses MI models to impute individual-level data, amenable to further modelling, from a survey into a larger but less comprehensive administrative dataset. However, our results demonstrate that the practical application of this approach is not straightforward. We do not discount its use in the context of enhancing routine datasets, but further work is clearly needed to explore its validity and application in this context; in particular, it is important to understand how to identify and develop the best imputation models and how to select the most useful surveys and variables for inclusion in them. We recommend following our example of comparing imputed survey values with values already known in the administrative data in order to assess more rigorously the performance and validity of different approaches and datasets.