Background

Like many large-scale health surveys, the Australian Longitudinal Study on Male Health (Ten to Men) used a complex sampling scheme. This choice was made because sampling the target population using a simple random sample was not feasible. Sampling theory therefore plays an important role in our study design because it provides a framework for efficiency gains [1]. In Ten to Men, the key elements of the sample design were the use of stratification, multi-stage sampling and cluster sampling to select prospective participants and invite them to take part in the study. This design has implications for the analysis of data from Ten to Men, both for inferences about population means or prevalences and for quantifying the magnitude of associations between exposures and outcomes. Such implications are, however, often poorly understood. At the extreme, views differ on whether to always adjust for aspects of the study design and sampling scheme at the analysis stage (including accounting for unequal sampling fractions using inverse-probability-of-selection sampling weights) or to never adjust. Korn and Graubard [2] give an excellent example of this controversy using the US National Health and Nutrition Examination Surveys (NHANES). At the heart of this debate is a trade-off between mitigating bias in estimation and faithfully representing the repeated-sampling variation of the corresponding estimators, so that inferences are accurate.

Our aims here are to (1) describe each of these competing elements as they relate to Ten to Men; (2) detail and discuss the calculation of inverse-probability-of-selection sampling weights; and (3) provide recommendations for analyses that acknowledge these aspects of the design. We use a continuous variable (weight in kilograms) and a binary variable (current smoking status: smoker or non-smoker) as illustrations throughout. Our attention is restricted to the baseline (i.e. prevalent) wave of data collection. Our analyses are conducted in Stata [3], but the same principles and procedures apply to other statistical packages.

Methods

Overview of the Ten to Men sampling design

Stratification

When stratification is used in a survey design, it refers to the population being partitioned into groups prior to selection of the sample [4]. Samples are then taken independently within each stratum.

The Australian Statistical Geography Standard [5] (ASGS) used by the Australian Bureau of Statistics (ABS) classifies each location within Australia as belonging to one of five levels of remoteness: “Major Cities”, “Inner Regional”, “Outer Regional”, “Remote” and “Very Remote”. It was not feasible to survey remote and very remote regions because of the travel time required for fieldworkers to recruit potential participants into the study (less than 2.3 % of the population lives in remote and very remote Australia, an area that covers most of the country), so the study was restricted to sampling from the first three strata, that is, the major cities, inner regional and outer regional areas. Inner and outer regional areas were over-sampled to ensure that questions related to regional disparities in male health could be addressed adequately. These areas therefore represented 23 % and 20 % of the sample at baseline (for inner and outer regional areas, respectively) compared with population proportions of 18 % and 9 %.
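The effect of this over-sampling on a weighted analysis can be sketched with a small calculation. The regional shares below are the figures quoted above; the major cities shares (57 % of the sample, 73 % of the population) are not stated in the text and are obtained by subtraction, under the assumption that the three sampled strata sum to 100 % in both the sample and the reference population. The function name is illustrative only.

```python
# Shares by remoteness stratum. Regional figures come from the text;
# the major-cities figures are ASSUMED (derived by subtraction from 100 %).
sample_share = {"major_cities": 0.57, "inner_regional": 0.23, "outer_regional": 0.20}
population_share = {"major_cities": 0.73, "inner_regional": 0.18, "outer_regional": 0.09}


def relative_weights(pop, samp):
    """Return population share divided by sample share for each stratum.

    Values above 1 mean the stratum is under-represented in the sample and
    would be up-weighted; values below 1 mean it would be down-weighted.
    """
    return {stratum: pop[stratum] / samp[stratum] for stratum in pop}


rel_w = relative_weights(population_share, sample_share)
for stratum, w in rel_w.items():
    print(f"{stratum}: relative weight {w:.2f}")
```

Under these assumed shares, participants from the over-sampled regional strata receive relative weights below 1 and major cities participants a relative weight above 1. The actual Ten to Men weights also reflect selection probabilities within strata and non-response, so this is only a stratum-level illustration.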

Multi-stage sampling

Sitting alongside the ASGS classification of remoteness is the ABS division of the population into “statistical areas”. The smallest units are Mesh Blocks (there are about 350,000 of these, containing on average about 75 people each), which aggregate into Statistical Area 1s (SA1s, with an average of 400 people and ranging from 200 to 800), then SA2s (averaging 10,000 people with a range of 3,000–25,000), SA3s, SA4s and finally the six Australian States and two Territories.

Ten to Men employed a multi-stage design. For the major cities stratum, SA1s were sampled first (proportional to size, where size referred to the number of boys according to the ABS 2011 Census of Population and Housing) and all households were sampled within SA1s. For the inner and outer regional strata, SA2s were randomly sampled first (also proportional to size using the same definition as that used for major cities) and then a fixed number of SA1s were randomly sampled within SA2s; at the final stage, households were sampled within SA1s. This additional step in the hierarchy of sampling SA2s for the inner and outer regional strata was introduced to reduce the distance fieldworkers had to travel.

Clustering

For all three strata, all eligible males within an eligible household were invited to participate in the study (see Table 1 in Australian Longitudinal Study on Male Health – Methods in this collection for a definition of the eligibility criteria). Thus, within a stratum, Ten to Men can be described as a cluster sample of eligible households, with SA1s defining the cluster. Households were therefore an additional level in the completely-nested hierarchy implied by the multi-stage sampling design.

Table 1 Estimated mean weight (kg) and prevalence of smoking using seven different approaches

Sample weights

The sampling design of Ten to Men implies that individuals within the major cities stratum did not have equal probabilities of selection. Because SA1s were sampled with probability proportional to size (the number of boys according to the ABS 2011 Census of Population and Housing), individuals living in SA1s with a larger number of boys were more likely to be invited to participate. Individuals in the inner and outer regional strata did, in theory, have equal probabilities of selection, because selecting a fixed number of SA1s within each SA2 effectively “cancels out” the sampling of SA2s with probability proportional to their size. In practice, however, this equality was violated by variation in the participation fractions between households, SA1s and SA2s. This variability was an issue for the major cities stratum as well.
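The “cancelling out” argument can be made concrete with a simplified sketch. It assumes an idealised two-stage design in which first-stage clusters are drawn with probability proportional to size and a fixed number of second-stage units is then drawn from each selected cluster, with the size measure counting those same second-stage units. This is a textbook simplification rather than the exact Ten to Men design, and all sizes and counts below are hypothetical.

```python
from fractions import Fraction


def two_stage_probability(cluster_size, total_size, n_clusters, n_per_cluster):
    """Overall selection probability of one unit under PPS plus a fixed take.

    Stage 1: the cluster is selected with probability proportional to size.
    Stage 2: a fixed number of units is selected within the cluster.
    """
    p_stage1 = Fraction(n_clusters * cluster_size, total_size)
    p_stage2 = Fraction(n_per_cluster, cluster_size)
    return p_stage1 * p_stage2


sizes = [120, 300, 480, 600]  # hypothetical cluster sizes
total = sum(sizes)
probs = [
    two_stage_probability(size, total, n_clusters=2, n_per_cluster=10)
    for size in sizes
]
for size, p in zip(sizes, probs):
    print(f"cluster of size {size}: overall probability {p}")
```

The cluster size appears in the numerator at the first stage and in the denominator at the second, so every unit ends up with the same overall probability regardless of which cluster it belongs to. Differential participation breaks this equality, which is why weights are still needed in the regional strata.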

Sampling weights can be used to address bias in estimation due to unequal sampling fractions and to account for non-response when estimating a population parameter. These sampling weights are calculated as the inverse of each individual's probability of participation. For inner and outer regional participants, the weights are the inverse of the product of (1) the probability of an SA2 being selected; (2) the probability of an SA1 within that SA2 being selected; and (3) the probability of an individual within an SA1 both agreeing to participate and providing usable data. For major city participants, the weights are the inverse of the product of (1) the probability of an SA1 being selected and (2) the probability of an individual within an SA1 both agreeing to participate and providing usable data. Where a stratum is under-represented in the sample compared with the population, the sampling weights will up-weight data from individuals in that stratum in the analysis. Details on the calculations of the baseline sampling weights for Ten to Men are given in Appendix 1.
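The structure of these weights can be sketched as follows. The probabilities used here are invented for illustration and are not the values from Appendix 1; the function name is likewise hypothetical.

```python
def sampling_weight(*stage_probabilities):
    """Return the inverse of the product of the stage-wise probabilities."""
    overall = 1.0
    for p in stage_probabilities:
        if not 0 < p <= 1:
            raise ValueError("each probability must lie in (0, 1]")
        overall *= p
    return 1.0 / overall


# Regional participant (hypothetical values): P(SA2 selected),
# P(SA1 selected | SA2), P(participates and provides usable data | SA1).
regional_weight = sampling_weight(0.10, 0.25, 0.40)

# Major cities participant (hypothetical values): P(SA1 selected),
# P(participates and provides usable data | SA1).
city_weight = sampling_weight(0.05, 0.40)

print(f"regional weight: {regional_weight:.1f}")
print(f"major cities weight: {city_weight:.1f}")
```

A weight can be read as the number of people in the target population that a participant represents, so a participant with a small overall probability of participation counts for more in a weighted analysis.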

Results and discussion

Implications for estimating population means, prevalences and totals

Estimating means, prevalences or totals from a complex survey as though they were generated from a simple random sample has the potential to generate biased estimates, and the stated precision of these estimates may differ from the variability that we would observe in them under repeated sampling. The multi-stage sampling and selection of household clusters must therefore be acknowledged and accommodated when estimating population parameters. This can be done either by specifying a full multi-level model (using a generalised linear mixed model) or by using a set of “survey” commands, both of which are available in most standard statistical software packages (including Stata). The multi-level model approach allows us to account for all levels of the hierarchy (i.e., individuals nested within households, households nested within SA1s, etc.) but does not allow the sample weights to be incorporated into the analysis at the level of the individual participant (only group-level weights are allowed, at least in the suite of mixed-model procedures we considered in Stata, e.g., mixed, melogit and meglm). This means the estimates generated from this procedure may be biased.

The survey commands (at least those implemented by Stata and other major programs) only allow proper accounting for clustering at the top level of the multi-stage sampling hierarchy. This distinction is relevant in Ten to Men because for major cities, SA1s sit at the top of the hierarchy, whereas for the inner and outer regional strata, the larger SA2s were the first unit to be sampled. However, these commands do allow sample weights to be specified in the analysis. Consequently, no single procedure implemented in the commonly used software platforms can both account for the full multi-stage design of the survey (which affects the calculation of the variance estimates) and produce unbiased estimates of population parameters when using data from all three strata (which the weights are intended to address).
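To make the behaviour of such survey procedures concrete, the sketch below computes a weighted mean together with a standard error based on Taylor linearisation, using between-PSU variation within strata and no finite population correction. This is the standard textbook estimator for a stratified cluster sample, written out in plain Python; it is not the authors' Appendix 2 code, and the function name, variable names and toy data are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt


def survey_mean(y, w, psu, stratum):
    """Weighted mean and linearised standard error (top-level PSUs only)."""
    total_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w

    # Sum the linearised scores w_i * (y_i - mean) / sum(w) within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, p, h in zip(y, w, psu, stratum):
        psu_totals[h][p] += wi * (yi - mean) / total_w

    # Add up the between-PSU variance contribution of each stratum.
    variance = 0.0
    for totals in psu_totals.values():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        z_bar = sum(totals.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in totals.values())
    return mean, sqrt(variance)


# Toy data (hypothetical): body weight in kg, sampling weight, PSU, stratum.
y = [80, 82, 90, 88, 75, 77, 85, 83]
w = [1, 1, 1, 1, 2, 2, 2, 2]
psu = [1, 1, 2, 2, 3, 3, 4, 4]
stratum = ["a", "a", "a", "a", "b", "b", "b", "b"]

mean, se = survey_mean(y, w, psu, stratum)
print(f"weighted mean {mean:.2f} kg, linearised SE {se:.2f} kg")
```

Only the level passed as the PSU enters the variance calculation; any clustering below that level (for example, households within SA1s) is absorbed into the PSU totals rather than modelled explicitly, which is the limitation described above.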

We now demonstrate four approaches that reflect different ways of dealing with these issues when estimating a population parameter. Table 1 shows the mean weight (in kilograms) and a 95 % confidence interval (CI) for the corresponding population parameter calculated with no adjustment for the survey design or the sample weights (row A), no adjustment for the survey design but using sample weights (row B), using multi-level modelling without adjustment for the weights (row C) and using survey commands that allow for different combinations of adjustment for the multi-stage design and stratification as well as inclusion of the weights (rows D to G). Rows D and E present results using the SA1 as the primary sampling unit (PSU, the top level of the sampling hierarchy) whereas rows F and G use the SA2 as the PSU. Rows D and F present estimates that are adjusted for stratification, whereas rows E and G present estimates that are not. Estimates of smoking prevalence obtained with the same analytic methods are also provided in Table 1, with the exception that the result from a multi-level logistic regression is excluded because such models do not estimate a parameter that has a population-level interpretation [6] and are thus not directly comparable to the other estimates.

As expected, the estimates of the population mean weight and the confidence intervals differ depending on the extent to which the sampling design characteristics are accommodated by the estimation procedure. When the population mean is estimated under the assumption of simple random sampling (row A), the mean weight is 83.9 kg (95 % CI 83.6 to 84.2 kg). Repeating this analysis but incorporating adjustment for the sample weights (row B) gives a mean weight of 81.4 kg (95 % CI 80.9 to 81.9 kg). Using a multi-level modelling strategy that adjusts for the correlated observations within households, SA1s and SA2s (but does not adjust for stratification or sample weights) gives a mean of 84.0 kg (95 % CI 83.5 to 84.5 kg), similar to that of the unweighted analysis (row C). This mean appears to be biased, probably because it does not account for the sample weights. Population estimates that account for the top level of the sampling hierarchy as well as weighting (and either with or without adjustment for stratification) are all identical to the level of precision reported (rows D to G). For example, when the SA1 is used as the PSU and the estimates are adjusted for stratification, the mean is 81.4 kg (95 % CI 80.8 to 81.9 kg). This estimate is the same for all other combinations of PSU (SA1 vs SA2) and adjustment for stratification (no adjustment vs adjustment).

The results for analysing a binary variable, current smoking status, paint a similar picture. The estimate of the population prevalence is highest when no adjustment is made for the sampling scheme. It is lower when adjustments are made for this, with no appreciable difference between using the SA1 or the SA2 as the PSU or making adjustment for stratification.

Implications for estimating associations

Estimates of the association between variables (e.g., self-rated health and weight or smoking status) may also be affected by how the sampling scheme is treated in the analysis [7, 8]. Most modern statistical programs have commands that enable linear, logistic and other multivariable regression techniques to be used that account for stratification, multi-stage sampling and sample weights. The question that arises is: when should these commands be used? The conditions under which clustering can be ignored in the analysis of data are quite restrictive. In general, to be able to ignore clustering without producing variance estimates (and therefore confidence intervals) that are too narrow, we require at least that the distribution of the outcome of interest within given levels of risk factors and covariates does not differ between clusters [2]. In most scenarios it is far from obvious that this condition is satisfied: it is difficult to test empirically and will be untestable for unmeasured risk factors, covariates and confounders. Moreover, introducing adjustments for a stratified, multi-stage, clustered sampling scheme and for sample weights to accommodate unequal sampling fractions can lead to estimates that are highly variable [4]. This has implications for detecting associations between the exposure and an outcome. Against this, theoretical work by Scott and Holt [7] and by Neuhaus and Segal [8] shows that estimates of measures of association in linear and logistic regression models are generally unbiased even if we fail to account for clustering in the analysis. Lumley [9] makes a case for not using sample weights, based on the argument that regression models often include confounders and covariates that explain the variation in weights. Including these variables adjusts for any distortions in estimating the magnitude of the association between the exposure and the outcome that would have resulted from ignoring unequal sampling fractions.

It is true that for some population-level measures of association the unbiasedness and variability of their estimates will not depend on whether or not the analysis incorporates the stratum-specific sampling fractions, but a full description of such scenarios is beyond the scope of this paper (see Lumley [9]). It is worth noting, however, that the variance of the measures of association and, as a consequence, the standard errors, confidence intervals and p-values calculated from them may be incorrect. Scott and Holt [7] noted that the extent of this mis-specified precision is generally less severe than when estimating means and prevalences.

We explore these issues with data examining the association between self-rated health and weight using linear regression. The measure of interest is the difference in mean weight (in kilograms) between two groups: those reporting excellent or very good health and those reporting good, fair or poor health. We compare results under a variety of conditions: where there is adjustment for the multi-stage design (no adjustment, adjustment for all stages of the hierarchy using multi-level modelling, adjustment using the SA1 as the PSU, adjustment using the SA2 as the PSU), adjustment for stratification (no adjustment, adjustment using the stratification variable as a covariate, adjustment using the survey command), and use of sample weights (yes or no). We also examine the association between self-rated health and smoking status using logistic regression, where the effect size of interest is an odds ratio. We again omit the results from analyses that use a multi-level logistic model for the same reasons discussed in the previous section.
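The effect of the sample weights on such a comparison can be illustrated with a minimal sketch. With a single binary covariate, the slope from a (weighted) linear regression equals the difference between the two (weighted) group means, so the point estimate can be computed directly. The data and names below are hypothetical, and the sketch deliberately returns the point estimate only; a standard error that respects the design would need the survey or multi-level procedures discussed above.

```python
def weighted_mean(values, weights):
    """Return the weighted mean of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def mean_difference(y, group, w=None):
    """Mean of y in group 1 minus mean of y in group 0, optionally weighted."""
    if w is None:
        w = [1.0] * len(y)
    y1 = [yi for yi, gi in zip(y, group) if gi == 1]
    w1 = [wi for wi, gi in zip(w, group) if gi == 1]
    y0 = [yi for yi, gi in zip(y, group) if gi == 0]
    w0 = [wi for wi, gi in zip(w, group) if gi == 0]
    return weighted_mean(y1, w1) - weighted_mean(y0, w0)


# Hypothetical data: weight in kg; group 1 = excellent or very good health,
# group 0 = good, fair or poor health; w = sampling weight.
y = [70, 80, 90, 100]
group = [1, 1, 0, 0]
w = [1, 3, 1, 1]

unweighted = mean_difference(y, group)
weighted = mean_difference(y, group, w)
print(f"unweighted difference: {unweighted:.1f} kg")
print(f"weighted difference: {weighted:.1f} kg")
```

In this toy example the weighted difference is smaller in magnitude than the unweighted one because the heavier member of group 1 carries a larger weight. Whether weighting moves an estimate towards or away from zero in real data depends on how the weights relate to exposure and outcome.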

In an analysis that makes no adjustment for the multi-stage design or for stratification or weighting (Table 2, row A), the mean difference between the two groups is −5.1 kg (95 % CI −5.8 to −4.5 kg). That is, those who describe themselves as having very good or excellent health are, on average, 5.1 kg lighter than those who have good, fair or poor health. Adjusting for stratification by entering remoteness into the model as a categorical variable, using a series of indicator variables (row B), also gives a mean difference of −5.1 kg (95 % CI −5.7 to −4.4 kg). Repeating the analysis in row A but with the use of sample weights to adjust for bias (row C) gives a smaller difference of −4.4 kg, but with a wider confidence interval than observed previously (95 % CI −5.6 to −3.3 kg). Adjustment for stratification makes only a small difference to this result (row D).

Table 2 Estimated difference in mean weight and estimated odds ratios between participants with excellent or very good health and participants with good, fair or poor health

Repeating the analysis to account for all stages of sampling using a multilevel model (rows E and F) gives a mean difference of −4.9 kg (95 % CI −5.5 to −4.2), with further adjustment for stratification giving a difference of −4.8 kg (95 % CI −5.5 to −4.2). As with estimating population prevalences using multi-level models, it is not possible to easily account for the sample weighting in this context.

The final four rows in Table 2 show results obtained using the survey commands to estimate the population mean difference. When SA1s are defined as the PSU and sample weights are used (row G), the mean difference between the two groups is −4.4 kg (95 % CI −5.5 to −3.2 kg). When no weights are used (row H), the difference is −5.1 kg (95 % CI −5.8 to −4.4 kg). Using the SA2 as the PSU gives similar results (rows I and J).

Thus, the estimate of the mean difference ranges from −4.3 kg to −5.1 kg. Taken as a whole, these results suggest that it is the adjustment for the sample weights that has the biggest impact on the results, with the adjustments for the sampling hierarchy and stratification having relatively minor influences on the estimate of the effect size. Nonetheless, all analyses would lead to the conclusion that the mean weight differs between the two groups, with those who have excellent or very good health being 4 to 5 kg lighter than those who have good, fair or poor health. This suggests that the way in which the sample design is accounted for makes little difference to the conclusions drawn about this measure of association on this occasion. This is supported by the second analysis in Table 2, which shows that the odds of being a current smoker are approximately 60 % lower for those with excellent or very good health compared with the odds for those with good, fair or poor health, regardless of the way in which the study design is accommodated in the analysis.

Conclusion

Analyses of baseline data from Ten to Men will require explicit adjustment for the sampling design (through the use of sampling weights or procedures for clustering) in order to generate unbiased estimates with reliable measures of precision that reflect their variability under repeated sampling. Which adjustments to apply will depend largely on the particular research question and the proposed statistical analysis. While we have illustrated these concepts in the context of the Ten to Men study, the issues are relevant to all clustered survey designs.

For estimates of population prevalences and totals, the sampling design (including sample weights) should be adjusted for: unadjusted estimators will most likely be biased and their precision overstated, because the sampling variability depends on the sampling fractions and the hierarchical structure of the data. The remaining issues are how to define the PSU (SA1 or SA2) and whether or not to adjust for stratification. Regarding the PSU, our results show little difference in practice between using the SA1 or the SA2 as the PSU. Our recommendation is therefore to treat SA1s as the PSU. Similarly, while adjustment for stratification made no appreciable difference in this instance, we also recommend adjusting for stratification. In support of this, Appendix 2 contains the variable names and the Stata code (using the svy suite of commands or its equivalent in other packages) that allow this recommendation to be implemented. With regard to measures of association between exposure and outcome, it is less clear whether ignoring the sampling design and, in particular, not using weights in analyses will lead to biased estimates. On balance, we favour an approach that respects the sampling design and therefore incorporates this information into the calculation of any effect sizes.

Some researchers may find it helpful to conduct sensitivity analyses, where they compare unadjusted and adjusted estimates of prevalence and associations to determine which of the results are sensitive to the extent that the sampling scheme is accommodated in the analysis. We support this, with the proviso that a statistical analysis plan be prepared prior to commencing analysis (see Thomas and Peterson [10] or Rubin [11] for excellent discussions on the value of doing this in observational studies).