Estimation of a cluster-level regression model under nonresponse within clusters

When sample surveys are clustered and subject to non-response, it is possible to study cluster-level association between response rates and cluster-level quantities derived from survey variables. The existence of association may suggest informative nonresponse with possible biasing effects. In this paper, this problem is studied for the case where the aim is to fit a cluster-level regression model. Two possible underlying models for nonresponse with potential biasing effects are considered. Alternative estimators of regression coefficients under these models are proposed. The properties of these estimators are studied in two simulation studies and with real data from a survey of employees, where the clusters consist of workplaces.


Introduction
A feature of nonresponse in clustered survey data is that it is possible to study cluster-level association between response rates and aggregate statistics, such as means or proportions, for survey variables of interest. Thus, if p i denotes the response rate among elementary sample units in cluster i andȳ ri denotes the mean of a variable Y among responding units within cluster i then it is possible to study the association between these two quantities across clusters, perhaps conditional on some other cluster-level factors. In contrast, no equivalent association can be observed at the elementary unit level (in unclustered data) since Y is missing for nonresponding units.
The occurrence of such cluster-level association between p i andȳ ri may be suggestive of some kind of informative nonresponse. In this paper, we consider this issue when the objective is to fit a regression model at the cluster level, as may be of scientific relevance when the clusters are of analytic interest. As a motivating example in Sect. 3, we consider a survey of employees, where the clusters consist of workplaces and there is analytic interest in how the well-being of employees at a workplace depends upon different kinds of innovations at the workplace. Another example is a survey of hospital patients about the quality of care received, where there may be interest in analysis at the hospital level but nonresponse may arise at the individual patient level [9].
We shall be interested in the case where testing for inclusion of p i or some function of it as a covariate in the model may be used as some kind of diagnostic for informative nonresponse. We shall introduce two models of nonresponse mechanisms which might provide explanations for such association and which lead to bias in the estimation of the regression coefficients if the nonresponse is ignored by simply running the regression on respondent data. Moreover, we shall consider possible ways of controlling for this bias by including p i or some function of it as a covariate in the model.
The basic approach of ordinary least squares (OLS) estimation using respondent data will not lead to biased estimation of the regression coefficients if nonresponse is conditionally independent of Y given the explanatory variables included in the model (and if unequal sample inclusion probabilities can be ignored). In standard (unclustered) settings, it can be difficult to detect departures from this conditional independence condition without strong modelling assumptions [8, sect. 1.3] . In our clustered setting, however, we suggest that the use of p i , or some transformation of it, as an auxiliary explanatory variable can provide a relatively simple way to detect at least some forms of informativeness in the nonresponse. In this paper we consider two possible models which might underly such an effect and, for which, the inclusion of the auxiliary variable in the regression offers some control for nonresponse bias under departures from the conditional independence assumption.
Nonresponse in two-stage surveys can operate at either stage and, in this paper, we focus on the problem when nonresponse occurs at the second stage and suppose that data from all sampled clusters are available, even though data for elementary units are missing from many if not all sampled clusters. For simplicity, we suppose that the explanatory variables in the regression are defined at the cluster level and are not missing.
We present a simulation study in which we consider the performance of estimators based on each model under the assumption that the data are generated from either the assumed model or the alternative model. We also present a real application using data from a survey of workplace employment relations in Great Britain.
The analytic focus of this paper differs from the main literature on clustered survey nonresponse which has considered problems of estimation of finite population parameters, such as means or totals, and associated weighting and imputation questions. Thus, Yuan and Little [12,13] proposed model-based inference approaches for both unit and item nonresponse; Skinner and D'Arrigo [11] and Kim et al. [6] considered weighted estimation; Shao [10], Haziza and Rao [3] and Lago and Clark [7] considered imputation.
The formal framework for the paper is set out in Sect. 2 and the motivating example is given in Sect. 3. Possible models which could account for association betweenȳ ri and p i are introduced in Sect. 4 and estimators of regression coefficients based on these models are proposed in Sect. 5. The properties of these estimators are studied in simulation studies in Sect. 6 and with real data from the motivating example in Sect. 7. Some concluding comments are given in Sect. 8.

Basic set-up and regression model of interest
We consider a clustered population, containing N clusters with M i elements in cluster i = 1, 2, . . . , N . We suppose that two-stage sampling is employed, where n clusters are selected and m i elements are selected in the ith sampled cluster (i = 1, 2, . . . , N ). Each stage of sampling may involve simple random sampling, for example, but we discuss the role of the sampling scheme more generally later in this section. Without loss of generality, we write the sampled clusters as i = 1, . . . , n and the sampled elements in cluster i as j = 1, . . . , m i . We consider regression analysis for an outcome variable Y and a vector of covariates x = (1, x 1 , . . . , x k ) . We suppose that Y is defined at the element level with y i j denoting the value of Y for the jth population element ( j = 1, 2, . . . , M i ) in the ith cluster j=1 y i j denoting the population mean of Y in the ith cluster. We suppose that interest focusses on the dependence ofȲ i on a vector of covariates x, defined at the cluster level with x i denoting the value of x for the ith cluster.
We define the regression model of interest by where E m (.) denotes expectation with respect to a model and β is the vector parameter of interest. The expectation is implicitly taken to be conditional on x i . We write R i j = 1 if y i j is observed and R i j = 0 if y i j is missing as a result of nonresponse by element j in cluster i for i = 1, . . . , n and j = 1, . . . , m i . We suppose x i is always observed for i = 1, . . . , n. We denote the number of respondents in cluster i by r i = m i j=1 R i j and the associated response rate by p i = r i /m i . We shall adopt a purely model-based approach to inference about β in this paper. In particular, as our simplest approach to estimating β, we replaceȲ i by the respondent mean y ri = m i j=1 R i j y i j /r i , where we assume r i ≥ 1, and estimate the model in (1) using ordinary least squares (OLS) to regressȳ ri on x i . We refer to the resulting estimator of β as the OLS estimator. We shall allow for the clustered structure of the population in variance estimation by bootstrapping at the cluster level.
There is a large literature on the role of the sampling scheme in inference about regression models. See e.g. Chambers and Skinner [2]. In this paper, we shall assume that the nature of the sampling scheme is such that it can be ignored for inference about β (beyond the allowance for clustering in bootstrap variance estimation and for nonresponse) and, in particular, sampling weights can be ignored. The implicit assumption here is that the sample inclusion probabilities are unrelated to y i j given x i . We touch on this point again at the end of the paper. Moreover, we shall not explore standard weighting or imputation methods which might be used to handle the problem of nonresponse here. We suppose that nonresponse will be addressed by appropriate choice of the regression model and associated estimation methods.

Example: workplace employment relations survey
To illustrate the set-up, we describe here a regression analysis using data from the 2004 Workplace Employment Relations Survey (WERS) [5], where the elementary units consist of employees and the clusters consist of workplaces. The broad analytic objective is to study how the well-being of employees at a workplace is affected by innovations in working practices at the workplace. Our analysis is based on a much fuller consideration of how innovation affects worker well-being, presented in Bryson et al. [1].
A sample of workplaces in Great Britain with at least 5 employees was selected, with faceto-face interviews conducted with senior managers with responsibility for employee relations and personnel matters. These managers then distributed self-completion questionnaires to 25 randomly selected employees at the workplace (or to every employee in workplaces with 5-25 employees). The response rate of employees to this questionnaire was about 60%. We consider here data on 13,500 employees at 1238 workplaces on a job satisfaction variable y i j , derived from responses by employees to questions about satisfaction with various aspects of their job and an innovations variable x 1i , derived from responses by the manager to questions asking about changes initiated by management at the workplace. These variables are described in more detail in Sect. 7. We follow [1] in using only data on private sector workplaces.
Fitting the workplace-level regression model in (1) withȳ ri as dependent variable and just x 1i and an intercept as independent variables using OLS gives an estimated coefficient of x 1i as −0.29 (with standard error 0.06). This negative association between innovation and job satisfaction was also found by Bryson et al. [1] in their much fuller analyses including other explanatory variables as well as an analysis using instrumental variables to control for the possible endogeneity of innovation. If we also include p i as an explanatory variable in our linear regression model, we find that it has a least squares coefficient 0.71 (standard error 0.35), differing significantly from 0 at a 0.05 level. The positive coefficient indicates that workplaces with higher response rates to the employee questionnaire tend to have higher levels of job satisfaction, which may be plausible.
The concern here is that if nonresponse is informative, with the probability of response increasing as y i j increases, will this bias the estimation of the regression model of interest and in what way?

Models for nonresponse
In this section, we begin by formalising the notion of informative nonresponse in our context and then consider two models of the nonresponse mechanism which might account for association betweenȳ ri and p i , conditional on x i , as observed in the application in the previous section.
We define models for nonresponse in terms of the relationship between the random vectors y i = (y i1 , . . . , y im i ) and R i = (R i1 , . . . , R im i ) for i = 1, . . . , n. Note that noninformative sampling is assumed so that the distribution of y i j is the same whether i j is in the sample or not. Moreover, we assume that the pairs (y i , R i ) are independently distributed for different clusters i = 1, . . . , n.

Noninformative nonresponse
We say that the nonresponse is noninformative if y i and R i are conditionally independent given x i . It follows from (1) that under this condition we have and that the OLS estimatorβ is unbiased for β.

Normal selection model
As our first possible source of informative nonresponse, we formulate a two-level selection model following Heckman [4]. See also Little and Rubin [8, sect. 15.4]. This model is related to the parametric cluster-specific nonignorable nonresponse model in Yuan and Little [12].
To model the response outcome R i j , we introduce a variable u i j so that R i j = 1 if u i j > 0 and R i j = 0, otherwise. We then specify a model for both y i j and u i j as: where is a vector of covariates which are assumed to influence nonresponse and may include covariates in x i . It is assumed that the disturbance terms in (4) and where the distribution in (6) is taken to be conditional on x i and z i . Nonresponse is noninformative if y i and R i are conditionally independent given x i and z i , which arises if σ δ = 0, using (4), (5) and (6). In this case, and assuming (1) holds, the OLS estimator is unbiased for β, as in Sect. 4.1. In general, however, this will not be the case. Thus, we may write From (6) we can write, i j = σ δ σ −2 Following Heckman [4], we have is the probability density function of the standard normal distribution and (.) is the cumulative distribution function of this distribution.
We might view the term λ(z i ψ) in (9) as a missing auxiliary variable which could induce bias in the OLS estimatorβ O L S and could be 'proxied' by p i if p i is included as an extra explanatory variable when fitting model (1) to respondent data. To explore this idea, note that we might express p i approximately as p i ≈ E(R i j ) and we have Hence, we might approximate the term λ(z i ψ) in (9) by λ( −1 ( p i )), i.e. a transformation of p i , and this could provide the basis of one explanation for the significance of p i when it is added into the regression in Sect. 3.

A simple informative nonresponse model
We next consider a simple informative nonresponse model, which may be considered as a special case of the previous selection model or as a simple version of a pattern-mixture model considered in Little and Rubin [8,Example 15.10]. We suppose that for some constant δ, that is that respondents and nonrespondents differ in their expected value of y i j by a fixed amount, given the values of the covariates. When nonresponse is noninformative we have δ = 0, but nonresponse is informative in general. It follows that this is a special case of the selection model by noting that, under this model, we may show that E( To explore the consequences of this model, define the nonrespondent mean of y i j bȳ and note that we may write the sample mean of y i j as It follows, using (11), that we may write and hence, using (1), that Analogous to the way we viewed λ(z i ψ) in (9), we may view (1 − p i ) in (13) as a source of bias in the estimation of β arising from the simple informative nonresponse model.

Estimation of β and testing of informativeness
It follows from the discussion in the previous section that the OLS estimatorβ O L S will, in general, be biased under either of the two informative nonresponse models considered there unless σ δ = 0 or δ = 0. We now consider how we might seek to remove this bias by constructing estimators based upon each of these models. In both cases, our approach is to include an additional covariate to control for the selection effect.
One approach is to use the simple informative nonresponse model, where it follows from (13) that we just need to include 1 − p i as an additional covariate and regressȳ ri on x i and 1 − p i . We refer to the resulting estimator of β as the simple informative estimator. This is similar to a method discussed in Yuan and Little [13] in which an estimated cluster-level response rate is included as a covariate in a model used to adjust for nonresponse. We next turn to the normal selection model and define a two-step estimator of β (c.f. Heckman [4]) as follows Step 1. Noting that Pr(R i j = 1) = (z i ψ) from (10), obtain an estimatorψ of ψ by probit regression of R i j on z i . Step 2. Calculate the estimated inverse Mills ratio, plug this into (9) and regress y i j on x i and this estimated inverse Mills ratio to obtain estimators of β and c.
A simpler version of this estimator, which is not dependent on the choice of z i , is obtained by using the large m i approximation (based on (10)) and replacing λ(z iψ ) by λ( −1 ( p i )) in the two-step approach. We refer to this as the p iapproximate two step estimator.
An approximate maximum likelihood (ML) estimator is obtained as follows. Under the working assumption that the observations are independent, the log likelihood for the observed data and the model in (4) where f (. | .) denotes the relevant conditional probability distribution. This log likelihood may be maximised numerically using the facts that f ( . Evaluation of the log likelihood requires evaluating Pr(u i j > 0) for all cases (i, j) and evaluating the probability density function of N (x i β, σ 2 ) and Pr(u i j > 0|y i j ) for all cases with R i j = 1.
We shall estimate the standard errors of each of the above point estimators of β using a bootstrap approach with 1000 replicates in which the sampled clusters i = 1, ..., n are resampled by simple random sampling with replacement.
The above estimation methods imply approaches to testing the informativeness of the nonresponse. Lettingδ denote the estimator of δ in (13) using the simple informative estimator, the 't-statistic' obtained by dividingδ by its bootstrap standard error will have a standard normal distribution under noninformativeness and may be used to test this assumption. Note that the validity of this null distribution does not depend on the model assumption in (11) but only on the general assumption in (2). Similarly, it is possible to construct a test from the t-statistic formed by dividing the two-step estimator of c in (9) by its bootstrap standard error. The validity of the null distribution of this test depends not only on the assumption in (2), but also on the assumption that z i does not feature on the right hand side of (4) (other than as a component of x i β).

Simulation studies
We now present two studies of the properties of the estimators introduced in the previous section, one in which values are generated from the normal selection model and one with values generated from the simple informative nonresponse model.

Study 1 based on normal selection model
Here we vary the correlation ρ = σ δ /(σ σ δ ), which governs the degree of informative missingness in the model via (6) and fix the scale parameters at σ = 3 and σ δ = 1. We set N = 1, 000 and take various combinations of the sample sizes n and m i , assuming the latter are constant with m i = m. We set x i = (1, x 1i ) so that k = 1 and similarly z i = (1, z 1i ) . The following simulation steps are repeated 10,000 times.
Step 1 Generate i j and δ i j from a bivariate normal distribution following (6).
Step 2 Generate x 1i and z 1i from a bivariate normal distribution with means zero, variances one and correlation 0.5.
Step 3 Determine y i j from (4) using values of i j from step 1 and x 1i from step 2 and β 0 = 0, β 1 = 1.
Step 4 Similarly, determine u i j from (5) and values of δ i j from Step 1 and z 1i from Step 2 and γ 0 = 0, γ 1 = 1. These values imply an expected overall response rate of 50 %.
Step 5 Compute values of the alternative estimators defined earlier.

Study 2 based on simple informative nonresponse model
We again set N = 1000 and x i = (1, x 1i ) so that k = 1. Now we vary the quantity δ in (11) which governs the degree of informativeness. We repeat the following steps, again 10,000 times.
Step 1 Generate x 1i and z 1i as in Study 1, and then set Step 2 Generate R i j from P(R i j = 1) = π i .
Step 4 Compute values of the alternative estimators defined earlier.
The model in (17) may be viewed as a pattern mixture model in the sense that it generates R before generating y|R in order to determine (R, y) and it ensures that (11) holds. We also undertook a similar study where p i in (17) is replaced by π i and obtained very similar results.

Results of study 1
We present results for two choices of n and m: n = 20, m = 25 in Table 1 and n = 100, m = 5 in Table 2. It seems important to consider different values of m since the 'information' in p i may be expected to decline as m decreases and we would expect this to affect the relative performances of the estimators. The first choice of m corresponds to the maximum value of m i in the motivating application. Results for the estimators with least absolute bias, variance and mean squared error (MSE) are presented in bold. With 10,000 runs, the simulation error is small. The simulation standard errors for the means of the estimates of β 0 and β 1 in Tables 1 and 2 range from 0.002 to 0.004.
In Table 1 we see that the OLS estimators of β 0 and β 1 , as expected, display bias when ρ = 0. All the other estimators of these parameters tend to have less bias when ρ = 0 and for all these other estimators (unlike the OLS estimator) the reduced bias tends to be dominated by the standard error when considering MSE. The OLS estimator does, however, have lower variance than the other estimators, at least for smaller values of ρ and it is the preferred estimator when ρ = 0. The OLS estimator also has MSE comparable to that of the other estimators when ρ = 0.2. For larger values of ρ, however, the OLS estimator tends to have a much higher MSE. The other estimators have broadly similar performance across the range of values of ρ here. Focussing on β 1 which is of primary interest, the approximate ML estimator tends to perform slightly better in terms of MSE when ρ = 0, followed by the two step estimator. Table 2 shows some similarity of results to Table 1 but some differences. When ρ = 0 the OLS estimator is again biased and all the other estimators reduce this bias. However, it is no longer the case that the bias is dominated by the variance for the other estimators. The bias for the simple informative and p i -approximate two step estimators tends to be worse than that of the two-step and approximate ML estimators and the bias of the former estimators is particularly pronounced when ρ = 0.5 or 0.8. The poor performance of the The true values are β 0 = 0, β 1 = 1 Figures in bold face identify the estimator with least absolute bias, variance or MSE, depending on the column simple informative and p i -approximate two step estimators when m is as small as 5 may be attributed to the dependence of these estimators on p i , which will be very noisy for such a small value of m. The two-step and approximate ML estimators tend to perform better than the other estimators for larger values of ρ.

Results of study 2
We consider here the results only for n = 20, m = 25, as in Table 1. Table 3 shows that, as in Study 1, the OLS estimators of both β 0 and β 1 are biased for each non-zero value of δ considered here, with bias increasing as δ increases.
Using the simple informative estimator removes the bias but with inflated variance. The p i -approximate two-step estimator behaves similarly to the simple informative approach. The two-step estimator performs a little worse than these estimators. The approximate ML estimator has more bias than the other estimators, excluding OLS, but the bias is dominated by the variance for these sample sizes and, overall, this estimator performs as well as the simple informative estimator in terms of MSE.

Application to workplace employment relations survey
We now apply the new methods devised in Sect. 5 to the analysis using WERS survey data introduced in Sect. 3. The outcome variable y i j is a job satisfaction measure, based on responses of employees to the question " How satisfied are you with the following aspects of your job?" for the following eight aspects: achievement you get from your work; the scope for using your own initiative; the amount of influence you have over your job; the training you receive; the amount of pay you receive; your job security; the work itself; the amount of involvement you have in decision-making at this workplace. Responses on each of these eight aspects were recorded on a 5-point Likert scale from very satisfied to very dissatisfied, coded from −2 to 2, and then summed to give an overall measure varying between −16 and 16.
As our primary independent variable of interest x 1i , we considered a number of workplace innovation variables, following Bryson et al. [1]. We decided to use a labour innovations variable here, since this was found by Bryson et al. [1] to be most strongly related to job satisfaction. This variable is obtained from responses from the manager at each workplace to the question "Over the past two years has management here introduced any of the changes listed on this card?", where the four changes listed are: changes in working time arrangements; changes in the organisation of work; changes in work techniques or procedures; introduction of initiatives to involve employees. The labour innovations variable x 1i is then defined as the number of positive responses to these questions and thus ranges from 0 to 4. In addition to considering regression models with just x 1i and an intercept as covariates, we also considered models with a much richer vector x i , including also a series of control variables, following Bryson et al. [1]. These consist of both workplace level variables, such as industry, union membership, workplace employment and the local unemployment rate, obtained from the manager or from official statistics, and workplace means of individual level variables, such as gender, age, academic qualifications, occupation and disability, obtained from employees.
We considered four of the estimators described in Sect. 5: the OLS estimatorβ O L S and the simple informative, two-step and p i -approximate two step estimators. In the case of the two-step estimator, it is necessary to specify the variable z i and for this we used the (population) number of employees in the workplace, as a variable with the potential to be related to non-response. We did attempt to use the approximate ML estimator but do not report on results since it frequently failed to converge. The standard error of each estimator of β was estimated from a bootstrap approach which resampled the clusters with replacement 1000 times.  Table 4 presents estimates of the regression model of job satisfaction on the labour innovation variable. As noted in Sect. 3, higher levels of innovation were associated with lower levels of job satisfaction. The covariate (1 − p i ) was found to be significant (at the 0.05 level) in the simple informative estimator, both when the only covariate consists of labour innovation and when control variables were included in the covariate vector too. However, the estimated coefficient of labour innovation is unchanged by the inclusion of the covariate (1 − p i ) although the standard error is slightly increased, as might be expected.
The coefficient of the estimated inverse mills ratio variable in the two-step estimator was also significant at 0.05 level, although it was not significant when control variables were also included in the model. The estimates of β 0 and β 1 are different for this estimator, especially the intercept which has a greatly increased standard error. The correlation between p i and z i here is weak (only −0.05) and the fact that the coefficient of the estimated inverse mills ratio becomes insignificant once control variables are included suggests that the two-step method may be capturing something other than the effect of nonresponse and hence produces somewhat different results. The weak correlation between p i and z i may also be a reason behind the non-convergence of the ML estimator.
The p i -approximate two step estimator makes no use of the variable z i and performs similarly to the simple informative estimator, although with a larger standard error for β 1 . The coefficient of λ( −1 ( p i )) differs significantly from zero, whether or not the additional control variables are included in the model.

Conclusion
In this paper we have studied how the inclusion of a cluster-level non-response rate 1 − p i as a covariate in a cluster-level regression may be used to detect certain kinds of informative nonresponse. In our application, the coefficient of 1 − p i was found to differ significantly from zero. Such a finding might arise because 1 − p i is acting as a proxy for some omitted cluster-level variable, but in our study we found the coefficient still differed significantly from zero, even after including many control variables which might be expected to influence the outcome. We also showed by theory and through our simulation study that the inclusion of 1 − p i as a covariate can reduce the bias of the OLS estimator of the coefficients of interest under certain informative nonresponse models. Whilst it can increase standard errors, we still showed in the simulation study how the mean squared error can be reduced.
For comparison, we also considered a form of normal selection model, as in Heckman [4]. This was used both as a basis for a simulation study to assess the use of 1 − p i as a covariate and also to construct alternative estimators. Inevitably, it was possible to improve on the use of 1 − p i as a covariate, if the model was correct and the covariate vector z i used to model nonresponse in this model is known. However, the gain was modest unless the correlation ρ is large and this approach is more complex given the requirement to specify the vector z i .
We also showed how we could approximate the selection model approach and avoid the need to specify z i by incorporating a non-linear transformation λ( −1 ( p i )) of p i as a covariate. We found this approach performed similarly to including 1 − p i as a covariate, both in our simulation study and in our application, presumably because the transformation is fairly linear across the range of values of p i that we used. The simulation study showed a very slight advantage of the non-linear transformation.
We have not considered the impact of weighting in this paper. The procedures we have studied should extend naturally to take account of weighting for unequal sample inclusion probabilities in the definition of p i , but this does not immediately appear to raise new conceptual issues. Weighting for nonresponse is different. Conventionally, such weighting might be used in surveys when auxiliary variables are available and when a noninformative nonresponse assumption might be plausible when these variables are conditioned upon. However, when the objective is to fit a regression model, it is natural to seek to control for nonresponse by including such auxiliary variables as covariates rather than by employing weights and this is what we have done here. The informative nature of the nonresponse in our application was not controlled for by using auxiliary variables as additional covariates and we do not anticipate that it would be by the use of weights constructed from the same auxiliary variables, but this topic might merit further investigation alongside the role of sampling weights.