
Cluster Correlations and Complexity in Binary Regression Analysis Using Two-stage Cluster Samples


In a two-stage cluster sampling setup for binary data, a sample of clusters (such as hospitals) is chosen at the first stage from a large number of clusters belonging to a finite population, and at the second stage a random sample of individuals (such as nurses) is chosen from each selected cluster; the binary responses along with covariates are then collected from the selected individuals. Because the hypothetical binary responses from the individuals in a given cluster/hospital under the first stage sample are correlated (as they share a common cluster effect), this correlation plays a complex role in developing the second stage sample based estimating equations for the underlying regression parameters. Moreover, the correlation parameters must also be consistently estimated. In this paper, unlike the existing studies, we demonstrate how to accommodate (1) the so-called inverse correlation weights arising from a finite population based generalized quasi-likelihood (GQL) estimating function, on top of (2) the sampling weights, to develop a survey sample based doubly weighted (SSDW) estimation approach for consistent estimation of both regression and correlation parameters. For simplicity, we refer to this GQL cum SSDW approach as the SSDW approach. A method of moments (MM) cum SSDW approach would be simpler but less efficient, and it is not considered in this paper. The estimating function in the proposed SSDW estimating equation has the form of a sample total, which unbiasedly estimates the corresponding finite population total arising from the aforementioned GQL function for the targeted finite population parameter. The resulting SSDW estimators are thus consistent for the respective parameters. This consistency property of the SSDW estimators for both regression and cluster correlation parameters is studied in detail.





The author would like to thank two reviewers and the Associate Editor for their valuable comments and suggestions leading to the improvement of the paper.


No funding was used to complete this research.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Brajendra C. Sutradhar.

Ethics declarations

Conflict of Interests

There is no conflict of interest.



Appendix A: Computation of the Mixed Effects Based Marginal Mean (1.7), and Covariance Matrix (1.8)

Computation of unconditional marginal mean

Use \(\gamma ^{*}_{c}=\gamma _{c}/\sigma _{\gamma }\) in (1.2), and re-express the conditional mean as

$$ \begin{array}{@{}rcl@{}} \pi^{*}_{ci}(\beta,\sigma^{2}_{\gamma}, \gamma^{*}_{c})&=&\frac{\exp(x^{\prime}_{ci}\beta+\sigma_{\gamma}\gamma^{*}_{c})}{[1+\exp(x^{\prime}_{ci}\beta+ \sigma_{\gamma}\gamma^{*}_{c})]}, \end{array} $$

where \(\gamma ^{*}_{c} {\stackrel {iid}{\sim }} N(0,1)\). One may then compute the unconditional mean as

$$ \begin{array}{@{}rcl@{}} E_{M}[Y_{ci}]&=&\pi_{ci}(\beta,\sigma^{2}_{\gamma})=\int \pi^{*}_{ci}(\beta,\sigma^{2}_{\gamma}, \gamma^{*}_{c})g_{N}(\gamma^{*}_{c})d\gamma^{*}_{c}, \end{array} $$

where \(g_{N}(\gamma ^{*}_{c})\) denotes the standard normal density.
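The integral defining the unconditional mean has no closed form, so in practice it has to be evaluated numerically. The following is a minimal sketch, assuming Gauss-Hermite quadrature as the integration rule (the appendix does not state which numerical rule is used). The function name `marginal_mean` and the covariate and parameter values are hypothetical and serve only as an illustration.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def marginal_mean(x, beta, sigma_gamma, n_nodes=40):
    """Approximate pi_ci = E[pi*_ci(gamma*)] with gamma* ~ N(0, 1)."""
    # Nodes and weights for the weight function exp(-t^2 / 2);
    # the weights sum to sqrt(2 * pi), so we rescale to the N(0, 1) density.
    nodes, weights = hermegauss(n_nodes)
    eta = np.dot(x, beta) + sigma_gamma * nodes
    pi_star = 1.0 / (1.0 + np.exp(-eta))  # conditional mean at each node
    return float(np.sum(weights * pi_star) / np.sqrt(2.0 * np.pi))


# Hypothetical covariate vector (with intercept) and parameter values.
x_ci = np.array([1.0, 0.5])
beta = np.array([-0.3, 0.8])

print(marginal_mean(x_ci, beta, sigma_gamma=0.0))  # plain logistic mean
print(marginal_mean(x_ci, beta, sigma_gamma=1.0))  # mean with a cluster effect
```

With `sigma_gamma = 0` the cluster effect vanishes and the function returns the ordinary logistic probability, which gives a quick sanity check of the quadrature.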

Computation of unconditional covariance matrix

First, because \(y_{ci}\) is a binary response, its variance is written as

$$ \text{var}_{M}[Y_{ci}|x_{ci}]=\sigma_{c,ii}(\beta,\sigma^{2}_{\gamma}) =\pi_{ci}(\beta,\sigma^{2}_{\gamma})(1-\pi_{ci}(\beta,\sigma^{2}_{\gamma})), $$

where \(\pi _{ci}(\beta ,\sigma ^{2}_{\gamma })\) is the unconditional mean, given by (a.2).

Next, because the responses of the individuals within a cluster are pairwise independent given the cluster effect, we write

$$ \begin{array}{@{}rcl@{}} &&\text{cov}_{M}[\{Y_{ci},Y_{cj}\}|x_{ci},x_{cj},\gamma_{c}]=0, \end{array} $$

implying that

$$ \begin{array}{@{}rcl@{}} E_{M}[\{Y_{ci}Y_{cj}\}|\gamma_{c}]&=&E_{M}[Y_{ci}|\gamma_{c}]E_{M}[Y_{cj}|\gamma_{c}] \\ &=&\pi^{*}_{ci}(\beta,\sigma^{2}_{\gamma}, \gamma^{*}_{c})\pi^{*}_{cj}(\beta,\sigma^{2}_{\gamma}, \gamma^{*}_{c}). \end{array} $$

Hence, the unconditional covariance between \(y_{ci}\) and \(y_{cj}\) is given by

$$ \begin{array}{@{}rcl@{}} \text{cov}_{M}[\{Y_{ci},Y_{cj}\}|x_{ci},x_{cj}]&=&\sigma_{c,ij}(\beta, \sigma^{2}_{\gamma}) \\ &=&\lambda_{c,ij}(\beta,\sigma^{2}_{\gamma})-\pi_{ci}(\beta,\sigma^{2}_{\gamma}) \pi_{cj}(\beta,\sigma^{2}_{\gamma}), \end{array} $$

where


$$ \begin{array}{@{}rcl@{}} \lambda_{c,ij}(\beta,\sigma^{2}_{\gamma}) &=&E_{M}[Y_{ci}Y_{cj}]=E_{\gamma_{c}}E[\{Y_{ci}Y_{cj}\}|\gamma_{c}] \\ &=& E_{\gamma^{*}}[\pi^{*}_{ci}(\beta,\sigma^{2}_{\gamma}, \gamma^{*}_{c})\pi^{*}_{cj}(\beta,\sigma^{2}_{\gamma}, \gamma^{*}_{c})] \\ &=&\int \frac{\exp[(x_{ci}+x_{cj})'\beta+2\sigma_{\gamma} \gamma^{*}_{c}]}{[1+\exp(x^{\prime}_{ci}\beta+\sigma_{\gamma}\gamma^{*}_{c})] [1+\exp(x^{\prime}_{cj}\beta+\sigma_{\gamma}\gamma^{*}_{c})]}g_{N}(\gamma^{*}_{c})d\gamma^{*}_{c} \\ &=&\int \pi^{*}_{ci}(\beta,\sigma^{2}_{\gamma},\gamma^{*}_{c})\pi^{*}_{cj}(\beta,\sigma^{2}_{\gamma}, \gamma^{*}_{c}) g_{N}(\gamma^{*}_{c})d\gamma^{*}_{c}. \end{array} $$
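The variance and covariance formulas of this appendix can be assembled into a full within-cluster covariance matrix. The sketch below is an illustration only: it again assumes Gauss-Hermite quadrature for the integrals over \(\gamma^{*}_{c}\), and the function name `cluster_mean_and_cov` and the numerical inputs are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def cluster_mean_and_cov(X, beta, sigma_gamma, n_nodes=40):
    """Return (pi_c, Sigma_c) for one cluster with covariate matrix X (N_c x p)."""
    nodes, weights = hermegauss(n_nodes)
    w = weights / np.sqrt(2.0 * np.pi)            # quadrature weights for N(0, 1)
    eta = (X @ beta)[:, None] + sigma_gamma * nodes[None, :]
    pi_star = 1.0 / (1.0 + np.exp(-eta))          # pi*_ci at each node, N_c x n_nodes
    pi = pi_star @ w                              # unconditional means pi_ci
    lam = (pi_star * w) @ pi_star.T               # lambda_{c,ij} = E[pi*_ci pi*_cj]
    sigma = lam - np.outer(pi, pi)                # covariances for i != j
    np.fill_diagonal(sigma, pi * (1.0 - pi))      # binary variances pi_ci (1 - pi_ci)
    return pi, sigma


# Hypothetical cluster with three individuals and two covariates.
X_c = np.array([[1.0, 0.2], [1.0, -0.5], [1.0, 1.0]])
beta = np.array([-0.3, 0.8])

pi_c, sigma_c = cluster_mean_and_cov(X_c, beta, sigma_gamma=1.0)
print(pi_c)
print(sigma_c)
```

The diagonal is overwritten because the variance of a binary response is \(\pi_{ci}(1-\pi_{ci})\), not \(\lambda_{c,ii}-\pi^{2}_{ci}\). Setting `sigma_gamma = 0` makes all off-diagonal entries vanish, as expected when there is no shared cluster effect.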

Appendix B: Computation of the Covariance Matrix \(V^{*}_{n}(\beta ,\sigma ^{2}_{\gamma })\) in (3.33)

Using the indicator variables defined in (3.21)-(3.22), we express the matrix in (3.33) as

$$ \begin{array}{@{}rcl@{}} &&V^{*}_{n}(\beta,\sigma^{2}_{\gamma}) =\text{cov}_{p_{1}}\left[\frac{K}{k}{\sum}^{K}_{c=1}\frac{N_{c}}{n_{c}} \delta_{1,c}E_{p_{2c}}\left\{{\sum}^{N_{c}}_{i=1} \delta_{2,i|c}z_{ci}|p_{1}\right\}\right] \\ &+&E_{p_{1}}\left[(K^{2}/k^{2}){\sum}^{K}_{c=1}\frac{{N^{2}_{c}}}{{n^{2}_{c}}}\delta_{1,c} \text{cov}_{p_{2c}}\left\{{\sum}^{N_{c}}_{i=1}\delta_{2,i|c}z_{ci}|p_{1}\right\} \right]. \end{array} $$

Computational formula for the first term in (b.1)

Notice that, for hypothetically known \(z_{ci}\) under the finite population (FP), the expectation with respect to the second stage sampling design \(p_{2c}\) in the first term may be computed as

$$ \begin{array}{@{}rcl@{}} E_{p_{2c}}\left\{{\sum}^{N_{c}}_{i=1} \delta_{2,i|c}z_{ci}|p_{1}\right\} &=&{\sum}^{N_{c}}_{i=1} E_{p_{2c}}[\delta_{2,i|c}]z_{ci}|p_{1} \\ &=&\frac{n_{c}}{N_{c}}{\sum}^{N_{c}}_{i=1}z_{ci}=\frac{n_{c}}{N_{c}}Z_{c}, \text{(say).} \end{array} $$

Because the first stage sample of clusters is chosen by SRS without replacement, by substituting (b.2) in the first term in (b.1), we can compute the covariance over the sampling design \(p_{1}\) as

$$ \begin{array}{@{}rcl@{}} &&\text{cov}_{p_{1}}\!\left[\frac{K}{k}{\sum}^{K}_{c=1} \delta_{1,c}Z_{c}\right] = \frac{K^{2}}{k^{2}}\!\left[{\sum}^{K}_{c=1}Z_{c}Z^{\prime}_{c}\text{var}[\delta_{1,c}] + {\sum}^{K}_{c \neq d}Z_{c}Z^{\prime}_{d} \text{cov}[\delta_{1,c},\delta_{1,d}]\right]\!. \end{array} $$

Now, because \(\delta_{1,c}\) is the indicator variable defined by (3.21) under the sampling design \(p_{1}\) (SRS without replacement), we have

$$ \begin{array}{@{}rcl@{}} \text{var}(\delta_{1,c})&=&\frac{k}{K}(1-\frac{k}{K})\\ \text{cov}(\delta_{1,c},\delta_{1,d})&=&E(\delta_{1,c}\delta_{1,d}) -E(\delta_{1,c})E(\delta_{1,d}) \\ &=&\frac{k(k-1)}{K(K-1)}-\left( \frac{k}{K}\right)^{2} =-\frac{k}{K(K-1)}(1-\frac{k}{K}). \end{array} $$
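These two moments can be checked exactly by listing every possible SRS without replacement sample for a small population. The sketch below does this with exact rational arithmetic; the helper name `indicator_moments` and the values of `K` and `k` are hypothetical choices for illustration.

```python
from fractions import Fraction
from itertools import combinations


def indicator_moments(K, k):
    """Exact var(delta_c) and cov(delta_c, delta_d) under SRS without replacement."""
    samples = list(combinations(range(K), k))
    p = Fraction(1, len(samples))                        # all samples equally likely
    e1 = sum(p for s in samples if 0 in s)               # E[delta_0]
    e12 = sum(p for s in samples if 0 in s and 1 in s)   # E[delta_0 * delta_1]
    return e1 * (1 - e1), e12 - e1 * e1


K, k = 6, 3
var_delta, cov_delta = indicator_moments(K, k)
f = Fraction(k, K)

# Compare the enumerated moments with the closed-form expressions.
print(var_delta == f * (1 - f))
print(cov_delta == -Fraction(k, K * (K - 1)) * (1 - f))
```

By symmetry of the design, it is enough to look at clusters 0 and 1; any other pair gives the same values.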

Substitute (b.4) in (b.3), and write

$$ \begin{array}{@{}rcl@{}} &&\text{cov}_{p_{1}}\left[\frac{K}{k}{\sum}^{K}_{c=1} \delta_{1,c}Z_{c}\right]=\frac{K^{2}}{k^{2}}\frac{k}{K}(1-\frac{k}{K})\left[ {\sum}^{K}_{c=1}Z_{c}Z^{\prime}_{c}-\frac{1}{K-1}{\sum}^{K}_{c \neq d}Z_{c}Z^{\prime}_{d}\right] \\ &=&\frac{K}{K-1}\frac{K}{k}(1-\frac{k}{K})\left[\frac{(K-1)}{K} {\sum}^{K}_{c=1}Z_{c}Z^{\prime}_{c}-\frac{1}{K}{\sum}^{K}_{c \neq d}Z_{c}Z^{\prime}_{d}\right] \\ &=&\frac{1}{k}\frac{K^{2}}{K-1}(1-\frac{k}{K})\left[{\sum}^{K}_{c=1}Z_{c}Z^{\prime}_{c} -\frac{1}{K}\left\{{\sum}^{K}_{c=1}Z_{c}Z^{\prime}_{c}+{\sum}^{K}_{c \neq d}Z_{c}Z^{\prime}_{d}\right\}\right] \\ &=&\frac{1}{k}\frac{K^{2}}{K-1}(1-\frac{k}{K})\left[{\sum}^{K}_{c=1}Z_{c}Z^{\prime}_{c} -\frac{1}{K}\left\{{\sum}^{K}_{c=1}Z_{c} {\sum}^{K}_{c=1}Z^{\prime}_{c}\right\}\right] \\ &=&\frac{1}{k}\frac{K^{2}}{K-1}(1-\frac{k}{K})\left[{\sum}^{K}_{c=1}(Z_{c}-\bar{Z}) (Z_{c}-\bar{Z})'\right] \\ &=&K^{2}\left( \frac{K-k}{K}\right)\frac{1}{k}V_{1\cdot}(\beta,\sigma^{2}_{\gamma}), \end{array} $$

where we have used

$$\bar{Z}=\frac{1}{K}{\sum}^{K}_{c=1}Z_{c}, \quad \text{and} \quad V_{1\cdot}(\beta,\sigma^{2}_{\gamma})=\frac{1}{K-1} {\sum}^{K}_{c=1}(Z_{c}-\bar{Z})(Z_{c}-\bar{Z})'.$$
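As a numerical check on the first term, one may enumerate all first stage samples of a small artificial population and compare the design covariance of \(\frac{K}{k}\sum_{c}\delta_{1,c}Z_{c}\) with the closed form \(K^{2}\left(\frac{K-k}{K}\right)\frac{1}{k}V_{1\cdot}\). The sketch below is an illustration under assumed inputs: the cluster totals `Z` are randomly generated placeholders, and both function names are hypothetical.

```python
import numpy as np
from itertools import combinations


def first_stage_cov_enumerated(Z, k):
    """Design covariance of (K/k) * sum of sampled Z_c over all SRSWOR samples."""
    K = Z.shape[0]
    totals = np.array(
        [(K / k) * Z[list(s)].sum(axis=0) for s in combinations(range(K), k)]
    )
    centered = totals - totals.mean(axis=0)    # samples are equally likely
    return centered.T @ centered / len(totals)


def first_stage_cov_formula(Z, k):
    """Closed form K^2 * ((K - k) / K) * (1 / k) * V_1."""
    K = Z.shape[0]
    V1 = np.cov(Z, rowvar=False, ddof=1)       # divisor K - 1, as in V_1
    return K**2 * ((K - k) / K) * V1 / k


rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 2))                    # placeholder cluster totals, K = 5

print(np.allclose(first_stage_cov_enumerated(Z, 2), first_stage_cov_formula(Z, 2)))
```

The enumeration is feasible only for very small `K`; it is meant as a check of the algebra, not as a computational method.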

Computational formula for the second term in (b.1)

First, we obtain the covariance matrix over the second stage sampling design \(p_{2c}\) as

$$ \begin{array}{@{}rcl@{}} &&\text{cov}_{p_{2c}}\left\{{\sum}^{N_{c}}_{i=1}\delta_{2,i|c}z_{ci}|p_{1}\right\} \\ &=&{\sum}^{N_{c}}_{i=1}\text{var}[\delta_{2,i|c}]z_{ci}z^{\prime}_{ci} + {\sum}^{N_{c}}_{i \neq j}\text{cov}[\delta_{2,i|c},\delta_{2,j|c}]z_{ci}z^{\prime}_{cj} \\ &=&\frac{n_{c}}{N_{c}}(1-\frac{n_{c}}{N_{c}}){\sum}^{N_{c}}_{i=1}z_{ci}z^{\prime}_{ci} -\frac{n_{c}}{N_{c}(N_{c}-1)}(1-\frac{n_{c}}{N_{c}}) {\sum}^{N_{c}}_{i \neq j}z_{ci}z^{\prime}_{cj}, \end{array} $$

by using formulas analogous to those in (b.4). Furthermore, by algebra similar to that in (b.5), (b.6) reduces to

$$ \begin{array}{@{}rcl@{}} &&\text{cov}_{p_{2c}}\left\{{\sum}^{N_{c}}_{i=1}\delta_{2,i|c}z_{ci}|p_{1}\right\} \\ &=&\frac{n_{c}}{N_{c}}(1-\frac{n_{c}}{N_{c}})\frac{N_{c}}{N_{c}-1} \left[\frac{(N_{c}-1)}{N_{c}}{\sum}^{N_{c}}_{i=1}z_{ci}z^{\prime}_{ci}-\frac{1}{N_{c}} {\sum}^{N_{c}}_{i \neq j}z_{ci}z^{\prime}_{cj}\right] \\ &=&(1-\frac{n_{c}}{N_{c}})\frac{n_{c}}{N_{c}-1}{\sum}^{N_{c}}_{i=1}(z_{ci} - \bar{Z}_{c}) (z_{ci} - \bar{Z}_{c})' = n_{c}(1 - \frac{n_{c}}{N_{c}})V^{*}_{c}(\beta,\sigma^{2}_{\gamma}), \text{(say),} \end{array} $$

where we have used \(\bar {Z}_{c}=\frac {Z_{c}}{N_{c}}=\frac {1}{N_{c}}{\sum }^{N_{c}}_{i=1}z_{ci}\). After substituting (b.7) in the second term in (b.1), we take the expectation over the first stage sampling design \(p_{1}\), which yields the formula for the second term as

$$ \begin{array}{@{}rcl@{}} &&E_{p_{1}}\left[(K^{2}/k^{2}){\sum}^{K}_{c=1}\frac{{N^{2}_{c}}}{{n^{2}_{c}}}\delta_{1,c} \text{cov}_{p_{2c}}\left\{{\sum}^{N_{c}}_{i=1}\delta_{2,i|c}z_{ci}|p_{1}\right\} \right] \\ &=&\left[(K^{2}/k^{2}){\sum}^{K}_{c=1}\frac{{N^{2}_{c}}}{{n^{2}_{c}}}E_{p_{1}}[\delta_{1,c}] n_{c}(1-\frac{n_{c}}{N_{c}})V^{*}_{c}(\beta,\sigma^{2}_{\gamma}) \right] \\ &=&\frac{K}{k}{\sum}^{K}_{c=1}{N^{2}_{c}}\frac{N_{c}-n_{c}}{N_{c}}\frac{1}{n_{c}} V^{*}_{c}(\beta,\sigma^{2}_{\gamma}). \end{array} $$

Finally, by combining (b.5) and (b.8), we obtain the covariance matrix \(V^{*}_{n}(\beta ,\sigma ^{2}_{\gamma })\) in (b.1), which is reported in (3.34) in Section 3.2.2.
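The combined result can be verified on a tiny artificial population by enumerating every two-stage sample. The sketch below adds the two terms derived in (b.5) and (b.8) and compares the sum with the exact design covariance of the doubly expanded sample total. It is an illustration under assumed inputs: the \(z_{ci}\) values, cluster sizes and sample sizes are placeholders, the function names are hypothetical, and equation (3.34) itself is not reproduced here.

```python
import numpy as np
from itertools import combinations, product
from math import comb


def two_stage_cov_formula(z, n, k):
    """Sum of the first-stage term (b.5) and the second-stage term (b.8)."""
    K = len(z)
    Z = np.array([zc.sum(axis=0) for zc in z])           # cluster totals Z_c
    term1 = K**2 * ((K - k) / K) * np.cov(Z, rowvar=False, ddof=1) / k
    term2 = sum(
        len(zc) ** 2 * ((len(zc) - nc) / len(zc)) / nc
        * np.cov(zc, rowvar=False, ddof=1)               # V*_c with divisor N_c - 1
        for zc, nc in zip(z, n)
    )
    return term1 + (K / k) * term2


def two_stage_cov_enumerated(z, n, k):
    """Exact design covariance obtained by listing all two-stage samples."""
    K, dim = len(z), z[0].shape[1]
    mean, second = np.zeros(dim), np.zeros((dim, dim))
    for s in combinations(range(K), k):
        within = [list(combinations(range(len(z[c])), n[c])) for c in s]
        prob = 1.0 / comb(K, k)
        for c in s:
            prob /= comb(len(z[c]), n[c])                # second-stage probability
        for picks in product(*within):
            est = (K / k) * sum(
                (len(z[c]) / n[c]) * z[c][list(idx)].sum(axis=0)
                for c, idx in zip(s, picks)
            )
            mean += prob * est
            second += prob * np.outer(est, est)
    return second - np.outer(mean, mean)


rng = np.random.default_rng(1)
sizes, n, k = [3, 4, 3], [2, 2, 2], 2                    # placeholder design
z = [rng.normal(size=(N_c, 2)) for N_c in sizes]         # placeholder z_ci values

print(np.allclose(two_stage_cov_formula(z, n, k), two_stage_cov_enumerated(z, n, k)))
```

Because the number of second stage samples differs across clusters, the two-stage samples are not equally likely, which is why each one is weighted by its own selection probability in the enumeration.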


Cite this article

Sutradhar, B.C. Cluster Correlations and Complexity in Binary Regression Analysis Using Two-stage Cluster Samples. Sankhya A (2022).



Keywords

  • Cluster correlation effects
  • Consistency
  • Doubly weighted estimation
  • Finite population based estimating equations
  • Mixed effects based proportion
  • Regression parameters in proportion
  • Two-stage cluster sampling

Mathematics Subject Classification (2010)

  • Primary 62F10, 62H20
  • Secondary 62F12