Abstract
In a two-stage cluster sampling setup for binary data, a sample of clusters, such as hospitals, is chosen at the first stage from a large number of clusters belonging to a finite population. At the second stage, a random sample of individuals, such as nurses, is chosen from each selected cluster, and the binary responses along with covariates are collected from the selected individuals. Because the hypothetical binary responses from the individuals in a given cluster/hospital under the first stage sample are correlated (they share a common cluster effect), this correlation plays a complex role in developing the second stage sample based estimating equations for the underlying regression parameters. Moreover, the correlation parameters have to be consistently estimated too. In this paper, unlike the existing studies, we demonstrate how to accommodate (1) the so-called inverse correlation weights arising from a finite population based generalized quasi-likelihood (GQL) estimating function, on top of (2) the sampling weights, to develop a survey sample based doubly weighted (SSDW) estimation approach for consistent estimation of both regression and correlation parameters. For simplicity, we refer to this GQL cum SSDW approach as the SSDW approach only. The method of moments (MM) cum SSDW approach would be simpler but less efficient, and it is not included in the paper. The estimating function involved in the proposed SSDW estimating equation has the form of a sample total, which unbiasedly estimates the corresponding finite population total that arises from the aforementioned generalized quasi-likelihood function for the targeted finite population parameter. The resulting SSDW estimators thus become consistent for the respective parameters. This consistency property of the SSDW estimators of both regression and cluster correlation parameters is studied in detail.
Acknowledgments
The author would like to thank two reviewers and the Associate Editor for their valuable comments and suggestions leading to the improvement of the paper.
Funding
No funding was used to complete this research.
Ethics declarations
Conflict of Interests
There is no conflict of interest.
Appendices
Appendix A: Computation of the Mixed Effects Based Marginal Mean (1.7), and Covariance Matrix (1.8)
Computation of unconditional marginal mean
Use \(\gamma ^{*}_{c}=\gamma _{c}/\sigma _{\gamma }\) in (1.2), and re-express the conditional mean as
where \(\gamma ^{*}_{c} {\stackrel {iid}{\sim }} N(0,1)\). One may then compute the unconditional mean as
where \(g_{N}(\gamma ^{*}_{c})\) denotes the standard normal density.
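The integral over the standard normal density generally has no closed form, so in practice it is approximated numerically. The following is a minimal sketch, assuming that the conditional mean in (1.2) has the logistic form with linear predictor \(x_{ci}^{\prime }\beta +\sigma _{\gamma }\gamma ^{*}_{c}\); that form is not shown in this appendix and is an assumption here. The function name `marginal_mean` and its arguments are hypothetical, and Gauss-Hermite quadrature is one possible numerical choice, not necessarily the one used in the paper.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def marginal_mean(x, beta, sigma_gamma, n_nodes=40):
    """Approximate E[expit(x'beta + sigma_gamma * g)] for g ~ N(0, 1).

    The logistic conditional mean is an assumption of this sketch.
    """
    # Nodes and weights for the weight function exp(-g^2 / 2).
    nodes, weights = hermegauss(n_nodes)
    # Rescale so the weights integrate against the N(0, 1) density.
    weights = weights / np.sqrt(2.0 * np.pi)
    eta = float(np.dot(x, beta)) + sigma_gamma * nodes
    cond_mean = 1.0 / (1.0 + np.exp(-eta))  # assumed logistic conditional mean
    return float(np.sum(weights * cond_mean))


if __name__ == "__main__":
    # Hypothetical covariate vector and parameter values, for illustration only.
    print(marginal_mean([1.0, 0.5], [0.3, -0.2], sigma_gamma=0.8))
```

When \(\sigma _{\gamma }=0\) the quadrature collapses to the ordinary logistic mean, which gives a quick sanity check on the implementation.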
Computation of unconditional covariance matrix
First, because \(y_{ci}\) is a binary response, the formula for its variance is written as
where \(\pi _{ci}(\beta ,\sigma ^{2}_{\gamma })\) is the unconditional mean, given by (a.2).
Next, because given the cluster effect, the individuals within a cluster must be pair-wise independent, we write
implying that
Hence, the unconditional covariance between \(y_{ci}\) and \(y_{cj}\) is given by
where
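The steps of this subsection can be sketched as follows, using only the binary nature of the responses and their conditional independence given the cluster effect. The symbols \(\pi ^{*}_{ci}(\gamma ^{*}_{c})\) for the conditional mean and \(\pi _{c,ij}\) for the unconditional joint probability are notation assumed for this sketch; the paper's own displays may be labelled differently.

```latex
\begin{align*}
\operatorname{var}(y_{ci})
  &= \pi_{ci}(\beta,\sigma^{2}_{\gamma})\,
     \bigl[1-\pi_{ci}(\beta,\sigma^{2}_{\gamma})\bigr], \\
E\bigl[y_{ci}\,y_{cj}\mid \gamma^{*}_{c}\bigr]
  &= \pi^{*}_{ci}(\gamma^{*}_{c})\,\pi^{*}_{cj}(\gamma^{*}_{c}),
     \qquad i\neq j, \\
\pi_{c,ij}(\beta,\sigma^{2}_{\gamma})
  &\equiv E\bigl[y_{ci}\,y_{cj}\bigr]
   = \int_{-\infty}^{\infty}
     \pi^{*}_{ci}(\gamma^{*}_{c})\,\pi^{*}_{cj}(\gamma^{*}_{c})\,
     g_{N}(\gamma^{*}_{c})\,d\gamma^{*}_{c}, \\
\operatorname{cov}(y_{ci},y_{cj})
  &= \pi_{c,ij}(\beta,\sigma^{2}_{\gamma})
     -\pi_{ci}(\beta,\sigma^{2}_{\gamma})\,
      \pi_{cj}(\beta,\sigma^{2}_{\gamma}).
\end{align*}
```

The first line holds for any binary variable, and the remaining lines follow from the law of iterated expectations, so the sketch does not depend on the particular link function in (1.2).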
Appendix B: Computation of the Covariance Matrix \(V^{*}_{n}(\beta ,\sigma ^{2}_{\gamma })\) in (3.33)
By applying the indicator variables from (3.21)-(3.22), we express the formula of this matrix from (3.33), as
Computational formula for the first term in (b.1)
Notice that for hypothetically known \(z_{ci}\) under the FP, the expectation with respect to the sampling design \(p_{2c}\), in the first term, may be computed as
Because the first stage sample of clusters is chosen by SRS without replacement, by substituting (b.2) in the first term in (b.1), we can compute the covariance over the sampling design \(p_{1}\), as
Now because \(\delta _{1,c}\) is the indicator variable as defined by (3.21) under the sampling design \(p_{1}\) (SRS without replacement), we have
Substitute (b.4) in (b.3), and write
where we have used
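For orientation, the standard facts behind this first term can be sketched as follows. The notation is assumed for illustration and is not taken from the paper: \(K\) clusters in the finite population, \(k\) clusters in the first stage sample, \(\hat {T}\) for the sample-total estimating function, a first stage weight of \(K/k\), and cluster totals \(Z_{c}\) with mean \(\bar {Z}\). The exact weights in (3.33) may differ.

```latex
\begin{align*}
\operatorname{cov}(\hat{T})
  &= \operatorname{cov}_{p_{1}}\bigl[E_{p_{2c}}(\hat{T})\bigr]
   + E_{p_{1}}\bigl[\operatorname{cov}_{p_{2c}}(\hat{T})\bigr], \\
E_{p_{1}}(\delta_{1,c}) &= \frac{k}{K}, \qquad
\operatorname{var}_{p_{1}}(\delta_{1,c})
   = \frac{k}{K}\Bigl(1-\frac{k}{K}\Bigr), \\
\operatorname{cov}_{p_{1}}(\delta_{1,c},\delta_{1,c'})
  &= -\frac{1}{K-1}\,\frac{k}{K}\Bigl(1-\frac{k}{K}\Bigr),
     \qquad c\neq c', \\
\operatorname{cov}_{p_{1}}\Bigl[\frac{K}{k}\sum_{c=1}^{K}\delta_{1,c}Z_{c}\Bigr]
  &= \frac{K^{2}}{k}\Bigl(1-\frac{k}{K}\Bigr)\,
     \frac{1}{K-1}\sum_{c=1}^{K}
     (Z_{c}-\bar{Z})(Z_{c}-\bar{Z})^{\prime}.
\end{align*}
```

The last line follows by expanding the covariance of the weighted sum and inserting the indicator moments from the two lines above it.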
Computational formula for the second term in (b.1)
First, we obtain the covariance matrix over the second stage sampling design \(p_{2c}\), as
by using a formula similar to (b.4). Furthermore, by algebra similar to that in (b.5), (b.6) reduces to
where we have used \(\bar {Z}_{c}=\frac {Z_{c}}{N_{c}}=\frac {1}{N_{c}}{\sum }^{N_{c}}_{i=1}z_{ci}\). After putting (b.7) in the second term in (b.1), we take the desired expectation over the first stage sampling design \(p_{1}\), which yields the formula for the second term, as
Finally, by combining (b.5) and (b.8), we obtain the covariance matrix \(V^{*}_{n}(\beta ,\sigma ^{2}_{\gamma })\) in (b.1), which is reported in (3.34), under Section 3.2.2.
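The two-term structure of this derivation can be checked numerically on a toy finite population. The sketch below is illustrative only: it uses scalar \(z_{ci}\) values instead of the vector-valued estimating function, assumes SRS without replacement at both stages with the same second stage sample size in every cluster, and uses the simple expansion weights \(K/k\) and \(N_{c}/n_{c}\). All function names and data are hypothetical. It compares the classical two-stage variance formula for an estimated total with the exact variance obtained by enumerating every possible sample.

```python
from itertools import combinations, product
from statistics import variance  # sample variance with divisor (n - 1)


def formula_variance(clusters, k, n_c):
    """Between- and within-cluster variance terms for the expansion estimator."""
    K = len(clusters)
    totals = [sum(c) for c in clusters]
    between = K**2 * (1 - k / K) * variance(totals) / k
    within = (K / k) * sum(
        len(c) ** 2 * (1 - n_c / len(c)) * variance(c) / n_c for c in clusters
    )
    return between, within


def enumerated_moments(clusters, k, n_c):
    """Exact mean and variance of the estimator over all two-stage samples."""
    K = len(clusters)
    # Every possible estimate of each cluster total under second stage SRSWOR.
    per_cluster = [
        [len(c) / n_c * sum(s) for s in combinations(c, n_c)] for c in clusters
    ]
    first_stage = list(combinations(range(K), k))
    mean = 0.0
    second_moment = 0.0
    for s1 in first_stage:
        outcomes = list(product(*(per_cluster[c] for c in s1)))
        prob = 1.0 / (len(first_stage) * len(outcomes))
        for estimates in outcomes:
            t_hat = K / k * sum(estimates)
            mean += prob * t_hat
            second_moment += prob * t_hat**2
    return mean, second_moment - mean**2


if __name__ == "__main__":
    toy = [[1.0, 2.0, 4.0], [3.0, 5.0], [2.0, 2.0, 7.0, 9.0]]  # made-up values
    between, within = formula_variance(toy, k=2, n_c=2)
    mean, exact = enumerated_moments(toy, k=2, n_c=2)
    print("population total:", sum(map(sum, toy)), "estimator mean:", mean)
    print("formula:", between + within, "enumeration:", exact)
```

Enumeration is feasible only for very small populations, but it makes the check exact and avoids Monte Carlo error.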
Cite this article
Sutradhar, B.C. Cluster Correlations and Complexity in Binary Regression Analysis Using Two-stage Cluster Samples. Sankhya A 85, 829–884 (2023). https://doi.org/10.1007/s13171-022-00281-8
DOI: https://doi.org/10.1007/s13171-022-00281-8
Keywords
- Cluster correlation effects
- Consistency
- Doubly weighted estimation
- Finite population based estimating equations
- Mixed effects based proportion
- Regression parameters in proportion
- Two-stage cluster sampling.