Response-Dependent Sampling with Clustered and Longitudinal Data

  • Michael A. McIsaac
  • Richard J. Cook
Conference paper
Part of the Lecture Notes in Statistics book series (LNS, volume 211)


Prospective cohort studies typically involve repeated assessment of individuals to determine whether they have a particular health condition. The usual goal in such studies is to relate the presence of the condition to disease markers or exposure variables. Disease markers are often too difficult or costly to measure for all individuals in a sample. In such settings, two- and multi-phase sampling designs are routinely adopted to enable researchers to select individuals on whom these expensive markers are to be assessed. In this article we review the rationale and format of two-phase sampling designs in retrospective and cross-sectional studies. We then develop frameworks for multi-phase designs in the context of studies with clustered or longitudinal responses. Model-based and semi-parametric methods are discussed for estimation and inference.
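To make the idea of response-dependent two-phase sampling concrete, the following is a minimal sketch rather than the authors' method. It simulates a phase-one cohort with a binary response and an expensive binary marker, selects a phase-two subsample with known probabilities that depend on the response, and compares a naive complete-case estimate of marker prevalence with an inverse probability weighted one. All numeric values (marker prevalence 0.3, response probabilities 0.6 and 0.1, selection probabilities 0.9 and 0.1) and the function names are illustrative assumptions.

```python
import random


def simulate_two_phase(n, p_case_sel=0.9, p_ctrl_sel=0.1, seed=0):
    """Simulate a toy two-phase design (all probabilities are assumed values).

    Returns a list of (y, x, r, pi) tuples: response, expensive marker,
    phase-two selection indicator, and the known selection probability.
    """
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        x = 1 if rng.random() < 0.3 else 0           # expensive marker
        p_y = 0.6 if x == 1 else 0.1                 # response depends on marker
        y = 1 if rng.random() < p_y else 0           # response, known for everyone
        pi = p_case_sel if y == 1 else p_ctrl_sel    # response-dependent selection
        r = 1 if rng.random() < pi else 0            # 1 if marker is measured
        cohort.append((y, x, r, pi))
    return cohort


def naive_mean(cohort):
    """Complete-case mean of the marker, ignoring the sampling design."""
    selected = [x for (_, x, r, _) in cohort if r == 1]
    return sum(selected) / len(selected)


def ipw_mean(cohort):
    """Inverse probability weighted (ratio-type) mean of the marker."""
    num = sum(x / pi for (_, x, r, pi) in cohort if r == 1)
    den = sum(1 / pi for (_, _, r, pi) in cohort if r == 1)
    return num / den


if __name__ == "__main__":
    cohort = simulate_two_phase(20000, seed=1)
    print("naive:", round(naive_mean(cohort), 3))
    print("IPW:  ", round(ipw_mean(cohort), 3))
```

Under these assumed values the true marker prevalence is 0.3. Because individuals with the condition are oversampled, the naive estimate is pulled toward the prevalence among cases, while weighting each selected individual by the inverse of the known selection probability corrects for the design. The same weighting idea carries over to estimating equations for regression parameters, which is the setting the abstract describes.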


Augmented inverse probability weighting; asymptotic efficiency; inverse probability weighting; maximum likelihood; response-dependent sampling; two-phase sampling; clustering; longitudinal data



Michael McIsaac’s research was supported by an Alexander Graham Bell Canada Graduate Scholarship from the Natural Sciences and Engineering Research Council of Canada (NSERC) and Discovery Grants to Richard Cook from NSERC (RGPIN 155849) and the Canadian Institutes of Health Research (FRN 13887). Richard Cook is a Canada Research Chair in Statistical Methods for Health Research. The authors thank Dr. Dafna Gladman and Dr. Vinod Chandran for collaboration and helpful discussions regarding the research at the Centre for Prognosis Studies in Rheumatic Disease at the University of Toronto. The authors gratefully acknowledge the careful review and comments from a referee and Dr. Brajendra Sutradhar.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Canada
