Abstract
For large cohort studies with rare outcomes, the nested case-control design requires data collection only for small subsets of the individuals at risk. These are typically sampled at random at the observed event times, and a weighted, stratified analysis takes over the role of the full cohort analysis. Motivated by observational studies on the impact of hospital-acquired infection on hospital stay outcome, we are interested in situations where it is not necessarily the outcome that is rare, but a time-dependent exposure, such as the occurrence of an adverse event or disease progression. Using the counting process formulation of general nested case-control designs, we propose three sampling schemes where not all commonly observed outcomes need to be included in the analysis. Rather, inclusion probabilities may be time-dependent and may even depend on the past sampling and exposure history. A bootstrap analysis of a full cohort data set from hospital epidemiology allows us to investigate the practical utility of the proposed sampling schemes in comparison to a full cohort analysis and to an overly simple application of the nested case-control design when the outcome is not rare.
References
Aalen OO, Borgan Ø, Gjessing HK (2008) Survival and event history analysis. Springer, New York
Andersen PK, Keiding N (2012) Interpretability and importance of functionals in competing risks and multistate models. Stat Med 31(11–12):1074–1088
Andersen PK, Borgan Ø, Gill RD, Keiding N (1993) Statistical models based on counting processes. Springer, New York
Bang CN, Gislason GH, Greve AM, Bang CA, Lilja A, Torp-Pedersen C, Andersen PK, Køber L, Devereux RB, Wachtell K (2014) New-onset atrial fibrillation is associated with cardiovascular events leading to death in a first time myocardial infarction population of 89 703 patients with long-term follow-up: a nationwide study. J Am Heart Assoc 3(1):e000382
Beyersmann J, Gastmeier P, Grundmann H, Bärwolff S, Geffers C, Behnke M, Rüden H, Schumacher M (2006) Use of multistate models to assess prolongation of intensive care unit stay due to nosocomial infection. Infect Control Hosp Epidemiol 27(5):493–499
Beyersmann J, Wolkewitz M, Schumacher M (2008) The impact of time-dependent bias in proportional hazards modelling. Stat Med 27(30):6439–6454
Beyersmann J, Allignol A, Schumacher M (2012) Competing risk and multistate models with R. Springer, New York
Borgan Ø, Keogh RH (2015) Nested case–control studies: should one break the matching? Lifetime Data Anal 21(4):517–541
Borgan Ø, Samuelsen SO (2013) Nested case-control and case-cohort studies. In: Klein JP, van Houwelingen HC, Ibrahim JG, Scheike TH (eds) Handbook of survival analysis. Chapman & Hall/CRC, Boca Raton, pp 343–367
Borgan Ø, Goldstein L, Langholz B (1995) Methods for the analysis of sampled cohort data in the Cox proportional hazards model. Ann Stat 23(5):1749–1778
Borgan Ø, Langholz B, Samuelsen SO, Goldstein L, Pogoda J (2000) Exposure stratified case-cohort designs. Lifetime Data Anal 6(1):39–58
Breslow NE (2014) Lessons in biostatistics. In: Lin X, Genest C, Banks DL, Molenberghs G, Scott DW, Wang JL (eds) Past, present and future of statistical science. Chapman and Hall/CRC, Boca Raton, pp 335–347
Breslow NE, Wellner JA (2007) Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression. Scand J Stat 34(1):86–102
Cox DR (1972) Regression models and life-tables. J R Stat Soc 34(2):187–220
Essebag V, Platt RW, Abrahamowicz M, Pilote L (2005) Comparison of nested case–control and survival analysis methodologies for analysis of time-dependent exposure. BMC Med Res Methodol 5(1):5
García Rodríguez LA, Soriano-Gabarró M, Bromley S, Lanas A, Cea Soriano L (2017) New use of low-dose aspirin and risk of colorectal cancer by stage at diagnosis: a nested case–control study in UK general practice. BMC Cancer 17(1):637
Goldstein L, Langholz B (1992) Asymptotic theory for nested case–control sampling in the Cox regression model. Ann Stat 20(4):1903–1928
Grundmann H, Glasner C, Albiger B, Aanensen DM, Tomlinson CT, Andrasević AT, Cantón R, Carmeli Y, Friedrich AW, Giske CG, Glupczynski Y, Gniadkowski M, Livermore DM, Nordmann P, Poirel L, Rossolini GM, Seifert H, Vatopoulos A, Walsh T, Woodford N, Monnet DL (2017) Occurrence of carbapenemase-producing Klebsiella pneumoniae and Escherichia coli in the European survey of carbapenemase-producing Enterobacteriaceae (EuSCAPE): a prospective, multinational study. Lancet Infect Dis 17(2):153–163
Gutiérrez-Gutiérrez B, Sojo-Dorado J, Bravo-Ferrer J, Cuperus N, de Kraker M, Kostyanev T, Raka L, Daikos G, Feifel J, Folgori L, Pascual A, Goossens H, O’Brien S, Bonten MJM, Rodríguez-Baño J (2017) European prospective cohort study on Enterobacteriaceae showing REsistance to Carbapenems (EURECA): a protocol of a European multicentre observational study. BMJ Open 7(4):e015365
Keogh RH, Cox DR (2014) Case–control studies. Institute of Mathematical Statistics Monographs. Cambridge University Press, Cambridge
Keogh RH, Mangtani P, Rodrigues L, Nguipdop Djomo P (2016) Estimating time-varying exposure-outcome associations using case–control data: logistic and case-cohort analyses. BMC Med Res Methodol 16(1):2
Kessing LV, Gerds TA, Knudsen NN, Jørgensen LF, Kristiansen SM, Voutchkova D, Ernstsen V, Schullehner J, Hansen B, Andersen PK, Ersbøll AK (2017) Association of lithium in drinking water with the incidence of dementia. JAMA Psychiatry 74(10):1005–1010
Langholz B, Borgan Ø (1995) Counter-matching: a stratified nested case–control sampling method. Biometrika 82(1):69–79
Langholz B, Clayton D (1994) Sampling strategies in nested case–control studies. Environ Health Perspect 102:47–51
Leffondre K, Wynant W, Cao Z, Abrahamowicz M, Heinze G, Siemiatycki J (2010) A weighted Cox model for modelling time-dependent exposures in the analysis of case–control studies. Stat Med 29(7–8):839–850
Lin D (2000) On fitting Cox’s proportional hazards models to survey data. Biometrika 87(1):37–47
Lumley T (2011) Complex surveys: a guide to analysis using R. Wiley Series in Survey Methodology. Wiley, New York
Oakes D (1981) Survival times: aspects of partial likelihood. Int Stat Rev 49(3):235–252
Ohneberg K, Wolkewitz M, Beyersmann J, Palomar-Martinez M, Olaechea-Astigarraga P, Alvarez-Lerma F, Schumacher M (2015) Analysis of clinical cohort data using nested case–control and case-cohort sampling designs. Methods Inf Med 54(6):505–514
Paixão ES, da Conceição N, Costa M, Teixeira MG, Harron K, de Almeida ME, Barreto ML, Rodrigues LC (2017) Symptomatic dengue infection during pregnancy and the risk of stillbirth in Brazil, 2006–12: a matched case–control study. Lancet Infect Dis 17(9):957–964
Pang D (1999) A relative power table for nested matched case–control studies. Occup Environ Med 56(1):67–69
Prentice RL (1986) A case–cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 73(1):1–11
R Core Team (2017) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Thomas DC (1977) Addendum to ‘methods of cohort analysis: appraisal by application to asbestos mining’ by Liddell, Francis. D. K. and Mcdonald, John C. and Thomas, Duncan C. and Cunliffe, Stella V. J R Stat Soc 140:483–485
Wolkewitz M, Beyersmann J, Gastmeier P, Martin S (2009) Efficient risk set sampling when a time-dependent exposure is present. Methods Inf Med 48(5):438–443
World Health Organization (WHO) (2014) Antimicrobial resistance: global report on surveillance. http://www.who.int/drugresistance/documents/surveillancereport/en/. Accessed 14 Dec 2017
Acknowledgements
This work was supported by Grant BE-4500/1-2 of the German Research Foundation (DFG).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendices
A Theoretical background
The following presentation is based on Borgan et al. (1995).
Fix \([0, \tau ]\) with \(\tau \in (0,\infty ]\) and consider a cohort \(\mathscr {C}=\{1,\ldots ,n\}\) and a probability space \((\varOmega , \mathscr {F}, \mathbb {P})\). On this space, the marked point process \(\{(t_j, i_j), j\ge 1\}\) consists of the failure times \(t_j\in [0,\tau ]\) and the marks \(i_j\in \mathscr {C}\), each of which identifies the outcome positive individual at that point in time. The information available at the time origin and generated by this process over time is contained in the filtration \((\mathscr {F}_{t})_{t\ge 0}\). We assume the counting process of the observed failure times to be
and its intensity is \(\lambda _i(t)=Y_i(t)\alpha _0(t)\exp ({\varvec{\beta }}^T\mathbf {z}_i(t))\). The at-risk indicator \(Y_i\) and the covariates \(\mathbf {z}_i(t)\) are assumed to be left-continuous and adapted to \((\mathscr {F}_{t})_{t\ge 0}\). Let \({\widetilde{\mathscr {R}}}_j\) denote the sampled risk set consisting of the controls together with the matched outcome positive individual \(i_j\). The resulting marked point process is \(\{(t_j, (i_j, {\widetilde{\mathscr {R}}}_j)), j\ge 1\}\) with the finite mark space \(E=\{(i,{R}){:}\,{R} \in \mathscr {P}, i \in {R} \}=\{(i,{R}){:}\,{R} \subset \mathscr {C}, {R} \in \mathscr {P}_i \}\), where \(\mathscr {P}\) is the powerset of \(\mathscr {C}\). Using this construction, we extend \((\mathscr {F}_{t})_{t\ge 0}\) to \((\mathscr {H}_{t})_{t\ge 0}=(\mathscr {F}_{t}\vee \sigma ({\widetilde{\mathscr {R}}}_j; t_j\le t))_{t\ge 0}\) by the sampling process. For each tuple \((i,{R})\in E\), there exists a corresponding counting process
that counts all observed failure times of individual i in [0, t] within the sampled risk set \({R} \). We assume that the sampling procedure is independent in the sense that the \(\mathscr {F}_{t}\)-intensity processes of \(N_{i}(t):=\sum _{{R} \in \mathscr {P}_i}N_{(i,R)}(t)\) and their counterparts with respect to \(\mathscr {H}_{t}\) coincide (Andersen et al. 1993, Sec. III.2). Using \(\pi \left( {R} |t,i\right) :=\mathbb {P}\left( {R} \text { sampled at } t|dN_i(t)=1, \mathscr {H}_{t-}\right) \) with \(\pi \left( {R} |t,i\right) =0\) if \(Y_i(t)=0\) and \(\pi \left( {R} |t,i\right) =0\) if \(i\notin {R} \), we obtain
as the intensity process for the counting process (7), where \(\pi \left( {R} |t\right) =n_\bullet ^{-1}(t)\sum _{i=1}^{n}\pi \left( {R} |t,i\right) \) and \(w_i(t,{R})={\pi \left( {R} |t,i\right) }/{\pi \left( {R} |t\right) }\) characterizes the weight. The inference is based on the partial likelihood given by
In conclusion, only outcome positive individuals with their respective time of failure \(t_i\) as well as the corresponding sampled risk sets \({\widetilde{\mathscr {R}}}_i\) contribute to the inference based on this model. The estimator \({\widehat{{\varvec{\beta }}}}\) is obtained by maximizing the partial likelihood (9). Theoretical properties and asymptotic results can be obtained from Borgan et al. (1995).
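As a computational illustration of this inference step, the following minimal Python sketch evaluates a weighted log partial likelihood. It assumes that (9) has the usual form from Borgan et al. (1995), in which each failure time contributes \(\exp ({\varvec{\beta }}^T\mathbf {z}_{\text {case}})\,w_{\text {case}}\) divided by the sum of \(\exp ({\varvec{\beta }}^T\mathbf {z}_l)\,w_l\) over the sampled risk set. The function name and the input layout are hypothetical and chosen only for this example.

```python
import math


def log_partial_likelihood(beta, sampled_risk_sets):
    """Weighted log partial likelihood (assumed form, see lead-in).

    sampled_risk_sets: list of (case_pos, members), one entry per failure time.
      members  - list of (z, w): covariate vector z and weight w > 0 of each
                 individual in the sampled risk set, evaluated at the failure time
      case_pos - position of the case within members
    """
    total = 0.0
    for case_pos, members in sampled_risk_sets:
        # beta^T z + log(w) for every member of the sampled risk set
        terms = [
            sum(b * zk for b, zk in zip(beta, z)) + math.log(w)
            for z, w in members
        ]
        # log of the denominator via log-sum-exp for numerical stability
        m = max(terms)
        log_denominator = m + math.log(sum(math.exp(t - m) for t in terms))
        total += terms[case_pos] - log_denominator
    return total
```

A sampled risk set that contains only the case contributes zero to the log partial likelihood, because numerator and denominator coincide.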
The question of how to specify \(\pi \left( \cdot |t,i\right) \) for every t where \(Y_i(t)=1\) arises immediately. The probability can be based on the information available up to but not including time t, i.e. \(\pi \left( {R} |t, i\right) \) is left-continuous and adapted. Using this, we develop the new sampling procedure for investigating the association between a time-dependent exposure and the outcome by simultaneously sampling with respect to this exposure.
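To make the role of \(\pi \left( {R} |t,i\right) \) concrete, the sketch below computes sampling probabilities and weights for the classical nested case-control design, in which \(m-1\) controls are drawn uniformly without replacement from the individuals at risk other than the case. This is an illustrative choice and not one of the schemes proposed here; the function names are hypothetical. Exact fractions are used so that the result can be compared with the closed form \(n_\bullet (t)/m\).

```python
from fractions import Fraction
from math import comb


def classical_ncc_pi(R, i, at_risk, m):
    """pi(R | t, i) when case i is matched with m - 1 controls drawn
    uniformly without replacement from the at-risk set without i."""
    R, at_risk = frozenset(R), frozenset(at_risk)
    if i not in at_risk or i not in R or len(R) != m or not R <= at_risk:
        return Fraction(0)
    return Fraction(1, comb(len(at_risk) - 1, m - 1))


def weight(R, i, at_risk, m):
    """w_i(t, R) = pi(R | t, i) / pi(R | t), with
    pi(R | t) = n^{-1} * sum over at-risk l of pi(R | t, l).
    Raises ZeroDivisionError if R cannot be sampled at all."""
    n = len(at_risk)
    pi_R = Fraction(1, n) * sum(classical_ncc_pi(R, l, at_risk, m) for l in at_risk)
    return classical_ncc_pi(R, i, at_risk, m) / pi_R


at_risk = frozenset(range(1, 6))  # five individuals at risk
R = {1, 3}                        # case 1 with one sampled control
print(weight(R, 1, at_risk, m=2))  # 5/2, i.e. n/m
```

Because every member of \({R}\) receives the same weight in this design, the weights cancel in each factor of the partial likelihood. Weights that differ within a sampled risk set, as under exposure-dependent sampling, do not cancel.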
For the NECC, we consider a random variable \(B{:}\,\left( \varOmega ,[0,\tau ]\right) \rightarrow \{0,1\}\) indicating whether an individual should be considered as a case within the partial likelihood, i.e. whether controls should be assigned to that observed outcome event. Further, we write \(B(t):=B(\omega , t)\), \(\mathscr {R}_\bullet (t)=\{i{:}\,Y_i(t\text {-})=1\}\) and \(\mathbb {P}_{t\text {-}}({\widetilde{\mathscr {R}}}(t)={R}):=\mathbb {P}({\widetilde{\mathscr {R}}}(t)={R} |dN_i(t)=1, \mathscr {H}_{t\text {-}})\). We consider the failure time \(t_j\) and assume that for the respective individual the sampled risk set only contains \(i_j\), i.e. no controls are sampled. The contribution to the likelihood is then given by
Formalizing the NECC sampling design discussed in Sect. 2, we obtain as the sampled risk sets
Using \({\widetilde{\mathscr {R}}}_j\) and defining \(\mathscr {R}_0(t)=\{i:Y_i(t\text {-})=1, x_i(t)=0\}\) and \(n_\bullet (t)={\text {card}}\left( \mathscr {R}_\bullet (t)\right) \), we derive
where \(\mathbb {P}_{t\text {-}}(x_i(t\text {-})=a)=\mathbb {1} \left\{ x_i(t\text {-})=a \right\} \) for \(a\in \{0,1\}\). Equation (10) can be used for the calculation of the denominator of the weight. The structure of the random variable B allows for several sampling procedures within the NCC.
We choose \(B(t)\sim \text {Ber}(q(t))\), i.e. independently Bernoulli distributed with probability \(q(t)\in (0,1]\). In the simplest setting, \(q(t)\) is deterministic from the very beginning. The sampling probabilities and weights can be calculated with \(m_0(t)={\text {card}}\left( \mathscr {R}_0(t)\cap {\widetilde{\mathscr {R}}}(t)\right) \) by
In Sect. 2.4 we set \(q=1\) to state the history-dependent sampling scheme. This contradicts the requirement above, since \(q(t)=0\) if the inequality in (6b) is fulfilled. A motivation for excluding zero from the interval of the inclusion probability is as follows: Let \(q(t)=0\) for some t. Then weights take the form
meaning that whenever all members of a risk set share the same exposure value, the weight in (11a) will be infinite. If there are different exposure levels in a risk set, the set will be either uninformative [the ratio in Eq. (9) is one] or destructive for the partial likelihood, since the ratio equals zero [see Eq. (11b)]. In either case, the estimation of the log hazard ratio by the partial likelihood will be disrupted.
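The Bernoulli inclusion step with \(q(t)\in (0,1]\) can be sketched in Python as follows. This covers only the draw of \(B(t)\) and the guard against \(q(t)=0\); how \(B\) enters the NECC design is specified in Sect. 2 and is not reproduced. The function name is hypothetical, and the caller is assumed to supply a \(q\) that uses only information available before time t.

```python
import random


def include_as_case(t, q, rng):
    """Draw B(t) ~ Ber(q(t)) for an outcome event observed at time t.

    q   - callable returning the inclusion probability at time t
    rng - random.Random instance, passed in for reproducibility
    """
    p = q(t)
    # q(t) = 0 is excluded: the weights would otherwise break down
    if not 0.0 < p <= 1.0:
        raise ValueError(f"inclusion probability must lie in (0, 1], got {p}")
    # rng.random() lies in [0, 1), so p = 1 always includes the event
    return rng.random() < p


rng = random.Random(2017)
event_times = [0.5, 1.2, 3.4, 7.9]
# Hypothetical deterministic choice: include every second event on average
cases = [t for t in event_times if include_as_case(t, lambda s: 0.5, rng)]
```

Setting \(q\equiv 1\) retains every observed outcome event as a case.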
Sampling non-exposed individuals as cases is mandatory for the NECC to meaningfully estimate the log hazard ratio \({\varvec{\beta }}\) within the Cox proportional hazards model. Assume we only consider exposed individuals as cases; then for every numerator in Eq. (9) we obtain
where \(\mathbf {y}=[y_1, \ldots , y_n]^T \in \mathbb {R}^n\). For ease of presentation, we consider the projection on the first component of \({\varvec{\beta }}\), i.e. the estimation of the regression parameter \(\beta _1\) associated with the main exposure. We stratify the sampled risk set into \({\widetilde{\mathscr {R}}}_i={\widetilde{\mathscr {R}}}_i^0{\dot{\cup }} {\widetilde{\mathscr {R}}}_i^1\) using the covariate values \(x(t_i)\). The \(w_i(t_i, {R})\) only depend on the exposure status within one fixed risk set
which is maximized by \(\beta _1=\infty \). This leads to \(\max _{\beta _1} \mathscr {L}(\beta _1) =\infty \) and thus, an inappropriate estimation if we only consider exposed individuals (\(x_i(t_i)=1\)) to become cases.
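The lack of a finite maximizer can be checked numerically on a toy example. The Python sketch below assumes a single binary exposure, cases that are all exposed, and the usual weighted form of the partial likelihood; the weights and risk sets are made-up numbers for illustration, not values from the SIR 3 data. Each sampled risk set must contain at least one non-exposed member for the increase to be strict.

```python
import math


def log_lik_exposed_cases_only(beta1, risk_sets):
    """Weighted log partial likelihood when every case is exposed.

    risk_sets: list of (w_case, w_exposed, w_unexposed)
      w_case      - weight of the (exposed) case
      w_exposed   - weights of all exposed members, case included
      w_unexposed - weights of all non-exposed members
    """
    total = 0.0
    for w_case, w_exposed, w_unexposed in risk_sets:
        numerator = math.exp(beta1) * w_case
        denominator = math.exp(beta1) * sum(w_exposed) + sum(w_unexposed)
        total += math.log(numerator / denominator)
    return total


# Hypothetical sampled risk sets with made-up weights
toy = [(1.0, [1.0], [1.0, 1.0]), (2.0, [2.0, 2.0], [1.0])]
values = [log_lik_exposed_cases_only(b, toy) for b in range(-3, 8)]
# The criterion increases with every step in beta1
print(all(a < b for a, b in zip(values, values[1:])))  # True
```

Since the criterion keeps increasing in \(\beta _1\), a numerical optimizer would drift towards ever larger values instead of converging.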
B Results for the traditional nested case-control design
Table 5 follows the structure of Table 1 in the main document and gives the results for a bootstrap analysis of the SIR 3 data using the traditional NCC.
Cite this article
Feifel, J., Gebauer, M., Schumacher, M. et al. Nested exposure case-control sampling: a sampling scheme to analyze rare time-dependent exposures. Lifetime Data Anal 26, 21–44 (2020). https://doi.org/10.1007/s10985-018-9453-4