Abstract
Outcome-dependent sampling (ODS) designs such as the case–control or case–cohort design are widely used in epidemiological studies for their outstanding cost-effectiveness. In this article, we propose and develop a smoothed weighted Gehan estimating equation approach for inference in an accelerated failure time model under a general failure time outcome-dependent sampling scheme. The proposed estimating equation is continuously differentiable and can be solved by standard numerical methods. In addition to developing asymptotic properties of the proposed estimator, we also propose and investigate a new optimal power-based subsample allocation criterion for the proposed design, obtained by maximizing the power function of a significance test. Simulation results show that the proposed estimator is more efficient than existing competing estimators and that the optimal power-based subsample allocation provides an ODS design that yields improved power for the test of exposure effect. We illustrate the proposed method with a data set from the Norwegian Mother and Child Cohort Study to evaluate the relationship between exposure to perfluoroalkyl substances and women’s subfecundity.
References
Andersen PK, Gill RD (1982) Cox’s regression model for counting processes: a large sample study. Ann Stat 10:1100–1120
Breslow NE, Cain KC (1988) Logistic regression for two-stage case-control data. Biometrika 75:11–20
Breslow NE, Holubkov R (1997) Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. J R Stat Soc B 59:447–461
Brown BM, Wang YG (2007) Induced smoothing for rank regression with censored survival times. Stat Med 26:828–836
Cai J, Zeng D (2007) Power calculation for case-cohort studies with nonrare events. Biometrics 63:1288–1295
Chiou S, Kang S, Yan J (2014) Fast accelerated failure time modeling for case-cohort data. Stat Comput 24:559–568
Chen K (2001) Generalized case-cohort sampling. J R Stat Soc B 63:791–809
Ding J, Zhou H, Liu Y, Cai J, Longnecker M (2014) Estimating effect of environmental contaminants on women’s subfecundity for the MoBa study data with an outcome-dependent sampling scheme. Biostatistics 15:636–650
Fleming TR, Harrington DP (1991) Counting processes and survival analysis. Wiley, New York
Fygenson M, Ritov Y (1994) Monotone estimating equations for censored data. Ann Stat 22:732–746
Hájek J (1960) Limiting distributions in simple random sampling from a finite population. Publ Math Inst Hung Acad Sci 5:361–374
Jin Z, Lin DY, Wei LJ, Ying Z (2003) Rank-based inference for the accelerated failure time model. Biometrika 90:341–353
Johnson L, Strawderman R (2009) Induced smoothing for the semiparametric accelerated failure time model: asymptotics and extensions to clustered data. Biometrika 96:577–590
Kang S, Cai J (2009) Marginal hazards model for case-cohort studies with multiple disease outcomes. Biometrika 96:887–901
Kang S, Cai J, Chambless L (2013) Marginal additive hazards model for case-cohort studies with multiple disease outcomes: an application to the Atherosclerosis Risk in Communities (ARIC) study. Biostatistics 14:28–41
Kim S, Cai J, Lu W (2013) More efficient estimators for case-cohort studies. Biometrika 100:695–708
Kim J, Sit T, Ying Z (2016) Accelerated failure time model under general biased sampling scheme. Biostatistics 17:576–588
Kong L, Cai J (2009) Case-cohort analysis with accelerated failure time model. Biometrics 65:135–142
Kulich M, Lin DY (2000) Additive hazards regression with covariate measurement error. J Am Stat Assoc 95:238–248
Magnus P, Irgens L, Haug K, Nystad W, Skjærven R, Stoltenberg C, The MoBa Study Group (2006) Cohort profile: the Norwegian Mother and Child Cohort Study (MoBa). Int J Epidemiol 35:1146–1150
Nan B, Yu M, Kalbfleisch JD (2006) Censored linear regression for case-cohort studies. Biometrika 93:747–762
Novák P (2013) Goodness-of-fit test for accelerated failure time model based on martingale residuals. Kybernetika 49:40–59
Prentice RL (1978) Linear rank tests with right-censored data. Biometrika 65:167–179
Prentice RL (1986) A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 73:1–11
Prentice RL, Pyke R (1979) Logistic disease incidence models and case-control studies. Biometrika 66:403–412
Pollard D (1990) Empirical processes: theory and applications. Institute of Mathematical Statistics, Hayward
Schildcrout JS, Garbett SP, Heagerty PJ (2013) Outcome vector dependent sampling with longitudinal continuous response data: stratified sampling based on summary statistics. Biometrics 69:405–416
Schildcrout JS, Rathouz PJ, Zelnick LR, Garbett SP, Heagerty PJ (2015) Biased sampling designs to improve research efficiency: factors influencing pulmonary function over time in children with asthma. Ann Appl Stat 9:731–753
Song R, Zhou H, Kosorok M (2009) A note on semiparametric efficient inference for two-stage outcome-dependent sampling with a continuous outcome. Biometrika 96:221–228
Tsiatis AA (1990) Estimating regression parameters using linear rank tests for censored data. Ann Stat 18:354–372
Tan Z, Qin G, Zhou H (2016) Estimation of a partially linear additive model for data from an outcome-dependent sampling design with a continuous outcome. Biostatistics 17:663–676
van der Vaart AW, Wellner JA (1996) Weak convergence and empirical processes. Springer, New York
Wang X, Zhou H (2010) Design and inference for cancer biomarker study with an outcome and auxiliary-dependent subsampling. Biometrics 66:502–511
Weaver MA, Zhou H (2005) An estimated likelihood method for continuous outcome regression models with outcome-dependent sampling. J Am Stat Assoc 100:459–469
Weinberg C, Wacholder S (1993) Prospective analysis of case-control data under general multiplicative-intercept risk models. Biometrika 80:461–465
Whitworth KW, Haug LS, Baird DD, Becher G, Hoppin JA, Skjaerven R, Thomsen C, Eggesbo M, Travlos G, Wilson R, Longnecker MP (2012) Perfluorinated compounds and subfecundity in pregnant women. Epidemiology 23:257–263
Ying Z (1993) A large sample study of rank estimation for censored regression data. Ann Stat 21:76–99
Yu J, Liu Y, Cai J, Sandler DP, Zhou H (2016) Design and inference with an outcome-dependent sampling scheme under the Cox proportional hazards model. J Stat Plan Inference 178:24–36
Zeng D, Lin DY (2007) Efficient estimation for the accelerated failure time model. J Am Stat Assoc 102:1387–1396
Zhou H, Chen J, Rissanen T, Korrick S, Hu H, Salonen J, Longnecker MP (2007) Outcome-dependent sampling: an efficient sampling and inference procedure for studies with a continuous outcome. Epidemiology 18:461–468
Zhou H, Qin G, Longnecker M (2011) A partial linear model in the outcome-dependent sampling setting to evaluate the effect of prenatal PCB exposure on cognitive function in children. Biometrics 67:876–885
Zhou H, Weaver M, Qin J, Longnecker M, Wang MC (2002) A semiparametric empirical likelihood method for data from an outcome-dependent sampling scheme with a continuous outcome. Biometrics 58:413–421
Zhou H, Xu W, Zeng D, Cai J (2014) Semiparametric inference for data with a continuous outcome from a two-phase probability-dependent sampling scheme. J R Stat Soc B 76:197–215
Acknowledgements
This work is partly supported by the National Science Foundation of China Grants 11501578 and 11701571 (for Yu), and National Institutes of Health Grants P42ES031007 (Superfund), P30ES010126, and P01 CA142538 (for Cai and Zhou).
Appendix
In order to establish the asymptotic properties of the proposed estimator, we need the following two lemmas.
Lemma 1
Under Conditions \((C1){-}(C4)\) and (C6), \(m^{-1/2}{\tilde{U}}_{m,G}(\theta _0)\) is asymptotically normal with mean zero and covariance matrix \(\Sigma _{F}(\theta _0)+\Sigma _{O}(\theta _0)\).
Proof
Using the martingale representation \(M_i(\theta _0;t), i=1,\ldots , m\), we have
Obviously, the second term of (7.1) is equal to zero. Therefore, the term \(m^{-1/2}{\tilde{U}}_{m,G}(\theta _0)\) can be written as
Next, we will show the second term of (7.2) is asymptotically negligible, which means
For each i, \(W_iM_i(\theta _0;t)\) is a mean-zero process that can be expressed as a sum of two monotone processes on the interval \([-Con_M,Con_M]\). By Conditions (C1) and (C2) and the boundedness of the follow-up time, the range of integration \((-\infty ,+\infty )\) in (7.1) can be restricted to \([-Con_M,Con_M]\), a compact subset of the real line \({\mathcal {R}}\), where \(Con_M\) is a positive constant; this is similar to Condition A of Tsiatis (1990). Hence, the term \(m^{-1/2}\sum _{i=1}^{m} W_iM_i(\theta _0;t)\) converges weakly to a tight Gaussian process with continuous sample paths on \([-Con_M,Con_M]\) by Example 2.11.16 of van der Vaart and Wellner (1996). We assume \(X_i \ge 0\); otherwise, we decompose each \(X_i(\cdot )\) into its positive and negative parts. Moreover, \({\tilde{X}}(\theta _0;t)\) is a product of two monotone processes and converges uniformly in probability to \(e_X(\theta _0;t)\) on the compact set \([-Con_M,Con_M]\) in \({\mathcal {R}}\). Therefore, (7.3) holds by Lemma A.1 of Kulich and Lin (2000). Hence,
The first term of the right-side of (7.4) can be written as:
To simplify the expression, we define \(H_i(\theta _0)=\int _{-\infty }^{\infty }s^{(0)}(\theta _0;t) [X_i-e_{X}(\theta _0;t)]dM_i(\theta _0;t)\), so that (7.5) can be written as follows:
The five terms on the right-hand side of (7.6) have mean zero. Because \(E[\xi /(\rho _0\rho _V)]=1\), the covariance matrix between the first term and the second term is
By similar arguments, we can show that the five terms on the right-hand side of (7.6) are mutually uncorrelated. Moreover, each term is a sum of independent and identically distributed mean-zero random vectors. Using a slight extension of Hájek’s (1960) central limit theorem, \(m^{-1/2}{\tilde{U}}_{m,G}(\theta _{0})\) can be shown to converge in distribution to a mean-zero normal vector with covariance matrix \(\Sigma _{F}(\theta _0)+\Sigma _{O}(\theta _0)\), where \(\Sigma _F(\theta _0)=E[H_1(\theta _0)^{\otimes 2}]\), \(\Sigma _O(\theta _0)=\frac{1-\rho _0\rho _V}{\rho _0\rho _V}E[(1-\delta _1)H_1(\theta _0)^{\otimes 2}]+ \frac{1-\rho _0\rho _V}{\rho _0\rho _V}E[\delta _1(1-\zeta _1)H_1(\theta _0)^{\otimes 2}]+ \sum \limits _{k\in \{1,3\}}\frac{(1-\rho _0\rho _V)(\pi _k(1-\rho _0\rho _V)-\rho _k\rho _V)}{\rho _k\rho _V} E[\delta _1\zeta _{1,k}H_1(\theta _0)^{\otimes 2}]\), with \(a^{\otimes 2}=aa^{'}\) for a vector a. Therefore, Lemma 1 holds. \(\square \)
Lemma 2
Under Conditions \((C1){-}(C4)\), the weighted Gehan estimating function and the smoothed weighted Gehan estimating function are asymptotically equivalent:
Proof
By the induced smoothing method, we have
Due to the inequality \(I(e_j(\theta _0)-e_i(\theta _0)\ge 0)-\Phi (\frac{e_j(\theta _0) -e_i(\theta _0)}{r_{ij}})\le \Phi (-|\frac{e_j(\theta _0) -e_i(\theta _0)}{r_{ij}}|)\), we can obtain
Because \(\Phi (-x)\le (\sqrt{2\pi }x)^{-1}\exp \{-x^2/2\}\) for \(x>0\), we obtain \(\lim \nolimits _{x\rightarrow +\infty }x\Phi (-x)=0\). Since \(r_{ij}=\sqrt{\frac{(X_j-X_i)^{'}(X_j-X_i)}{m}}\), the term \(|\frac{e_j(\theta _0) -e_i(\theta _0)}{r_{ij}}|\Phi (-|\frac{e_j(\theta _0) -e_i(\theta _0)}{r_{ij}}|)=|\frac{\sqrt{m}[e_j(\theta _0) -e_i(\theta _0)]}{\sqrt{(X_j-X_i)^{'}(X_j-X_i)}}| \Phi (-|\frac{\sqrt{m}[e_j(\theta _0) -e_i(\theta _0)]}{\sqrt{(X_j-X_i)^{'}(X_j-X_i)}}|)\) goes to zero as m goes to infinity. Therefore, Lemma 2 holds by applying the strong law of large numbers (Pollard 1990). \(\square \)
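The two bounds used in this proof can be checked numerically. The following is a minimal sketch rather than part of the original argument: it assumes a hypothetical residual gap a = e_j − e_i = 0.3 and a unit covariate distance, so that r_ij = m^{-1/2}, and the helper names `smoothing_gap` and `scaled_tail` are illustrative only.

```python
import math

def norm_cdf(x):
    """Standard normal CDF, Phi(x), via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def norm_pdf(x):
    """Standard normal density, phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def smoothing_gap(a, r):
    """|I(a >= 0) - Phi(a / r)|: error from replacing the indicator by Phi."""
    indicator = 1.0 if a >= 0 else 0.0
    return abs(indicator - norm_cdf(a / r))

def scaled_tail(x):
    """x * Phi(-x), which is bounded by phi(x) for x > 0."""
    return x * norm_cdf(-x)

# Hypothetical residual gap a = e_j - e_i and unit covariate distance,
# so that r_ij = sqrt(1 / m). Both values are assumptions for illustration.
a = 0.3
for m in (10, 100, 1000, 10000):
    r = math.sqrt(1.0 / m)
    x = abs(a) / r
    # Columns: m, smoothing error, x * Phi(-x), and its upper bound phi(x).
    print(m, smoothing_gap(a, r), scaled_tail(x), norm_pdf(x))
```

In this sketch the smoothing error equals \(\Phi (-|a/r|)\) up to rounding, and the scaled tail \(x\Phi (-x)\) shrinks rapidly as m grows, which is the behaviour the proof relies on.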
Proof of Theorem 1:
Because \({\tilde{U}}_{m,G}(\theta )\) is the gradient of the convex objective function
a parameter estimator can be obtained by minimizing \(L_m(\theta )\) with respect to \(\theta \), and the resulting set of solutions is also convex. However, the lack of smoothness of \(L_m(\theta )\) presents computational challenges. Using standard results for normal random variables and integration by parts, we obtain
where \(\phi (\cdot )\) is the standard normal density function. A straightforward calculation shows that \({\bar{U}}_{m,G}(\theta )=\partial {\bar{L}}_m(\theta )/\partial \theta \). The smoothed objective function \({\bar{L}}_m(\theta )\) is convex and continuously differentiable. Hence, standard numerical methods can be used to obtain \({\widehat{\theta }}_m=\arg \min \nolimits _{\theta \in {\mathcal {B}}} {\bar{L}}_m(\theta )\). By Lemmas 1 and 2 of Johnson and Strawderman (2009), the respective minimizers \({\tilde{\theta }}_m\) and \({\widehat{\theta }}_m\) of \(L_m(\theta )\) and \({\bar{L}}_m(\theta )\) converge almost surely to \(\theta _0\) (Andersen and Gill 1982, Corollary II.2). By Taylor expansion of \({\bar{U}}_{m,G}(\theta )\) around \(\theta _0\), we have
where \(\theta ^*\) is between \(\theta \) and \(\theta _0\). Inserting \({\widehat{\theta }}_{m}\) in the above equation, we can obtain
with \(\theta ^*\) being between \({\widehat{\theta }}_{m}\) and \(\theta _0\). The asymptotic normality of \({\widehat{\theta }}_{m}\) can be established based on Lemmas 1 and 2, Condition (C5), and the consistency of \({\widehat{\theta }}_{m}\). Hence, Theorem 1 holds. \(\square \)
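To illustrate why a convex, continuously differentiable smoothed objective can be handled by standard numerical methods, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a scalar covariate, a pairwise Gehan-type loss smoothed through the identity \(E[\max (0,a+rZ)]=a\Phi (a/r)+r\phi (a/r)\) for standard normal Z, and placeholder weights `w` set to 1 (a simple random sample) instead of the ODS weights of the paper. The function names and the simulated data are hypothetical.

```python
import math
import random

def norm_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def smoothed_gehan(theta, log_y, delta, x, w):
    """Return (loss, gradient) of a smoothed weighted Gehan-type objective.

    log_y: log observed times; delta: event indicators; x: scalar covariate;
    w: sampling weights (placeholders here, not the paper's ODS weights).
    """
    m = len(x)
    loss = 0.0
    grad = 0.0
    for i in range(m):
        if not delta[i]:
            continue  # only uncensored subjects index the outer sum
        for j in range(m):
            a = (log_y[j] - theta * x[j]) - (log_y[i] - theta * x[i])
            r = abs(x[j] - x[i]) / math.sqrt(m)
            wij = w[i] * w[j]
            if r == 0.0:
                loss += wij * max(a, 0.0)  # equal covariates: term is free of theta
                continue
            z = a / r
            loss += wij * (a * norm_cdf(z) + r * norm_pdf(z))
            grad += wij * (x[i] - x[j]) * norm_cdf(z)
    return loss / m**2, grad / m**2

def solve_theta(log_y, delta, x, w, lo=-10.0, hi=10.0, iters=40):
    """Bisection on the gradient, which is nondecreasing because the loss is convex."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if smoothed_gehan(mid, log_y, delta, x, w)[1] > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def simulate(m, theta0, seed):
    """Hypothetical AFT data: log T = theta0 * X + error, independent censoring."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(m)]
    log_t = [theta0 * xi + rng.gauss(0.0, 0.5) for xi in x]
    log_c = [rng.uniform(0.0, 4.0) for _ in range(m)]
    log_y = [min(t, c) for t, c in zip(log_t, log_c)]
    delta = [t <= c for t, c in zip(log_t, log_c)]
    return log_y, delta, x

log_y, delta, x = simulate(m=100, theta0=1.0, seed=0)
w = [1.0] * len(x)  # placeholder weights: every subject fully sampled
theta_hat = solve_theta(log_y, delta, x, w)
print("theta_hat:", theta_hat)
```

Bisection is used only because the sketch has a single parameter; with a vector-valued \(\theta \), a Newton-type or quasi-Newton routine applied to the smoothed objective would play the same role.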
Cite this article
Yu, J., Zhou, H. & Cai, J. Accelerated failure time model for data from outcome-dependent sampling. Lifetime Data Anal 27, 15–37 (2021). https://doi.org/10.1007/s10985-020-09508-y
DOI: https://doi.org/10.1007/s10985-020-09508-y
Keywords
- Accelerated failure time model
- Induced smoothing
- Outcome-dependent sampling
- Wald statistic
- Survival data