Abstract
An outcome-dependent sample is generated by a stratified survey design in which the stratification depends on the outcome. It is also known as a case–control sample in epidemiological studies and a choice-based sample in econometric studies. An outcome-dependent enriched (ODE) sample results from combining an outcome-dependent sample with an independently collected random sample. Consider the situation where the conditional probability of a categorical outcome given its covariates follows an explicit model with an unknown parameter, whereas the marginal probabilities of the outcome and of its covariates are left unspecified. Profile-likelihood (PL) and weighted-likelihood (WL) methods have been employed to estimate the model parameter from an ODE sample. This article develops the PL- and WL-based families of tests on the model parameter from an ODE sample. Asymptotic properties of their test statistics are derived. The PL likelihood-ratio, Wald and score tests are shown to obey classical inference, i.e. their test statistics are asymptotically equivalent and Chi-squared distributed. In contrast, the WL likelihood-ratio statistic asymptotically has a weighted Chi-squared distribution and is not equivalent to the WL Wald and score statistics. Our theoretical derivation and simulation show that tests based on these new statistics attain the nominal type I error rate and have good power. Advantages of ODE sampling, together with the implementation of the PL and WL methods, are demonstrated in an illustrative example.
References
Agresti AA (2002) Categorical data analysis. Wiley-Interscience, Hoboken
Breslow NE, Cain KC (1988) Logistic regression for two-stage case–control data. Biometrika 75:11–20
Breslow NE (1996) Statistics in epidemiology: the case–control study. J Am Stat Assoc 91:14–28
Breslow N, McNeney B, Wellner JA (2003) Large sample theory for semiparametric regression models with two-phase, outcome dependent sampling. Ann Stat 31:1110–1139
Chatterjee N, Chen HY, Breslow NE (2003) A pseudoscore estimator for regression problems with two-phase sampling. J Am Stat Assoc 98:158–168
Chatterjee N, Chen YH (2007) Maximum likelihood inference on a mixed conditionally and marginally specified regression model for genetic epidemiologic studies with two-phase sampling. J R Stat Soc B 69:123–142
Chen HY (2003) A note on the prospective analysis of outcome-dependent samples. J R Stat Soc B 65:575–584
Cosslett SR (1981a) Efficient estimation of discrete-choice models. In: Manski C, McFadden D (eds) Structural analysis of discrete data with econometric applications. The MIT Press, Cambridge, pp 51–111
Cosslett SR (1981b) Maximum likelihood estimator for choice-based samples. Econometrica 49:1289–1316
Doll R, Hill AB (1950) Smoking and carcinoma of the lung. Br Med J 221:739–748
Doll R, Peto R, Boreham J, Sutherland I (2004) Mortality in relation to smoking: 40 years’ observations on male British doctors. Br Med J 328:1519–1527
Harville DA (1997) Matrix algebra from a statistician’s perspective. Springer, New York
Holt D, Ewings PD (1989) Logistic models for contingency tables. In: Skinner CJ, Holt D, Smith TMF (eds) Analysis of complex surveys. Wiley, New York, pp 261–279
Johnson NL, Kotz S (1970) Continuous univariate distributions, vol 2. Houghton Mifflin, Boston
Kang Q, Nelson PI, Vahl CI (2010) Parameter estimation from an outcome-dependent sample using weighted likelihood method. Statist Sinica 20:1529–1550
Kullback S (1997) Information theory and statistics. Dover Publications, New York
Manski CF, Lerman SR (1977) The estimation of choice probabilities from choice based samples. Econometrica 45:1977–1988
Manski CF, McFadden D (1981) Alternative estimators and sample designs for discrete choice analysis. In: Manski C, McFadden D (eds) Structural analysis of discrete data with econometric applications. The MIT Press, Cambridge, MA, pp 2–50
Manski CF, Thompson TS (1989) Estimation of best predictors of binary response. J Econ 40:97–123
Morgenthaler S, Vardi Y (1986) Choice-based samples: a nonparametric approach. J Econ 32:109–125
Prentice RL, Pyke R (1979) Logistic disease incidence models and case–control studies. Biometrika 66:403–411
Rao CR (1973) Linear statistical inference and its applications, 2nd edn. Wiley, New York
Rao JNK, Thomas DR (1989) Chi-squared tests for contingency table. In: Skinner CJ, Holt D, Smith TMF (eds) Analysis of complex surveys. Wiley, New York, pp 89–114
Roberts G, Rao JNK, Kumar S (1987) Logistic regression analysis of sample survey data. Biometrika 74:1–12
Rose S, van der Laan MJ (2009) Why match? Investigating matched case–control study designs with causal effect estimation. Int J Biostat 5: Article 1
Scott A, Wild C (1986) Fitting logistic models under case–control or choice based sampling. J R Stat Soc B 48:170–182
Scott AJ, Wild CJ (1997) Fitting regression models to case–control data by maximum likelihood. Biometrika 84:57–71
Vardi Y (1985) Empirical distributions in selection bias models. Ann Stat 13:178–203
Wang XF, Zhou HB (2006) A semiparametric empirical likelihood method for biased sampling schemes with auxiliary covariates. Biometrics 62:1149–1160
Zhou H, Weaver MA, Qin H, Longnecker MP, Wang MC (2002) A semiparametric empirical likelihood method for data from an outcome-dependent sampling scheme with a continuous outcome. Biometrics 58:413–421
Zhou H, Song R, Wu YS, Qin J (2011) Statistical inference for a two-stage outcome-dependent sampling design with a continuous outcome. Biometrics 67:194–202
Acknowledgments
We thank Paul I. Nelson for his constructive comments on this paper. We also thank the anonymous reviewer and the associate editor for their insightful comments and suggestions.
Appendices
Appendix A: Proof of Theorem 1
First we adopt Rao’s (1973, sec. 6e) strategy to convert \({ LR }^{W}\) into a quadratic form. Note that \(\nabla _{\varvec{\uptheta }} l_N^W (\hat{{\varvec{\uptheta }}}^{W})=\mathbf{0}\) and \(\nabla _{\varvec{\upbeta }} l_N^W ({\varvec{\Psi }}(\hat{{\varvec{\upbeta }}}^{W}))=\mathbf{0}\). The chain rule implies \(\nabla _{\varvec{\upbeta }} l_N^W ({\varvec{\Psi }}({\varvec{\upbeta }}))=\nabla _{\varvec{\uptheta }} l_N^W ({\varvec{\uptheta }})\mathbf{M}^{\prime }\). Subject \(\nabla _{\varvec{\uptheta }} l_N^W (\hat{{\varvec{\uptheta }}}^{W})\) and \(\nabla _{\varvec{\upbeta }} l_N^W ({\varvec{\Psi }}(\hat{{\varvec{\upbeta }}}^{W}))\) to first-order Taylor-series expansions at \({\varvec{\uptheta }}^{*}\) and \({\varvec{\upbeta }}^{*}\), respectively, and apply Lemma 1. This leads to
Perform second-order Taylor-series expansions on \(l_N^W ({\varvec{\uptheta }}^{*})\) at \(\hat{{\varvec{\uptheta }}}^{W}\) and \(\hat{{\varvec{\upbeta }}}^{W}\), separately, and take the difference. This leads to
Plugging Formula (6) into the above equation yields
From Johnson and Kotz (1970, pp. 150–151), \({ LR }^{W}\) has the same asymptotic distribution as \(\sum _{i=1}^q {[e_i^W \chi _i^2 (1,0)]} \), where \(e_1^W \ge \cdots \ge e_q^W \) are the eigenvalues of \(\mathbf{O}^{W}\mathbf{V}^{W}\). Note that \(-\mathbf{O}^{W}\mathbf{H}^{W}\) is idempotent of rank \(r\). By Lemma 1, both \(\mathbf{V}^{W}\) and \(-\mathbf{H}^{W}\) are positive definite (p.d.). Hence \(\mathbf{O}^{W}\) is positive semi-definite of rank \(r\) and, consequently, the eigenvalues of \(\mathbf{O}^{W}\mathbf{V}^{W}\) satisfy \(e_1^W \ge \cdots \ge e_r^W >0\) and \(e_{r+1}^W =\cdots =e_q^W =0\). This completes the proof of Theorem 1(i).
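The eigenvalue structure claimed above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes that \(\mathbf{O}^{W}\) is defined by analogy with \(\mathbf{O}^{P}\), i.e. \(\mathbf{O}^{W}=\mathbf{M}^{\prime }(\mathbf{MH}^{W}\mathbf{M}^{\prime })^{-1}\mathbf{M}-(\mathbf{H}^{W})^{-1}\), and that \(\mathbf{M}\) is a \((q-r)\times q\) matrix of full row rank. The matrices `H`, `V` and `M`, the sizes `q` and `r`, and the helper `weighted_chisq_tail` are hypothetical stand-ins chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
q, r = 5, 2  # hypothetical sizes: q parameters, r of them tested


def random_pd(k):
    """Return a random symmetric positive-definite k x k matrix."""
    a = rng.standard_normal((k, k))
    return a @ a.T + k * np.eye(k)


H = -random_pd(q)                    # stand-in for H^W (negative definite)
V = random_pd(q)                     # stand-in for V^W (positive definite)
M = rng.standard_normal((q - r, q))  # stand-in for M, assumed full row rank

# Assumed definition of O^W, mirroring the stated form of O^P.
O = M.T @ np.linalg.inv(M @ H @ M.T) @ M - np.linalg.inv(H)

# -O H should be idempotent of rank r.
A = -O @ H
idempotent = np.allclose(A @ A, A)
rank_A = np.linalg.matrix_rank(A)

# O should be positive semi-definite; O V should have r positive eigenvalues.
eig_O = np.linalg.eigvalsh((O + O.T) / 2)
weights = np.sort(np.linalg.eigvals(O @ V).real)[::-1]


def weighted_chisq_tail(stat, w, n_draws=100_000):
    """Monte Carlo estimate of P(sum_i w_i * chi2_i(1) >= stat)."""
    draws = rng.chisquare(df=1, size=(n_draws, len(w))) @ w
    return float(np.mean(draws >= stat))


# Approximate null tail probability for a hypothetical observed statistic.
p_value = weighted_chisq_tail(stat=3.0, w=weights[:r])
```

Only the \(r\) positive eigenvalues enter the tail probability, since the remaining \(q-r\) weights are zero. Simulation is used here purely for simplicity; an exact or saddlepoint method for weighted Chi-squared sums could be substituted.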
A similar strategy is used to derive the limiting distribution of \({ LR }^{P}\). Briefly, we have
where \(\mathbf{O}^{P}=\mathbf{N}^{\prime }(\mathbf{NH}^{P}\mathbf{N}^{\prime })^{-1}\mathbf{N}-(\mathbf{H}^{P})^{-1}\), \(\mathbf{N}=diag(\mathbf{M},\mathbf{I}_K )\). To prove Theorem 1(ii), it suffices to show that \(\mathbf{O}^{P}\mathbf{V}^{P}\) is idempotent of rank \(r\). By Lemma 1, \(\exists {\varvec{\Gamma }}\) such that . Given that the last \(K\) rows of \(\mathbf{N}\) are \((\mathbf{0}\quad \mathbf{I}_K )\) and that \(\mathbf{N}^{\prime }(\mathbf{NH}^{P}\mathbf{N}^{\prime })^{-1}\mathbf{NH}^{P}\mathbf{N}^{\prime }=\mathbf{N}^{\prime }\), we obtain
Obviously, \(-\mathbf{N}^{\prime }(\mathbf{NH}^{P}\mathbf{N}^{\prime })^{-1}\mathbf{NH}^{P}+\mathbf{I}_{q+K} \) is idempotent of rank \(r\).
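The idempotency and rank claim can be spelled out in two steps. This is a sketch under the assumption that \(\mathbf{N}\) has full row rank with \(q-r+K\) rows and \(q+K\) columns; these dimensions are inferred from the rank statements and are not given explicitly in this appendix. The symbol \(\mathbf{A}\) is introduced here only for illustration.

\[
\mathbf{A}=\mathbf{N}^{\prime }(\mathbf{NH}^{P}\mathbf{N}^{\prime })^{-1}\mathbf{NH}^{P},\qquad
\mathbf{A}^{2}=\mathbf{N}^{\prime }(\mathbf{NH}^{P}\mathbf{N}^{\prime })^{-1}(\mathbf{NH}^{P}\mathbf{N}^{\prime })(\mathbf{NH}^{P}\mathbf{N}^{\prime })^{-1}\mathbf{NH}^{P}=\mathbf{A},
\]

\[
(\mathbf{I}_{q+K}-\mathbf{A})^{2}=\mathbf{I}_{q+K}-2\mathbf{A}+\mathbf{A}^{2}=\mathbf{I}_{q+K}-\mathbf{A},\qquad
\mathrm{rank}(\mathbf{I}_{q+K}-\mathbf{A})=\mathrm{tr}(\mathbf{I}_{q+K}-\mathbf{A})=(q+K)-(q-r+K)=r.
\]

The rank step uses the fact that the rank of an idempotent matrix equals its trace, together with \(\mathrm{tr}(\mathbf{A})=\mathrm{tr}[(\mathbf{NH}^{P}\mathbf{N}^{\prime })^{-1}\mathbf{NH}^{P}\mathbf{N}^{\prime }]=q-r+K\).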
Appendix B: Proof of Theorem 3
Partition \(\mathbf{H}^{W}\) in accordance with \({\varvec{\uptheta }}=({\begin{array}{ll} {\varvec{\upalpha }}&{} {\varvec{\upbeta }} \\ \end{array} })\) into four submatrices: \(\mathbf{H}_{11}^W \), \(\mathbf{H}_{12}^W \), \(\mathbf{H}_{21}^W \), \(\mathbf{H}_{22}^W \). Let \(\mathbf{Q}^{W}=(\mathbf{I}_r \quad -\mathbf{H}_{12}^W (\mathbf{H}_{22}^W )^{-1})\), \(\mathbf{H}^{W11}=[\mathbf{H}_{11}^W -\mathbf{H}_{12}^W (\mathbf{H}_{22}^W )^{-1}\mathbf{H}_{21}^W ]^{-1}\), and . Theorem 8.5.11 of Harville (1997) states that , which yields
According to Lemma 1, Slutsky’s theorem and Formula (7), we have
Perform first-order Taylor-series expansions on \(\nabla _{\varvec{\upalpha }} l_N^W (\mathbf{a},\hat{{\varvec{\upbeta }}}^{W})\) and \(\nabla _{\varvec{\upbeta }} l_N^W (\mathbf{a},\hat{{\varvec{\upbeta }}}^{W})\) at \({\varvec{\upbeta }}^{*}\), respectively. It follows from \(\nabla _{\varvec{\upbeta }} l_N^W (\mathbf{a},\hat{{\varvec{\upbeta }}}^{W})=\mathbf{0}\) that
Also note that \(\nabla _{\varvec{\uptheta }} l_N^W (\hat{{\varvec{\uptheta }}}^{W})=\mathbf{0}\). Performing a first-order Taylor-series expansion on \(\nabla _{\varvec{\uptheta }} l_N^W (\hat{{\varvec{\uptheta }}}^{W})\) at \(({\begin{array}{ll} \mathbf{a}&{} {{\varvec{\upbeta }}^{*}} \\ \end{array} })\) and applying Formula (7) leads to
Formulas (8), (9) and (10) together show that \(Score^{W}=Wald^{W}+o_p (1)\). The asymptotic distribution of \(Score^{W}\) and \(Wald^{W}\) then follows directly from Lemma 1 and Johnson and Kotz (1970, pp. 150–151). This completes the proof of Theorem 3(i).
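The role of \(\mathbf{Q}^{W}\) and \(\mathbf{H}^{W11}\) in this proof rests on the standard partitioned-inverse (Schur complement) identity for a symmetric nonsingular matrix. The sketch below verifies that identity numerically. It is offered as an illustration of the textbook result and is not claimed to be the exact statement quoted from Harville (1997); the matrix `H` and the sizes `q` and `r` are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
q, r = 5, 2  # hypothetical sizes: alpha has r entries, beta has q - r

# Stand-in for H^W: a symmetric negative-definite q x q matrix.
a = rng.standard_normal((q, q))
H = -(a @ a.T + q * np.eye(q))

# Partition H conformably with theta = (alpha, beta).
H11, H12 = H[:r, :r], H[:r, r:]
H21, H22 = H[r:, :r], H[r:, r:]
H22_inv = np.linalg.inv(H22)

# Q^W = (I_r, -H12 H22^{-1}) and H^{W11} = inverse of the Schur complement.
Q = np.hstack([np.eye(r), -H12 @ H22_inv])
H_upper11 = np.linalg.inv(H11 - H12 @ H22_inv @ H21)

# Partitioned-inverse identity: H^{-1} = Q' H^{11} Q + diag(0, H22^{-1}).
pad = np.zeros((q, q))
pad[r:, r:] = H22_inv
reconstructed = Q.T @ H_upper11 @ Q + pad

H_inv = np.linalg.inv(H)
```

In this form, \(\mathbf{H}^{W11}\) is the upper-left \(r\times r\) block of \((\mathbf{H}^{W})^{-1}\), which is why it appears in statistics that concern \({\varvec{\upalpha }}\) alone.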
To prove Theorem 3(ii), first note that \(\mathbf{MH}^{W}\mathbf{M}^{\prime }=\mathbf{H}_{22}^W \). Perform second-order Taylor-series expansions on \(l_N^W ({\varvec{\uptheta }}^{*})\) and \(l_N^W (\mathbf{a},{\varvec{\upbeta }}^{*})\) at \(\hat{{\varvec{\uptheta }}}^{W}\) and \(\hat{{\varvec{\upbeta }}}^{W}\), separately, and take the difference. This generates
For \({\varvec{\upalpha }}^{*}=\mathbf{a}+N^{-1/2}{\varvec{\Delta }}\), we have
Recall that \(\hat{{\varvec{\uptheta }}}^{W}-{\varvec{\uptheta }}^{*}=-\nabla _{\varvec{\uptheta }} l_N^W ({\varvec{\uptheta }}^{*})(\mathbf{H}^{W})^{-1}+O_p (N^{-1})\). Collecting the information above, we then convert \({ LR }^{W}\) to a quadratic form as
Apply Cholesky decomposition to the \(r\times r\) p.d. matrix \(\mathbf{Q}^{W}\mathbf{V}^{W}(\mathbf{Q}^{W})^{\prime }\) to get \(\mathbf{Q}^{W}\mathbf{V}^{W}(\mathbf{Q}^{W})^{\prime }=\mathbf{LL}^{\prime }\). Let \(e_1^W \ge \cdots \ge e_r^W >0\) be the eigenvalues of \(-\mathbf{L}^{\prime }\mathbf{H}^{W11}\mathbf{L}\) and let \(\mathbf{P}\) be the associated orthogonal matrix of eigenvectors, i.e. \(-\mathbf{P}^{\prime }\mathbf{L}^{\prime }\mathbf{H}^{W11}\mathbf{LP}=diag(e_1^W ,\ldots ,e_r^W )\). Further denote by \(\mathbf{p}_i \) the \(i\)th row vector of \(\mathbf{P}\). From Johnson and Kotz (1970, pp. 150–151) and Lemma 1, \({ LR }^{W}\) has a limiting distribution of \(\sum _{i=1}^r {[e_i^W \chi _i^2 (1,\varphi _i )]} \), where \(\varphi _i =0.5{\varvec{\Delta }}(\mathbf{L}^{\prime }\mathbf{H}^{W11})^{-1}(\mathbf{p}_i )^{\prime }\mathbf{p}_i (\mathbf{H}^{W11}\mathbf{L})^{-1}{\varvec{\Delta }}^{\prime }\). Because \((\mathbf{L}^{\prime }\mathbf{H}^{W11})^{-1}\mathbf{P}^{\prime }\mathbf{P}(\mathbf{H}^{W11}\mathbf{L})^{-1}\) is p.d., \(\varphi _1 =\cdots =\varphi _r =0\) iff \({\varvec{\Delta }}=\mathbf{0}\). It is easy to see that \(\mathbf{O}^{W}\mathbf{V}^{W}=-(\mathbf{Q}^{W})^{\prime }\mathbf{H}^{W11}\mathbf{Q}^{W}\mathbf{V}^{W}\) has the same nonzero eigenvalues as \(-\mathbf{L}^{\prime }\mathbf{H}^{W11}\mathbf{L}\). This completes the proof of Theorem 3(ii).
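The final eigenvalue claim can be illustrated numerically. The following minimal sketch uses hypothetical stand-in matrices `H` and `V` and takes \(\mathbf{M}=(\mathbf{0}\quad \mathbf{I}_{q-r})\), which is an assumption consistent with \(\mathbf{MH}^{W}\mathbf{M}^{\prime }=\mathbf{H}_{22}^W \) but not stated explicitly in the text. It compares the eigenvalues of \(-\mathbf{L}^{\prime }\mathbf{H}^{W11}\mathbf{L}\) with those of \(\mathbf{O}^{W}\mathbf{V}^{W}\).

```python
import numpy as np

rng = np.random.default_rng(2)
q, r = 5, 2  # hypothetical sizes: alpha has r entries, beta has q - r


def random_pd(k):
    """Return a random symmetric positive-definite k x k matrix."""
    a = rng.standard_normal((k, k))
    return a @ a.T + k * np.eye(k)


H = -random_pd(q)  # stand-in for H^W (negative definite)
V = random_pd(q)   # stand-in for V^W (positive definite)

H22_inv = np.linalg.inv(H[r:, r:])
Q = np.hstack([np.eye(r), -H[:r, r:] @ H22_inv])                      # Q^W
H_upper11 = np.linalg.inv(H[:r, :r] - H[:r, r:] @ H22_inv @ H[r:, :r])  # H^{W11}

# Cholesky factor: Q V Q' = L L' with L lower triangular.
L = np.linalg.cholesky(Q @ V @ Q.T)

# Eigenvalues of the symmetric r x r matrix -L' H^{W11} L, in descending order.
e = np.sort(np.linalg.eigvalsh(-L.T @ H_upper11 @ L))[::-1]

# Eigenvalues of O V with O = -Q' H^{W11} Q, in descending order.
O = -Q.T @ H_upper11 @ Q
eig_OV = np.sort(np.linalg.eigvals(O @ V).real)[::-1]

# Assumed form of M under this parameterization: M = (0, I_{q-r}).
M = np.hstack([np.zeros((q - r, r)), np.eye(q - r)])
O_alt = M.T @ np.linalg.inv(M @ H @ M.T) @ M - np.linalg.inv(H)
```

The first \(r\) entries of `eig_OV` agree with `e`, and the remaining \(q-r\) entries are numerically zero. The matrix `O_alt` agrees with `O`, which ties the quadratic form used here to a definition of \(\mathbf{O}^{W}\) analogous to that of \(\mathbf{O}^{P}\).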
With respect to Theorem 3(iii), partition \(\mathbf{H}^{P}\) into \(\mathbf{H}_{11}^P \), \(\mathbf{H}_{12}^P \), \(\mathbf{H}_{21}^P \), \(\mathbf{H}_{22}^P \) by separating out \({\varvec{\upalpha }}\) from \({\varvec{\upbeta }}\) and \({\varvec{\upxi }}_{+Y} \). Set \(\mathbf{Q}^{P}=(\mathbf{I}_r \quad -\mathbf{H}_{12}^P (\mathbf{H}_{22}^P )^{-1})\) and . The fact that for some \({\varvec{\Gamma }}\) assures that \(-\mathbf{H}^{P11}=[\mathbf{Q}^{P}\mathbf{V}^{P}(\mathbf{Q}^{P})^{\prime }]^{-1}\) (we leave the proof of this equation to the reader). Our formulation of \(Score^{P}\) is a direct application of this equality. Analogous to the proof for Theorem 3\((i)\), we have
Like Formula (11), \({ LR }^{P}\) can be converted to a quadratic form as
Cite this article
Vahl, C.I., Kang, Q. Analysis of an outcome-dependent enriched sample: hypothesis tests. Stat Methods Appl 24, 387–409 (2015). https://doi.org/10.1007/s10260-014-0285-4