Abstract
The data obtained from case–control sampling may suffer from selection or reporting bias, resulting in biased estimation of the parameter(s) of interest by standard analysis of case–control data. In this work, this bias is dealt with by introducing the concept of reporting probability. Then, using a reference sample from the source population, we obtain asymptotically unbiased estimates of the population parameters by fitting a pseudo-likelihood, assuming the exposure distribution in the population to be unknown and arbitrary. The proposed estimates of the model parameters are asymptotically normal and semiparametrically fully efficient. We motivate the need for such methodology by considering the analysis of spontaneous adverse drug reaction (ADR) reports in the presence of reporting bias.
References
Bate, A., Lindquist, M., Edwards, I. R., Olsson, S., Orre, R., Lansner, A., et al. (1998). A Bayesian neural network method for adverse drug reaction signal generation. European Journal of Clinical Pharmacology, 54, 315–321.
Bickel, P. J., Klaassen, C. A., Ritov, Y., Wellner, J. A. (1993). Efficient and adaptive estimation for semiparametric models. Baltimore: Johns Hopkins University Press.
Breslow, N. E. (1996). Statistics in epidemiology: The case–control study. Journal of the American Statistical Association, 91(433), 14–28.
Cosslett, S. R. (1981). Maximum likelihood estimator for choice-based samples. Econometrica, 49, 1289–1316.
Ghosh, P., Dewanji, A. (2011). Analysis of spontaneous adverse drug reaction (ADR) reports using supplementary information. Statistics in Medicine, 30(16), 2040–2055.
Gilbert, P. B., Lele, S. R., Vardi, Y. (1999). Maximum likelihood estimation in semiparametric selection bias models with application to AIDS vaccine trials. Biometrika, 86(1), 27–43.
Hartz, I., Sakshaug, S., Furu, K., Engeland, A., Eggen, A. E., Njolstad, I., et al. (2007). Aspects of statin prescribing in Norwegian counties with high, average and low statin consumption: an individual-level prescription database study. BMC Clinical Pharmacology, 7(14), 1–6.
Hirose, Y. (2005). Efficiency of the semi-parametric maximum likelihood estimator in generalized case–control studies. PhD thesis, University of Auckland.
Hsieh, D. A., Manski, C. F., McFadden, D. (1985). Estimation of response probabilities from augmented retrospective observations. Journal of the American Statistical Association, 80(391), 651–662.
Lee, A., Hirose, Y. (2010). Semi-parametric efficiency bounds for regression models under response-selective sampling: the profile likelihood approach. Annals of the Institute of Statistical Mathematics, 62, 1023–1052.
Lee, A. J., Scott, A. J., Wild, C. J. (2006). Fitting binary regression models with case-augmented samples. Biometrika, 93(2), 385–397.
Mann, R. D. (1998). Prescription-event monitoring—recent progress and future horizons. British Journal of Clinical Pharmacology, 46, 195–201.
Neuhaus, J., Scott, A. J., Wild, C. J. (2002). The analysis of retrospective family studies. Biometrika, 89, 23–37.
Newey, W. K. (1990). Semiparametric efficiency bounds. Journal of Applied Econometrics, 5(2), 99–135.
Prentice, R. L., Breslow, N. E. (1978). Retrospective studies and failure time models. Biometrika, 65(1), 153–158.
Prentice, R. L., Pyke, R. (1979). Logistic disease incidence models and case–control studies. Biometrika, 66(3), 403–411.
Scott, A. J., Wild, C. J. (1997). Fitting regression models to case–control data by maximum likelihood. Biometrika, 84, 57–71.
Scott, A. J., Wild, C. J. (2001). Maximum likelihood for generalised case–control studies. Journal of Statistical Planning and Inference, 96, 3–27.
Wild, C. J. (1991). Fitting prospective regression models to case–control data. Biometrika, 78(4), 705–717.
Appendices
Appendix A: Pseudo-log-likelihood
The ‘pseudo-log-likelihood’ (6) is obtained from the log-likelihood (5). In order to obtain the profile likelihood, as discussed in Sect. 3, the log-likelihood (5) is maximized over \(\varvec{\delta }\) for fixed \(\varvec{\phi }\). Introducing the Lagrange multiplier \(\lambda \) to take care of the constraint \(\sum _{i=1}^{K}\delta _{i}=1\) and equating the derivative of the log-likelihood (5) with respect to \(\delta _{i}\) to zero, we get
Multiplying (10) by \(\delta _{i}\) and summing over \(i\), we have \(\lambda = -n_{J+1}\). Using this value of \(\lambda \) in (10), the expression for \(\delta _{i}\) can be written as
From (11), after setting an offset parameter \(\rho \) as
we have \(\delta _{i} = (n_{J+1,i} + \sum _{j=0}^{J} n_{ji})/(n_{J+1} (1 + \sum _{j=0}^{J}\mathrm{e}^{\rho }\mu _{ji}{p_{ji}}))\), which is substituted in (5) to get the pseudo-log-likelihood (6). Note that the \(\rho \) in (12) satisfies \(\partial l^{*}(\psi )/\partial \rho = 0\), where \( l^{*}(\psi )\) is given by (6).
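As a numerical illustration of this closed form, the sketch below uses invented counts \(n_{ji}\), \(n_{J+1,i}\) and invented values of \(\mu _{ji}p_{ji}\) (none taken from the paper). Since (12) is not reproduced here, the constraint \(\sum _{i}\delta _{i}=1\) is used as a stand-in defining equation for the scalar offset \(\rho \):

```python
import math

# Hypothetical toy setup: K = 3 support points, responses j = 0, 1 (J = 1),
# plus a reference sample (index J+1).  All counts and mu*p values are invented.
n = [[5, 3, 2],           # n_{0i}: controls observed at each support point x_i
     [2, 4, 1]]           # n_{1i}: cases observed at each support point x_i
n_ref = [4, 3, 3]         # n_{J+1,i}: reference-sample counts at x_i
mu_p = [[0.9, 0.8, 0.7],  # mu_{ji} * p_{ji}, treated as known for illustration
        [0.2, 0.3, 0.4]]
n_Jp1 = sum(n_ref)

def delta(rho):
    """Closed-form delta_i from Appendix A for a given offset rho."""
    return [
        (n_ref[i] + sum(nj[i] for nj in n))
        / (n_Jp1 * (1.0 + sum(math.exp(rho) * mu_p[j][i]
                              for j in range(len(n)))))
        for i in range(len(n_ref))
    ]

def constraint(rho):
    return sum(delta(rho)) - 1.0

# sum(delta) is decreasing in rho, so bisection over a wide bracket suffices
lo, hi = -20.0, 20.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if constraint(mid) > 0:
        lo = mid
    else:
        hi = mid
rho_hat = 0.5 * (lo + hi)
print(rho_hat, sum(delta(rho_hat)))  # the sum should be 1 up to tolerance
```

The root-finding is elementary because \(\sum _{i}\delta _{i}\) is monotonically decreasing in \(\rho \); this constraint-based characterization should agree with the definition (12), which also makes \(\partial l^{*}(\psi )/\partial \rho = 0\) hold.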
To justify this offset parameter \(\rho \) being independent of \(j\), consider \(n_{0}/n_{j}\) as a consistent estimator of \(P(R=1, Y=0)/P(R=1, Y=j)\) (Scott and Wild 1997) so that
Note that \(n_{j}/n\) tends to \(\omega _{j}\) in probability and \(n_{0}/n_{j}\) tends to \(\omega _{0}/\omega _{j}\) in probability as \(n \rightarrow \infty \) with \(n = \sum _{l=0}^{J+1}n_{l}\), resulting in
which leads to
the population counterpart of (12). The implicit dependence of \(\rho \) on \(\varvec{\phi }\), written as \(\rho = \rho (\varvec{\phi })\), is clear from the above description.
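Since this population relation, equation (14), is referenced again in Appendix C, it is worth recording its explicit form as quoted there:

```latex
% equation (14): population counterpart of the offset definition (12)
\mathrm{e}^{\rho} \;=\; \frac{\omega_{j}}{\pi_{j}\,\omega_{J+1}},
\qquad j = 0, 1, \ldots, J .
```

The right-hand side is free of \(j\), which is exactly the independence property argued above.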
Appendix B: Asymptotics
The asymptotic properties of the estimator \( \varvec{\hat{\psi }} = (\varvec{\hat{\phi }}, \hat{\rho })\) obtained by maximizing the pseudo log-likelihood (6) are established by considering the multi-sample representation of Hirose (2005) and Lee et al. (2006). Let \(E_{j}\) denote the expectation with respect to the conditional distribution of exposure \(X\), given \(Y=j\), having density \(f_{j}(x,\varvec{\phi },g)=\mu _{j}(x,\varvec{\gamma })p_{j}(x,\varvec{\beta })g(x)/\pi _{j}\) with \(\pi _{j} = \int \mu _{j}(x,\varvec{\gamma })p_{j}(x,\varvec{\beta })g(x) \mathrm{d}x\), for \(j=0,\ldots ,J,\) and \(E_{J+1}\) denote the expectation with respect to the unconditional distribution of \(X\) having density \(f_{J+1}(x,\varvec{\phi },g)=g(x)\).
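The densities \(f_{j}\) and normalizers \(\pi _{j}\) defined here are straightforward to compute when \(X\) is discrete. The sketch below (the exposure distribution, reporting probabilities \(\mu _{j}\) and response probabilities \(p_{j}\) are all invented for illustration) verifies that each \(f_{j}\) is a proper probability distribution:

```python
# Hypothetical discrete exposure distribution g on three support points,
# with invented reporting probabilities mu_j and response probabilities p_j.
g = [0.5, 0.3, 0.2]                      # g(x_i), sums to 1
mu = [[1.0, 1.0, 1.0],                   # mu_0(x_i): controls fully reported
      [0.6, 0.7, 0.8]]                   # mu_1(x_i): cases under-reported
p = [[0.9, 0.8, 0.6],                    # p_0(x_i) = P(Y=0 | x_i)
     [0.1, 0.2, 0.4]]                    # p_1(x_i) = P(Y=1 | x_i)

# pi_j = sum_i mu_j(x_i) p_j(x_i) g(x_i)
pi = [sum(mu[j][i] * p[j][i] * g[i] for i in range(3)) for j in range(2)]

# conditional density of X given Y=j among reported subjects:
# f_j(x_i) = mu_j(x_i) p_j(x_i) g(x_i) / pi_j
f = [[mu[j][i] * p[j][i] * g[i] / pi[j] for i in range(3)] for j in range(2)]

for j in range(2):
    print(j, round(sum(f[j]), 10))       # each f_j sums to 1
```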
As in Lee et al. (2006), the estimating equation from (6), the pseudo log-likelihood equation, can be written as
where
Note that \(x_{ji}\), for \(i=1,\ldots ,n_{j}\), are independent random variables with common density \(f_{j}(x,\varvec{\phi }, g)\), for \(j=0,1,\ldots ,J+1\), as mentioned above. Then, we have
for \(j=0,1,\ldots ,J\), using (12) and (14). Similarly,
Hence,
Now, we use results on estimating functions and asymptotically linear estimators (see Hirose 2005, pp. 72–79). Here, the estimating function is \(\partial Z_{j}(x, \varvec{\psi })/\partial \varvec{\psi }\) with the corresponding asymptotically linear estimator \(\varvec{\hat{\psi }}\). Then, the asymptotic distribution of \(\sqrt{n}(\varvec{\hat{\psi }} - \varvec{\psi })\) is multivariate normal (see Hirose 2005, pp. 67–79) with mean zero and variance–covariance matrix given by
where
In our context, it can be shown that the variance–covariance matrix (19) has the form
where \(H\) is a scalar. The resulting variance–covariance matrix of \(\sqrt{n}(\varvec{\hat{\phi }} - \varvec{\phi })\) is \([I_{\varvec{\phi } \varvec{\phi }} - I_{\varvec{\phi } \rho }I_{\rho \rho }^{-1}I_{\rho \varvec{\phi }}]^{-1}\), where \(I(\varvec{\psi })\) is partitioned as
Note that \(nI(\varvec{\psi })\) can be consistently estimated by \(-\partial ^{2} l^{*}(\varvec{\psi })/\partial \varvec{\psi } \partial \varvec{\psi }^{T}\) evaluated at \(\varvec{\psi } = \varvec{\hat{\psi }}\). From (20), \(\Sigma \) can be written as
To establish (21), we need to show that the second term of (23) is (see Neuhaus et al. 2002)
Note that
for \(j=0,\ldots ,J+1\). Using (16) and (17), the second term of (23) becomes
Using (25), the information matrix can be written as
Now,
using the results,
Since the last column of \(I(\varvec{\psi })\) is \(-E_{X}[\omega _{J+1} T(X) \frac{\partial Z_{J+1} }{\partial \varvec{\psi }}]\), it can be checked that (see Neuhaus et al. 2002)
Note that \(\sum _{j=0}^{J+1}Z_{j} =1\) implies
where \(T(X) = (1 + \sum _{l=0}^{J}\mathrm{e}^{\rho }\mu _{l}(X,\varvec{\gamma })p_{l}(X,\varvec{\beta }))\). Now, we claim that
where \( \sum _{j=0}^{J} \tau _{j}(\varvec{\psi }) = 1\). Using (29) and (30), we have
Hence, using (23), (21) is established.
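The block-inverse identity underlying the variance formula \([I_{\varvec{\phi } \varvec{\phi }} - I_{\varvec{\phi } \rho }I_{\rho \rho }^{-1}I_{\rho \varvec{\phi }}]^{-1}\) stated above can be checked numerically. The sketch below builds an arbitrary symmetric positive-definite matrix standing in for \(I(\varvec{\psi })\) (the entries are invented; only the matrix identity is illustrated) and verifies that inverting the Schur complement reproduces the \(\varvec{\phi }\)-block of the full inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
I_psi = A @ A.T + 4.0 * np.eye(4)    # stand-in for the information matrix I(psi)

p = 3                                # dimension of phi; last row/column is rho
I_pp = I_psi[:p, :p]                 # I_{phi phi}
I_pr = I_psi[:p, p:]                 # I_{phi rho}
I_rp = I_psi[p:, :p]                 # I_{rho phi}
I_rr = I_psi[p:, p:]                 # I_{rho rho}

# variance of sqrt(n)(phi_hat - phi): inverse of the Schur complement
schur = I_pp - I_pr @ np.linalg.inv(I_rr) @ I_rp
V_phi = np.linalg.inv(schur)

# phi-block of the inverse of the full information matrix
V_full = np.linalg.inv(I_psi)[:p, :p]
print(np.allclose(V_phi, V_full))    # True
```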
Appendix C: Semiparametric efficiency
Following Bickel et al. (1993), the asymptotic variance matrix for a regular asymptotically linear (RAL) estimate \(\varvec{\hat{\phi }}\) of \(\varvec{\phi }\) satisfies \(V(\varvec{\hat{\phi }}) \ge B\), where \(B\) is the semiparametric efficiency bound. Lee and Hirose (2010) have obtained this bound \(B\) for the semiparametric maximum likelihood estimate of the parameters in a general regression model when data are collected under a response-selective sampling scheme. In order to apply their results in our context, let us consider the “population expected likelihood” (see also Newey 1990; Lee et al. 2006) as given by
where \(E_{j}\) is the expectation with respect to the conditional distribution of exposure \(X\), given \(Y=j\), having density \(f_{j}(x,\varvec{\phi },g)=\mu _{j}(x,\varvec{\gamma })p_{j}(x,\varvec{\beta })g(x)/\pi _{j}\) with \(\pi _{j} = \int \mu _{j}(x,\varvec{\gamma })p_{j}(x,\varvec{\beta })g(x) \mathrm{d}x\), for \(j=0,\ldots ,J\) and \(E_{J+1}\) is the expectation with respect to the unconditional distribution of \(X\) having density \(f_{J+1}(x,\varvec{\phi },g)=g(x)\), where \(g(x)\) is the density corresponding to the exposure distribution \(G(x)\) of \(X\). Then, the efficient scores are given by
where \(\hat{g}(\varvec{\phi })\) is the maximizer of (32), for fixed \(\varvec{\phi }\). Then, the corresponding efficiency bound \(B\) is given by
To show that the asymptotic variance matrix of \(\varvec{\hat{\phi }}\) is equal to the semiparametric efficiency bound, we need to show that
From (32), the expected log-likelihood
where \(\pi ^{0}_{j}= \int \mu _{j}(x,\varvec{\gamma _{0}})p_{j}(x,\varvec{\beta _{0}})g_{0}(x)\mathrm{d}x\). Retaining only the terms that involve \(g(x)\), (36) can be written as
Now, we need to find the \(\hat{g}\) that maximizes (37). Consider the class of distributions of \(X\) that are discrete with finite support \(\{x_{1},\ldots ,x_{M} \}\). Suppose a general member \(g(\cdot )\) of this class has mass \(g_{i}\) at \(x_{i}\). Note that the true distribution \(g_{0}(\cdot )\) is a member of this class having mass \(g_{0}(x_{i})\), say, at \(x_{i}\). Then, incorporating a Lagrange multiplier \(\lambda \) to take care of the constraint \(\sum _{i=1}^{M}g_{i} =1\), (37) can be written as
where \(\pi _{j}(\varvec{g}) = \sum _{i=1}^{M} \mu _{ji}p_{ji} g_{i}\). Differentiating (38) with respect to \(g_{i}\), we have
Multiplying (39) by \(g_{i}\) and summing over \(i\) gives \(\lambda = -\omega _{J+1}\). Substituting this value of \(\lambda \) into (39), we get the estimate of \(g_{i}\) as
In the case of a general \(g\), not necessarily having finite support, the maximizer of (37) is of the form
(see Lee and Hirose 2010; Lee et al. 2006), where \(\pi _{j}(\rho )\) satisfies \(\mathrm{e}^{\rho } = \omega _{j}/(\pi _{j}(\rho )\omega _{J+1})\) (see (14)) for \(j=0,1,\ldots ,J\) and \(\rho = \rho (\varvec{\phi })\) is the solution of the equations
for \(j=0,1,\ldots ,J\). Substituting \(\hat{g}\) into the densities \(f_{j}(x,\varvec{\phi },g)\), for \( j=0,1,\ldots ,J\), we have
and
where the \(c_{j}\)’s are constants with respect to \(\varvec{\psi }=(\rho ,\varvec{\phi })\). Using (33), the efficient scores are given as
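Equation (41), the form of \(\hat{g}\) for general \(g\), is not displayed above, but carrying the finite-support solution over suggests it. The following reconstruction is an assumption, not a verbatim copy of (41); it is consistent with \(T(X)\) as defined in Appendix B and behaves as consistency requires:

```latex
% Assumed reconstruction of (41), writing
% T(x;\varvec{\psi}) = 1 + \sum_{l=0}^{J} \mathrm{e}^{\rho}
%                        \mu_{l}(x,\varvec{\gamma})\, p_{l}(x,\varvec{\beta}):
\hat{g}(x;\varvec{\phi}) \;=\; g_{0}(x)\,
  \frac{T\bigl(x;\varvec{\psi}_{0}\bigr)}{T\bigl(x;\varvec{\psi}\bigr)} .
```

At \(\varvec{\phi }=\varvec{\phi }_{0}\), with \(\rho =\rho (\varvec{\phi }_{0})\), the two \(T\) factors cancel and \(\hat{g}=g_{0}\).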
where \(q_{J+1}(x,\varvec{\phi },\rho (\varvec{\phi })) = 1 - \sum _{j=0}^{J}q_{j}(x,\varvec{\phi },\rho (\varvec{\phi }))\) and all the derivatives are evaluated at \(\varvec{\phi }=\varvec{\phi }_{0}\). Applying chain rule,
Note that the information matrix (see Lee et al. 2006) is given by
Differentiating (42) under the integral sign, we get
It can be easily checked that
Summing over \(j=0,1,\ldots ,J\) in (49) and using (50), we have
This establishes the semiparametric efficiency of the proposed procedure.
Ghosh, P., Dewanji, A. Regression analysis of biased case–control data. Ann Inst Stat Math 68, 805–825 (2016). https://doi.org/10.1007/s10463-015-0511-3