Abstract
The density function is a fundamental concept in data analysis. When a population consists of heterogeneous subjects, it is often of great interest to estimate the density functions of the subpopulations. If there are no missing values, nonparametric methods such as kernel smoothing may be applied to each subpopulation to estimate its density function. When subpopulation membership is missing, kernel smoothing estimates based only on subjects with observed membership are valid only under the missing completely at random (MCAR) assumption. In this paper, we propose new kernel smoothing estimates of the density functions that apply prediction models for the membership under the missing at random (MAR) assumption. The asymptotic properties of the new estimates are developed, and simulation studies and a real study in mental health illustrate their performance.
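Concretely, the proposed mean-score approach amounts to a weighted kernel density estimate. The sketch below is ours, not code from the paper: it assumes the weighted-ratio form \(\sum_i u_i K_h(t-T_i)/\sum_i u_i\) with weights \(u_i=D_iR_i+\widehat{d}_i(1-R_i)\) (cf. the appendix), where \(\widehat{d}_i\) would come from a fitted prediction model such as logistic regression; all function names are illustrative.

```python
import numpy as np

def gaussian_kernel(z):
    """Standard Gaussian kernel K(z)."""
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

def mean_score_kde(t, T, D, R, d_hat, h):
    """Kernel density estimate at t for the D=1 subpopulation when
    membership D is observed only where R=1.  Missing memberships are
    replaced by predicted probabilities d_hat (e.g. from a logistic
    model fitted on the verified subjects)."""
    u = np.where(R == 1, D, d_hat)        # u_i = D_i R_i + d_hat_i (1 - R_i)
    K = gaussian_kernel((t - T) / h) / h  # K_h(t - T_i)
    return np.sum(u * K) / np.sum(u)      # weighted kernel average
```

When every membership is observed (all \(R_i=1\)), this reduces to the ordinary kernel density estimate over the \(D_i=1\) subjects.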
Acknowledgements
This research was supported in part by NIH Grants R33 DA027521 and R01GM108337. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors would also like to thank Jeffrey M. Lyness, M.D. for providing the data used in Sect. 6.
Appendix
In this appendix, we give proofs of Theorems 1–4.
Proof of Theorem 1
We first show the asymptotic distribution of \(\widetilde{f}_{{\text {MS}}}(t;h)\) in Theorem 1(a). Let \(u_{i}=D_{i}R_{i}+d_{i}(1-R_{i})\) and \(f_{h}(t)=E\left[ K_{h}(t-T_{i})\mid D_{i}=1\right] \), as defined in (3.3). Based on (3.1), we have
For any given h and any point t, as \(n\rightarrow \infty \), the Weak Law of Large Numbers (WLLN) gives
By the Central Limit Theorem (CLT), we have
and
Applying Slutsky’s theorem and accounting for the correlation between (7.3) and (7.4), we have
where \(\sigma _{1}^{2}=\frac{1}{p^{2}}Var\left( {u}_{i}K_{h}(t-T_{i})-{u}_{i}f_{h}(t)\right) \). This proves the asymptotic distribution of \(\widetilde{f}_{{\text {MS}}}(t;h)\) in Theorem 1(a).
Replacing \(u_{i}=D_{i}R_{i}+d_{i}(1-R_{i})\) by \(u_{i}=d_{i}\) and repeating the argument used for Theorem 1(a) proves Theorem 1(b). \(\square \)
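The structure of this argument can be made explicit. The following is our own sketch, assuming \(\widetilde{f}_{{\text {MS}}}(t;h)\) takes the weighted-ratio form implied by (3.1) and by the normalization by \(p=E(u_{i})\) appearing in \(\sigma _{1}^{2}\):

```latex
\[
\sqrt{n}\left[\widetilde{f}_{\mathrm{MS}}(t;h)-f_{h}(t)\right]
  =\frac{\dfrac{1}{\sqrt{n}}\sum_{i=1}^{n}u_{i}\left[K_{h}(t-T_{i})-f_{h}(t)\right]}
        {\dfrac{1}{n}\sum_{i=1}^{n}u_{i}}.
\]
```

By the WLLN the denominator converges in probability to \(p=E(u_{i})\); by the CLT the numerator converges in distribution to \(N\left( 0,Var\left( u_{i}K_{h}(t-T_{i})-u_{i}f_{h}(t)\right) \right) \), since \(E\left[ u_{i}K_{h}(t-T_{i})\right] =pf_{h}(t)\) under MAR. Slutsky’s theorem then yields the limit \(N(0,\sigma _{1}^{2})\).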
Proof of Theorem 2
Let f(t) be the density function for the diseased population, and let \(z_{i}=\left( T_{i}-t\right) /h,\) then
Based on Theorem 1, we have
Since both \(f_{h}(t)\) and f(t) are defined for the diseased population, we have
Combining (7.5) and (7.6), we have \(Bias\left[ \widetilde{f}_{{\text {MS}}}(t;h)\right] =\frac{1}{2}h^{2}\mu _{2}(K)f^{\prime \prime } (t)+o(h^{2}).\)
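The bias step uses the familiar kernel expansion, which we write out here as a sketch, assuming K is a symmetric kernel with \(\int zK(z)\,dz=0\) and finite second moment \(\mu _{2}(K)=\int z^{2}K(z)\,dz\), and that f is twice continuously differentiable at t:

```latex
\[
f_{h}(t)=E\left[K_{h}(t-T_{i})\mid D_{i}=1\right]
        =\int K(z)\,f(t+hz)\,dz
        =f(t)+\tfrac{1}{2}h^{2}\mu_{2}(K)f''(t)+o(h^{2}),
\]
```

using the substitution \(z_{i}=(T_{i}-t)/h\) and a second-order Taylor expansion of f about t. Subtracting f(t) gives the stated bias.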
Next, we derive the variance of \(\widetilde{f}_{MS}(t;h).\) Let \(w(t)=E\big ( \pi _{i}d_{i}+d_{i}^{2}(1-\pi _{i})\mid T_{i}=t\big )\), \(z_{i}=\left( T_{i}-t\right) /h\) and g(t) be the population density function of T. Based on Theorem 1, the asymptotic variance for \(\widetilde{f} _{{\text {MS}}}(t)\) is
Hence, the asymptotic variance of \(\widetilde{f}_{{\text {MS}}}(t)\) is
Let \(w(t)=E\left( d_{i}^{2}\mid T_{i}=t\right) \). Based on Theorem 1 and a similar argument, the asymptotic variance of \(\widetilde{f}_{{\text {BG}}}(t)\) can be derived as follows:
\(\square \)
Proof of Theorem 3
We first show the asymptotic distribution of \(\widehat{f}_{{\text {MS}}}(t;h)\) in Theorem 3(a). Suppose we have a prediction model (3.8) whose parameters are estimated from (3.9). Let \(\widehat{{\beta }}\) be the estimate of \(\beta \). Based on the Taylor expansion of (3.9) at \(\beta \), we have
where \(\mathbf {I}=-E[\frac{\partial \Psi _{i}}{\partial {\beta }^{T}}]\). If \(\widehat{{\beta }}\) is estimated from the score equation, \(\mathbf {I}\) is the Fisher information matrix.
Since
the asymptotic distribution of \(\sqrt{n}\left[ \widetilde{f}_{{\text {MS}}} (t)-f_{h}(t)\right] \) is already given in Theorem 1(a). We will focus on deriving the asymptotic distribution of \(\sqrt{n}\left[ \widehat{f}_{{\text {MS}}}(t)-\widetilde{f}_{{\text {MS}}}(t)\right] \).
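The two steps can be combined into one display; this is only a restatement of the decomposition above together with the Taylor expansion of (3.9):

```latex
\[
\sqrt{n}\left[\widehat{f}_{\mathrm{MS}}(t)-f_{h}(t)\right]
=\underbrace{\sqrt{n}\left[\widehat{f}_{\mathrm{MS}}(t)-\widetilde{f}_{\mathrm{MS}}(t)\right]}
  _{\text{driven by }\sqrt{n}(\widehat{\beta}-\beta)
     =\mathbf{I}^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\Psi_{i}+o_{p}(1)}
+\underbrace{\sqrt{n}\left[\widetilde{f}_{\mathrm{MS}}(t)-f_{h}(t)\right]}
  _{\text{Theorem 1(a)}}.
\]
```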
Let \(\widehat{u}_{i}=D_{i}R_{i}+\widehat{d}_{i}(1-R_{i})=D_{i}R_{i}+g(x_{i};\widehat{{\beta }})(1-R_{i})\). Based on (3.10) and (3.1), by applying the WLLN, we have
The second term
where \(\mathbf {I}=-E[\frac{\partial \Psi _{i}}{\partial {\beta }^{T}}]\).
For the first term, by Slutsky’s theorem,
Thus,
It follows that
where \(\sigma _{3}^{2}=\frac{1}{p^{2}}Var\big ( {u}_{i}K_{h}(t-T_{i} )-f_{h}(t){u}_{i}+\big ( E\left( K_{h}(t-T_{i})\frac{\partial u_{i} }{\partial {\beta }^{T}}\right) -f_{h}(t)E\big ( \frac{\partial u_{i} }{\partial {\beta }^{T}}\big ) \big ) \mathbf {I}^{-1}\Psi _{i}\big ) \). Let \(\mathbf {c}=E\left[ K_{h}(t-T_{i})(1-R_{i})\frac{\partial g_{i} }{\partial {\beta }^{T}}({\beta })\right] \) and \(\mathbf {d} =E\left[ (1-R_{i})\frac{\partial g_{i}}{\partial {\beta }^{T} }({\beta })\right] \). Since \(\frac{\partial u_{i}}{\partial {\beta }^{T}}=(1-R_{i})\frac{\partial g_{i}}{\partial {\beta }^{T} }\), we have \(\sigma _{3}^{2}=\frac{1}{p^{2}}Var\left( {u}_{i}K_{h}(t-T_{i})-f_{h}(t){u}_{i}+\left( \mathbf {c}-f_{h}(t)\mathbf {d}\right) \mathbf {I}^{-1}\Psi _{i}\right) \).
Theorem 3(b) can be proved similarly by replacing \(u_{i}\) by \(d_{i}\) and \(\widehat{u}_{i}\) by \(\widehat{d}_{i}\) in the above arguments. \(\square \)
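A small simulation can illustrate the estimator of Theorem 3 at work. This is our own hedged sketch, not code from the paper: the data-generating design, the logistic prediction model (correctly specified by construction, since the normal-mixture design makes \(\text{logit}\,P(D=1\mid x,T)\) linear in x and T), and all variable names are assumptions made for illustration. Verification R depends on the observed T only, so membership is MAR but not MCAR, and the complete-case kernel estimate is biased while the mean-score estimate is not.

```python
import numpy as np

def expit(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(42)
n = 50_000

# --- simulate data with membership missing at random (MAR) ---
x = rng.normal(size=n)                        # covariate
d_true = rng.binomial(1, expit(x))            # group membership D
t = rng.normal(loc=2 * d_true - 1, scale=1.0) # screening score T (N(1,1) if D=1)
r = rng.binomial(1, expit(2 * t))             # verification R depends on T only: MAR

# --- fit logistic prediction model P(D=1 | x, T) on verified subjects ---
Z = np.column_stack([np.ones(n), x, t])       # design matrix with intercept
obs = r == 1
beta = np.zeros(3)
for _ in range(25):                           # Newton-Raphson iterations
    p = expit(Z[obs] @ beta)
    grad = Z[obs].T @ (d_true[obs] - p)
    hess = (Z[obs] * (p * (1 - p))[:, None]).T @ Z[obs]
    beta = beta + np.linalg.solve(hess, grad)
d_hat = expit(Z @ beta)                       # predicted membership probabilities

def kde(t0, pts, w, h):
    """Weighted Gaussian-kernel density estimate at t0."""
    k = np.exp(-0.5 * ((t0 - pts) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return np.sum(w * k) / np.sum(w)

h, t0 = 0.3, -1.0
true_f = np.exp(-0.5 * (t0 - 1.0) ** 2) / np.sqrt(2 * np.pi)  # N(1,1) density at t0

# complete-case estimate: verified diseased subjects only
cc_mask = obs & (d_true == 1)
cc = kde(t0, t[cc_mask], np.ones(cc_mask.sum()), h)
# mean-score estimate: weights u_i = D_i R_i + d_hat_i (1 - R_i)
u = np.where(obs, d_true, d_hat)
ms = kde(t0, t, u, h)
```

Because verification favors large T, the complete-case estimate underweights the left tail of the diseased density, while the mean-score weights correct for it.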
Proof of Theorem 4
Based on Theorem 3, the asymptotic bias of both \(\widehat{f}_{{\text {MS}}}(t;h)\) and \(\widehat{f}_{{\text {BG}}}(t;h)\) is \(f_{h}(t)-f(t)\). Thus, the bias result follows from the proof of Theorem 2.
The proof for the asymptotic variance also follows similarly to that of Theorem 2:
Let \(w(t)=E\left[ \pi _{i}d_{i}+d_{i}^{2}(1-\pi _{i})\mid T_{i}=t\right] \), \(w_{1}(t)=E\big \{ u_{i}\left( \mathbf {c}-f_{h}(t)\mathbf {d}\right) \mathbf {I}^{-1}\Psi _{i}\mid T_{i}=t\big \} \), and \(w_{2}(t)=E\left[ \left( \left( \mathbf {c}-f_{h}(t)\mathbf {d}\right) \mathbf {I}^{-1}\Psi _{i}\right) ^{2}\mid T_{i}=t\right] \). Based on Theorem 3, the asymptotic variance of \(\widehat{f}_{{\text {MS}}}(t)\) can be derived as follows:
Hence, the asymptotic variance of \(\widehat{f}_{{\text {MS}}}(t)\) is
Similarly, we can derive the asymptotic variance of \(\widehat{f}_{{\text {BG}}}(t)\) in (3.16). \(\square \)
He, H., Wang, W. & Tang, W. Prediction model-based kernel density estimation when group membership is subject to missing. AStA Adv Stat Anal 101, 267–288 (2017). https://doi.org/10.1007/s10182-016-0283-y