Abstract
Evaluating the impact of a non-randomized treatment on health outcomes is difficult in observational studies because covariates may affect both the treatment or exposure received and the outcome of interest. In this study, we develop a semiparametric multiply robust multiple imputation method for estimating average treatment effects in such studies. Our method combines information from multiple propensity score models and outcome regression models, and is multiply robust in that it produces consistent estimators of the average causal effects if at least one of the models is correctly specified. The proposed estimators show promising performance even with incorrect models. Compared with existing fully parametric approaches, our method is more robust to model misspecification. Compared with fully nonparametric approaches, it avoids the curse of dimensionality, achieving dimension reduction by combining information from multiple models. In addition, it is less sensitive to extreme propensity score estimates than inverse propensity score weighted estimators and augmented estimators. We develop the asymptotic properties of our method, and a simulation study shows its advantages over some existing methods in balancing efficiency, bias, and coverage probability. Rubin’s variance estimation formula can be used to estimate the variance of the proposed estimators. Finally, we apply our method to the 2009–2010 National Health and Nutrition Examination Survey to examine the effect of exposure to perfluoroalkyl acids on kidney function.
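As a minimal sketch of the variance formula mentioned in the abstract, assume that each of the L imputed data sets yields a point estimate and a within-imputation variance estimate. Rubin's combining rules then add the average within-imputation variance to an inflated between-imputation variance. The function name `rubin_combine` and its arguments are illustrative and do not come from the article.

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Combine L multiply imputed results with Rubin's rules.

    estimates : point estimates, one per imputed data set
    variances : within-imputation variance estimates, one per imputed data set
    Returns (pooled estimate, total variance).
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    L = len(q)                      # number of imputations
    q_bar = q.mean()                # pooled point estimate
    within = u.mean()               # average within-imputation variance
    between = q.var(ddof=1)         # between-imputation variance
    total = within + (1.0 + 1.0 / L) * between
    return q_bar, total

# Hypothetical inputs, purely for illustration.
pooled, total_var = rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total_var)
```

The factor 1 + 1/L accounts for using a finite number of imputations; it tends to 1 as L grows.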
References
Byers T, Nestle M, McTiernan A, Doyle C, Currie-Williams A, Gansler T, Thun M (2002) American Cancer Society guidelines on nutrition and physical activity for cancer prevention: reducing the risk of cancer with healthy food choices and physical activity. CA A Cancer J Clin 52(2):92–119
Calle EE, Kaaks R (2004) Overweight, obesity and cancer: epidemiological evidence and proposed mechanisms. Nat Rev Cancer 4(8):579–591
Cao W, Tsiatis AA, Davidian M (2009) Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika 96(3):723–734
Chen S, Haziza D (2017) Multiply robust imputation procedures for the treatment of item nonresponse in surveys. Biometrika 104(2):439–453
Chen S, Haziza D (2019) On the nonparametric multiple imputation with multiply robustness. Stat Sin 29:2035–2053
Cohen HW, Hailpern SM, Fang J, Alderman MH (2006) Sodium intake and mortality in the NHANES II follow-up study. Am J Med 119(3):275-e7
De Luna X, Waernbaum I, Richardson TS (2011) Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika 98(4):861–875
Devroye LP, Wagner TJ (1977) The strong uniform consistency of nearest neighbor density estimates. Ann Stat 5(3):536–540
Duan X, Yin G (2017) Ensemble approaches to estimating the population mean with missing response. Scand J Stat 44(4):899–917
French SA, Hennrikus DJ, Jeffery RW (1996) Smoking status, dietary intake, and physical activity in a sample of working adults. Health Psychol 15(6):448–454
Graham JW, Olchowski AE, Gilreath TD (2007) How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prev Sci 8(3):206–213
Han P (2018) Calibration and multiple robustness when data are missing not at random. Stat Sin 28(4):1725–1740
Han P, Wang L (2013) Estimation with missing data: beyond double robustness. Biometrika 100(2):417–430
Healy GN, Matthews CE, Dunstan DW, Winkler EAH, Owen N (2011) Sedentary time and cardio-metabolic biomarkers in US adults: NHANES 2003–06. Eur Heart J 32(5):590–597
Hebert JR, Kabat GC (1990) Differences in dietary intake associated with smoking status. Eur J Clin Nutr 44(3):185–193
Heitjan DF, Little RJ (1991) Multiple imputation for the fatal accident reporting system. J Roy Stat Soc: Ser C (Appl Stat) 40(1):13–29
Hernan M, Robins J (2020) Causal inference: what if. Chapman & Hall/CRC, Boca Raton
Holland PW (1986) Statistics and causal inference. J Am Stat Assoc 81(396):945–960
Hsu CH, Long Q, Li Y, Jacobs E (2014) A nonparametric multiple imputation approach for data with missing covariate values with application to colorectal adenoma data. J Biopharm Stat 24(3):634–648
Kang JD, Schafer JL (2007) Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci 22(4):523–539
Kataria A, Trasande L, Trachtman H (2015) The effects of environmental chemicals on renal function. Nat Rev Nephrol 11(10):610
Kim J, Haziza D (2014) Doubly robust inference with missing data in survey sampling. Stat Sin 24(1):375–394
Levey AS, Stevens LA, Schmid CH, Zhang Y, Castro AF III, Feldman HI, Kusek JW, Eggers P, Van Lente F, Greene T (2009) A new equation to estimate glomerular filtration rate. Ann Intern Med 150(9):604–612
Little RJ, Rubin DB (2019) Statistical analysis with missing data, vol 793. John Wiley & Sons, New York
Long Q, Hsu C, Li Y (2012) Doubly robust nonparametric multiple imputation for ignorable missing data. Stat Sin 22:149–172
Lu CY (2009) Observational studies: a review of study designs, challenges and strategies to reduce confounding. Int J Clin Pract 63(5):691–697
Maura M, Boyle P, La Vecchia C, Decarli A, Talamini R, Franceschi S (1998) Population attributable risk for breast cancer: diet, nutrition, and physical exercise. JNCI J Natl Cancer Inst 90(5):389–394
Nielsen SF (2003) Proper and improper multiple imputation. Int Stat Rev 71(3):593–607
Pearl J (2009) Causal inference in statistics: an overview. Stat Surv 3:96–146
Rubin DB (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 66(5):688–701
Rubin DB (1987) Multiple imputation for nonresponse in surveys. John Wiley & Sons, New York
Rubin DB (1990) Formal mode of statistical inference for causal effects. J Stat Plan Inference 25(3):279–292
Rubin DB (2005) Causal inference using potential outcomes: design, modeling, decisions. J Am Stat Assoc 100(469):322–331
Rubin DB, Schenker N (1991) Multiple imputation in health-care databases: an overview and some applications. Stat Med 10(4):585–598
Schafer JL (1999) Multiple imputation: a primer. Stat Methods Med Res 8(1):3–15
Shankar A, Xiao J, Ducatman A (2011) Perfluoroalkyl chemicals and chronic kidney disease in US adults. Am J Epidemiol 174(8):893–900
Silverman BW (1978) Weak and strong uniform consistency of the kernel estimate of a density and its derivatives. Ann Stat 6(1):177–184
Stone CJ (1977) Consistent nonparametric regression. Ann Stat 5(4):595–620
Van der Vaart AW (2000) Asymptotic statistics, vol 3. Cambridge University Press, Cambridge
Watkins DJ, Josson J, Elston B, Bartell SM, Shin H-M, Vieira VM, Savitz DA, Fletcher T, Wellenius GA (2013) Exposure to perfluoroalkyl acids and markers of kidney function among children and adolescents living near a chemical plant. Environ Health Perspect 121(5):625–630
Zhang S (2019) Multiply robust empirical likelihood inference for missing data and causal inference problems. University of Waterloo, Waterloo
Zhao J, Hinton P, Chen J, Jiang J (2020) Causal inference for the effect of environmental chemicals on chronic kidney disease. Comput Struct Biotechnol J 18:93–99
Acknowledgements
S. Chen was supported by the National Institute on Minority Health and Health Disparities (NIMHD) at National Institutes of Health (NIH) (1R21MD014658-01A1) and the Oklahoma Shared Clinical and Translational Resources (U54GM104938) with an Institutional Development Award (IDeA) from National Institute of General Medical Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The work of D. Haziza was supported by grants from the Natural Sciences and Engineering Research Council of Canada.
Ethics declarations
Conflict of interest statement
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Appendix
A: Regularity conditions
Before sketching the proofs of Theorems 1 and 2, we state the necessary regularity conditions.
Let \(\hat{Z}_{i}^{(g)}=\left( \hat{Z}_{1i}^{(g)},\hat{Z}_{2i}^{(g)}\right) \), where \(\hat{Z}_{1i}^{(g)}=\hat{m}_{g,i}\) and \(\hat{Z}_{2i}^{(g)}=\hat{p}_{g,i}\). Next, let \(\beta _{0g}^{*(k)}\), \(\alpha ^{*(j)}\), and \(\eta _{m_g}^{*}\) be the probability limits of the estimators \(\hat{\beta }_{0g}^{(k)}\), \(\hat{\alpha }^{(j)}\), and \(\hat{\eta }_{m_g}\), respectively.
Denote
with
where
and
Finally, let \(f_g(Z^{*(g)})\) be the density function of \(Z^{*(g)}\). We assume the following regularity conditions necessary for proving Theorem 1 and Theorem 2. Conditions (C1) and (C2) apply to each propensity score model, \(j=1,...,J\), and each pair of outcome regression models, \(k=1,...,K\).
- (C1) \(\hat{\alpha }^{(j)}\) is the unique solution of \(S_p^{(j)}(\alpha ^{(j)})=0\) and \(\hat{\beta }_{0g}^{(k)}\) is the unique solution of \(S_{m_g}^{(k)}(\beta _{0g}^{(k)})=0\), where \(S_p^{(j)}(\alpha ^{(j)})\) and \(S_{m_g}^{(k)}(\beta _{0g}^{(k)})\) are as defined in Sect. 3.
- (C2) \(S_p^{(j)}(\alpha ^{(j)})\) converges almost surely to \(S_p^{*(j)}(\alpha ^{(j)})=E\{S_p^{(j)}(\alpha ^{(j)})\}\), uniformly in \(\alpha ^{(j)}\), and \(S_p^{*(j)}(\alpha ^{(j)})=0\) has a unique solution \(\alpha ^{*(j)}\). Also, \(S_{m_g}^{(k)}(\beta _{0g}^{(k)})\) converges almost surely to \(S_{m_g}^{*(k)}(\beta _{0g}^{(k)})=E\{S_{m_g}^{(k)}(\beta _{0g}^{(k)})\}\), uniformly in \(\beta _{0g}^{(k)}\), and \(S_{m_g}^{*(k)}(\beta _{0g}^{(k)})=0\) has a unique solution \(\beta _{0g}^{*(k)}\).
- (C3) \(E(Y_g^2)<\infty \) and \(E\{\mu _g^{2}(Z^{*(g)})\}<\infty \), where \(\mu _g(Z^{*(g)})=E(Y_g\vert Z^{*(g)})\).
- (C4) \(H/n=o(1)\) and \(\log (n)/H=o(1)\).
- (C5) \(f_g(Z^{*(g)})\) and \(\pi _g(Z^{*(g)})\) are continuous and bounded away from 0 on the compact support of \(Z^{*(g)}\).
Conditions (C1) and (C2) ensure the consistency of \(\hat{\alpha }^{(j)}\) and \(\hat{\beta }_{0g}^{(k)}\); they are satisfied by most linear and generalized linear models. Condition (C3) is used to derive the asymptotic expansion and normality of \(\hat{\tau }_{MRMI}\). Condition (C4) controls the asymptotic order of H. Condition (C5), a common condition in nonparametric statistics, rules out extreme values of the propensity score and of the density.
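To illustrate Condition (C4), the following sketch, which is not part of the original proof, checks numerically that one admissible choice of H makes both ratios shrink as n grows. The choice \(H=\lceil \sqrt{n}\rceil \) and the function name `c4_ratios` are assumptions made here for illustration; the condition itself only requires the two rates.

```python
import math

def c4_ratios(n):
    """Return (H/n, log(n)/H) for the illustrative choice H = ceil(sqrt(n))."""
    H = math.ceil(math.sqrt(n))
    return H / n, math.log(n) / H

# Both ratios should decrease toward 0 as n increases.
for n in (10**2, 10**4, 10**6):
    print(n, c4_ratios(n))
```

By contrast, a fixed H violates \(\log (n)/H=o(1)\), and an H proportional to n violates \(H/n=o(1)\).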
B: Sketched proof of Theorem 1
Let \(\hat{\alpha }=(\hat{\alpha }^{(1)},...,\hat{\alpha }^{(J)})\), \(\hat{\beta }_{0g}=(\hat{\beta }_{0g}^{(1)},...,\hat{\beta }_{0g}^{(K)})\), \(\alpha ^*=(\alpha ^{*(1)},...,\alpha ^{*(J)})\), and \(\beta ^*_{0g}=(\beta _{0g}^{*(1)},...,\beta _{0g}^{*(K)})\). By (C1), (C2), and Van der Vaart (2000), it can be shown that \(\hat{\alpha }\rightarrow ^{p}\alpha ^*\) and \(\hat{\beta }_{0g}\rightarrow ^{p}\beta _{0g}^{*}\).
Assume that one of the pairs of outcome regression models is correctly specified, say \(m_1^{(1)}(X_i;\beta _{01}^{(1)})\) and \(m_0^{(1)}(X_i;\beta _{00}^{(1)})\). Then, we have \(\beta _{0g}^{*(1)}=\beta _{0g}\) and \(\eta _{m_g}^{*}=(1,0,...,0)^\top \), which implies that \(Z_{1i}^{*(g)}=m_g(X_i;\beta _{0g})\).
It follows that
If one of the propensity score models is correctly specified, say \(p_1^{(1)}(X_i;\alpha ^{(1)})\), then \(\alpha ^{*(1)}=\alpha \), \(\eta _p^*=(1,0,...,0)^\top \), and \(Z_{2i}^{*(1)}=p_1(X_i;\alpha )\). Therefore,
due to the fact that \(Y_1\) is independent of T given \(Z_2^{*(1)}\).
According to Devroye and Wagner (1977) and Silverman (1978), it can be shown that
and
uniformly for \(i\in s\) as \(n \rightarrow \infty \), \(L \rightarrow \infty \), and \(H \rightarrow \infty \), under conditions (C4) and (C5). In addition, by Chebyshev’s inequality, (B.1), and (B.2), we have
According to (B.3), we have
as \(n \rightarrow \infty \), \(L \rightarrow \infty \), and \(H \rightarrow \infty \). Therefore, if at least one of the propensity score models or one of the pairs of regression models is correctly specified, according to (B.1) and (B.4), we have
as \(n \rightarrow \infty \), \(L \rightarrow \infty \), and \(H \rightarrow \infty \), where we used the consistent probability weights argument from Stone (1977) in the derivation.
By a similar argument, \(E(\hat{\mu }_{0MRMI}) \rightarrow ^p E(Y_0)\) as \(n \rightarrow \infty \), \(L \rightarrow \infty \), and \(H \rightarrow \infty \).
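Then, because \(\hat{\tau }_{MRMI}=\hat{\mu }_{1MRMI}-\hat{\mu }_{0MRMI}\),
\[E(\hat{\tau }_{MRMI})=E(\hat{\mu }_{1MRMI})-E(\hat{\mu }_{0MRMI})\rightarrow ^p E(Y_1)-E(Y_0)\]
as \(n \rightarrow \infty \), \(L \rightarrow \infty \), and \(H \rightarrow \infty \). Therefore, Theorem 1 is proven.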
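The consistency argument can be illustrated with a small simulation. The sketch below is not the authors' implementation. It assumes Euclidean distance on standardized scores, a uniformly random draw from the H nearest treated donors, a linear outcome model fitted by least squares (correctly specified here), and a deliberately misspecified propensity score. The function name `nn_mi_mean` and the data-generating model are hypothetical. Missing \(Y_1\) values of untreated units are imputed L times and the resulting means are averaged.

```python
import numpy as np

def nn_mi_mean(y, t, z, H=10, L=5, seed=0):
    """Nearest-neighbour multiple imputation estimate of E(Y_1).

    y : outcomes; only entries with t == 1 are used (others may be NaN)
    t : 0/1 treatment indicator
    z : (n, d) array of summary scores used for matching
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=int)
    z = np.asarray(z, dtype=float)
    donors = np.flatnonzero(t == 1)                  # treated units
    zs = (z - z.mean(axis=0)) / z.std(axis=0)        # standardize each score
    # Distance from every unit to every treated donor, shape (n, n_donors).
    dist = np.linalg.norm(zs[:, None, :] - zs[donors][None, :, :], axis=2)
    nearest = donors[np.argsort(dist, axis=1)[:, :H]]  # H nearest donors
    rows = np.arange(len(y))
    means = np.empty(L)
    for l in range(L):
        # Draw one donor at random from each unit's neighbourhood.
        pick = nearest[rows, rng.integers(0, H, size=len(y))]
        y1 = np.where(t == 1, y, y[pick])            # keep observed values
        means[l] = y1.mean()
    return means.mean(), means

# Hypothetical data-generating model with E(Y_1) = 1.
rng = np.random.default_rng(2023)
n = 2000
x = rng.normal(size=(n, 2))
p_true = 1 / (1 + np.exp(-x[:, 0]))                  # true propensity score
t = rng.binomial(1, p_true)
y1_full = 1 + 2 * x[:, 0] + x[:, 1] + rng.normal(size=n)
y_obs = np.where(t == 1, y1_full, np.nan)            # Y_1 seen only if treated

# Score 1: outcome regression fitted on treated units.
design = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(design[t == 1], y_obs[t == 1], rcond=None)
z1 = design @ beta
# Score 2: a misspecified propensity score that ignores the true driver x[:, 0].
z2 = 1 / (1 + np.exp(-x[:, 1]))

mu1_hat, _ = nn_mi_mean(y_obs, t, np.column_stack([z1, z2]), H=10, L=5)
naive = y_obs[t == 1].mean()                         # confounded treated mean
print("imputation estimate:", mu1_hat, "naive treated mean:", naive)
```

In this setup the naive treated mean is pulled upward by confounding, whereas the imputation estimate should land close to 1 because one of the two scores comes from a correctly specified model.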
C: Sketched proof of Theorem 2
We can write \(\hat{\mu }_{1MRMI}\) as
where
and
Then, using a first-order Taylor expansion of \(\hat{\mu }_1\) about \(\theta _1=\theta ^*_1\), we obtain
where \(\partial \hat{\mu }_1^*/\partial \theta _1\) is \(\partial \hat{\mu }_1/\partial \theta _1\) evaluated at \(\theta ^*_1\). Because \(\hat{\theta }_1\) is the solution of the estimating equation \(S_{m_1p}(\theta _1)=0\), one can show that
Plugging (C.4) into (C.3) yields
where \(A_1\) is defined in Theorem 2 of Sect. 4. Following a similar argument to that of Long et al. (2012), given regularity conditions (C1)–(C5), one can show that
where
Because \(\hat{\mu }_{1MRMI}-\hat{\mu }_1\) and \(\hat{\mu }_1\) are asymptotically independent, we have
where \(\bar{Y}_{1R_H^{(1)}(i)}=H^{-1}\sum _{j \in R_H^{(1)}(i)}Y_{1j}\). By the asymptotic independence between \(\hat{\mu }_{1MRMI}-\hat{\mu }_1\) and \(\hat{\mu }_1\), between \(Q_1\) and \(Q_2\), and between \(Q_1\) and \(Q_3\), together with Eqs. (C.1), (C.2), (C.5)–(C.7) and regularity condition (C3), we have
as \(n \rightarrow \infty \) and \(L \rightarrow \infty \).
We may apply a similar argument for \(\hat{\mu }_{0MRMI}\) to obtain
as \(n \rightarrow \infty \) and \(L \rightarrow \infty \).
Then, because \(\hat{\tau }_{MRMI}=\hat{\mu }_{1MRMI}-\hat{\mu }_{0MRMI}\), we have
Then, by the Central Limit Theorem, we obtain
About this article
Cite this article
Gochanour, B., Chen, S., Beebe, L. et al. A semiparametric multiply robust multiple imputation method for causal inference. Metrika 86, 517–542 (2023). https://doi.org/10.1007/s00184-022-00883-0
DOI: https://doi.org/10.1007/s00184-022-00883-0