Abstract
Recently, machine learning (ML) methods have been used in causal inference to estimate treatment effects while reducing concerns about model mis-specification. However, many ML methods require that all confounders be measured to consistently estimate treatment effects. In this paper, we propose a family of ML methods that estimate treatment effects in the presence of cluster-level unmeasured confounders, a type of unmeasured confounder that is shared within each cluster and is common in multilevel observational studies. We show through simulation studies that our proposed methods are robust to bias from unmeasured cluster-level confounders in a variety of multilevel observational settings. Using our methods, we also examine the effect of taking an algebra course on math achievement scores with data from the Early Childhood Longitudinal Study, a multilevel observational educational study. The proposed methods are available in the CURobustML R package.
References
Arkhangelsky, D., & Imbens, G. (2019). The role of the propensity score in fixed effect models. arXiv. Retrieved from arXiv:1807.02099. https://doi.org/10.3386/w24814
Arpino, B., & Cannas, M. (2016). Propensity score matching with clustered data: An application to the estimation of the impact of caesarean section on the Apgar score. Statistics in Medicine, 35(12), 2074–2091. https://doi.org/10.1002/sim.6880
Arpino, B., & Mealli, F. (2011). The specification of the propensity score in multilevel observational studies. Computational Statistics & Data Analysis, 55(4), 1770–1780. https://doi.org/10.1016/j.csda.2010.11.008
Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27), 7353–7360. https://doi.org/10.1073/pnas.1510489113
Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46, 399–424. https://doi.org/10.1080/00273171.2011.568786
Bang, H., & Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4), 962–973. https://doi.org/10.1111/j.1541-0420.2005.00377.x
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01.
Carvalho, C., Feller, A., Murray, J., Woody, S., & Yeager, D. (2019). Assessing treatment effect variation in observational studies: Results from a data challenge. Observational Studies, 5, 21–35. https://doi.org/10.1353/obs.2019.0000
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. https://doi.org/10.1111/ectj.12097
Ding, P., Feller, A., & Miratrix, L. (2019). Decomposing treatment effect variation. Journal of the American Statistical Association, 114(525), 304–317. https://doi.org/10.1080/01621459.2017.1407322
Dorie, V., & Hill, J. (2019). bartCause: Causal inference using Bayesian additive regression trees [Computer software manual]. Retrieved from https://github.com/vdorie/bartCause (R package version 1.0-0)
Dorie, V., Hill, J., Shalit, U., Scott, M., & Cervone, D. (2019). Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, 34(1), 43–68. https://doi.org/10.1214/18-STS667
Evdokimov, K. (2010). Identification and estimation of a nonparametric panel data model with unobserved heterogeneity. Working paper, Princeton University.
Firebaugh, G., Warner, C., & Massoglia, M. (2013). Fixed effects, random effects, and hybrid models for causal analysis. In S. L. Morgan (Ed.), Handbook of causal analysis for social research (pp. 113–132). Springer. https://doi.org/10.1007/978-94-007-6094-3_7.
Glynn, A. N., & Quinn, K. M. (2010). An introduction to the augmented inverse propensity weighted estimator. Political Analysis, 18(1), 36–56. https://doi.org/10.1093/pan/mpp036
Gruber, S., & van der Laan, M. J. (2012). tmle: An R package for targeted maximum likelihood estimation. Journal of Statistical Software, 51(13), 1–35. Retrieved from http://www.jstatsoft.org/v51/i13/. https://doi.org/10.18637/jss.v051.i13
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 50(4), 1029–1054. https://doi.org/10.2307/1912775
He, Z. (2018). Inverse conditional probability weighting with clustered data in causal inference. arXiv. Retrieved from arXiv:1808.01647
Henderson, D. J., Carroll, R. J., & Li, Q. (2008). Nonparametric estimation and testing of fixed effects panel data models. Journal of Econometrics, 144(1), 257–275. https://doi.org/10.1016/j.jeconom.2008.01.005
Hernán, M. A., & Robins, J. M. (2020). Causal inference: What if. Boca Raton, FL: Chapman & Hall/CRC.
Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1), 217–240. https://doi.org/10.1198/jcgs.2010.08162
Holloway, J. H. (2004). Closing the minority achievement gap in math. Educational Leadership, 61(5), 84.
Hong, G., & Hong, Y. (2009). Reading instruction time and homogeneous grouping in kindergarten: An application of marginal mean weighting through stratification. Educational Evaluation and Policy Analysis, 31(1), 54–81. https://doi.org/10.3102/0162373708328259
Hong, G., & Raudenbush, S. W. (2006). Evaluating kindergarten retention policy: A case study of causal inference for multilevel observational data. Journal of the American Statistical Association, 101, 901–910. https://doi.org/10.1198/016214506000000447
Hong, G., & Raudenbush, S. W. (2013). Heterogeneous agents, social interactions, and causal inference. In S. L. Morgan (Ed.), Handbook of causal analysis for social research (pp. 331–352). Springer. https://doi.org/10.1007/978-94-007-6094-3_16
Imai, K., & Kim, I. S. (2019). When should we use unit fixed effects regression models for causal inference with longitudinal data? American Journal of Political Science, 63(2), 467–490. https://doi.org/10.1111/ajps.12417
Kim, J.-S., & Frees, E. W. (2006). Omitted variables in multilevel models. Psychometrika, 71(4), 659. https://doi.org/10.1007/s11336-005-1283-0
Kim, J.-S., & Frees, E. W. (2007). Multilevel modeling with correlated effects. Psychometrika, 72(4), 505–533. https://doi.org/10.1007/s11336-007-9008-1
Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10), 4156–4165. https://doi.org/10.1073/pnas.1804597116
LeDell, E., Gill, N., Aiello, S., Fu, A., Candel, A., Click, C., . . . Malohlava, M. (2020). h2o: R interface for the ’h2o’ scalable machine learning platform [Computer software manual]. Retrieved from https://github.com/h2oai/h2o-3 (R package version 3.30.1.1)
Lee, Y., Nguyen, T. Q., & Stuart, E. A. (2019). Partially pooled propensity score models for average treatment effect estimation with multilevel data. arXiv. Retrieved from arXiv:1910.05600
Li, F., Zaslavsky, A. M., & Landrum, M. B. (2013). Propensity score weighting with multilevel data. Statistics in Medicine, 32(19), 3373–3387. https://doi.org/10.1002/sim.5786
Li, Y., Lee, Y., Port, F. K., & Robinson, B. M. (2020). The impact of unmeasured within-and between-cluster confounding on the bias of effect estimators of a continuous exposure. Statistical Methods in Medical Research, 29(8), 2119–2139. https://doi.org/10.1177/0962280219883323
Lin, Z., Li, Q., & Sun, Y. (2014). A consistent nonparametric test of parametric regression functional form in fixed effects panel data models. Journal of Econometrics, 178, 167–179. https://doi.org/10.1016/j.jeconom.2013.08.014
McCaffrey, D. F., Ridgeway, G., & Morral, A. R. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods, 9(4), 403–425. https://doi.org/10.1037/1082-989X.9.4.403
Meyers, J. L., & Beretvas, S. N. (2006). The impact of inappropriate modeling of cross-classified data structures. Multivariate Behavioral Research, 41(4), 473–497. https://doi.org/10.1207/s15327906mbr4104_3
Mullen, K. M., & van Stokkum, I. H. M. (2012). nnls: The Lawson-Hanson algorithm for non-negative least squares (NNLS) [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=nnls (R package version 1.4)
Neyman, J. S. (1923). On the application of probability theory to agricultural experiments: Essay on principles. Section 9 (with discussion). Statistical Science, 4, 465–480.
Noguera, P. A., & Wing, J. Y. (2008). Unfinished business: Closing the racial achievement gap in our schools. Wiley.
Polley, E. C., & van der Laan, M. J. (2010). Super learner in prediction. U.C. Berkeley Division of Biostatistics Working Paper Series. Paper 226. https://doi.org/10.1007/978-1-4419-9782-1_3
R Core Team. (2020). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved from https://www.R-project.org/
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (Vol. 1). Sage.
Rickles, J. H. (2013). Examining heterogeneity in the effect of taking algebra in eighth grade. The Journal of Educational Research, 106(4), 251–268. https://doi.org/10.1080/00220671.2012.692731
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55. https://doi.org/10.1093/biomet/70.1.41
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688–701. https://doi.org/10.1037/h0037350
Rubin, D. B. (1986). Comment: Which ifs have causal answers. Journal of the American Statistical Association, 81(396), 961–962. https://doi.org/10.2307/2289065
Rubin, D. B. (2001). Using propensity scores to help design observational studies: Application to the tobacco litigation. Health Services and Outcomes Research Methodology, 2(3–4), 169–188.
Schafer, J. L., & Kang, J. (2008). Average causal effects from nonrandomized studies: A practical guide and simulated example. Psychological Methods, 13(4), 279–313. https://doi.org/10.1037/a0014268
Semenova, V., & Chernozhukov, V. (2020). Estimation and inference about conditional average treatment effect and other structural functions. arXiv. Retrieved from arXiv:1702.06240
Shadish, W. R., Clark, M. H., & Steiner, P. M. (2008). Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random and nonrandom assignments. Journal of the American Statistical Association, 103(484), 1334–1344. https://doi.org/10.1198/016214508000000733
Steiner, P. M., & Cook, D. (2013). Matching and propensity scores. In T. Little (Ed.), The Oxford handbook of quantitative methods (pp. 236–258). New York, NY: Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199934874.013.0013
Steiner, P. M., Cook, T. D., Shadish, W. R., & Clark, M. H. (2010). The importance of covariate selection in controlling for selection bias in observational studies. Psychological Methods, 15(3), 250. https://doi.org/10.1037/a0018719
Su, X., Tsai, C.-L., Wang, H., Nickerson, D. M., & Li, B. (2009). Subgroup analysis via recursive partitioning. Journal of Machine Learning Research, 10(2), 141–158. https://doi.org/10.2139/ssrn.1341380
Suk, Y., Kang, H., & Kim, J.-S. (2020). Random forests approach for causal inference with clustered observational data. Multivariate Behavioral Research. https://doi.org/10.1080/00273171.2020.1808437
Sun, Y., Carroll, R. J., & Li, D. (2009). Semiparametric estimation of fixed-effects panel data varying coefficient models. In Q. Li & J. S. Racine (Eds.), Nonparametric econometric methods (pp. 101–130). Emerald Group Publishing Limited. https://doi.org/10.1108/S0731-9053(2009)0000025006
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67. https://doi.org/10.18637/jss.v045.i03.
van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1). https://doi.org/10.2202/1544-6115.1309.
van der Laan, M. J., & Rose, S. (2011). Targeted learning: Causal inference for observational and experimental data. Springer.
Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228–1242. https://doi.org/10.1080/01621459.2017.1319839
Walston, J., & McCarroll, J. C. (2010). Eighth-grade algebra: Findings from the eighth-grade round of the Early Childhood Longitudinal Study, Kindergarten Class of 1998–99 (ECLS-K). Statistics in Brief. NCES 2010–016. National Center for Education Statistics.
Wenglinsky, H. (2004). Closing the racial achievement gap: The role of reforming instructional practices. Education Policy Analysis Archives, 12, 64. https://doi.org/10.14507/epaa.v12n64.2004.
Westreich, D., Lessler, J., & Funk, M. J. (2010). Propensity score estimation: neural networks, support vector machines, decision trees (cart), and meta-classifiers as alternatives to logistic regression. Journal of Clinical Epidemiology, 63(8), 826–833. https://doi.org/10.1016/j.jclinepi.2009.11.020
White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399. https://doi.org/10.1002/sim.4067
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. The MIT Press.
Wooldridge, J. M. (2012). Introductory econometrics: A modern approach. South-Western Cengage Learning.
Yang, S. (2018). Propensity score weighting for causal inference with clustered data. Journal of Causal Inference, 6(2). https://doi.org/10.1515/jci-2017-0027.
Zetterqvist, J., & Sjölander, A. (2015). Doubly robust estimation with the R package drgee. Epidemiologic Methods, 4(1), 69–86. https://doi.org/10.1515/em-2014-0021
Zetterqvist, J., Vansteelandt, S., Pawitan, Y., & Sjölander, A. (2016). Doubly robust methods for handling confounding by cluster. Biostatistics, 17(2), 264–276. https://doi.org/10.1093/biostatistics/kxv041
Appendices
A. Ensemble Learning Algorithms
We provide additional details on the learning algorithms used in our three approaches. The first and third approaches, denoted as PR and DDPR in the main paper, use the same learning algorithms, which are detailed below as Algorithms 2 and 3.
For the second approach, denoted as DD in the main paper, we estimate \(g_x^*\) and \(h_x^*\) using Algorithm 4 and Algorithm 5, respectively. Here, we fit a single generalized linear model, following the heuristic suggested in the main text of including both the demeaned variables \({\mathbf {X}}_{ij}^*\) and the cluster means \(\overline{{\mathbf {X}}}_{j}\) as predictors.
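As an illustration of this demean-plus-cluster-mean heuristic, the sketch below constructs both sets of variables from a toy two-level data set. It is written in Python purely for illustration (the CURobustML package itself is in R), and the function and column names are hypothetical, not part of the package.

```python
import pandas as pd

def add_demeaned_and_means(df, cluster_col, covariates):
    """For each covariate X, add its cluster mean (X_bar, i.e. X-bar_j)
    and its cluster-demeaned version (X_star, i.e. X*_ij = X_ij - X-bar_j)."""
    out = df.copy()
    for x in covariates:
        means = out.groupby(cluster_col)[x].transform("mean")
        out[f"{x}_bar"] = means           # cluster mean of X within cluster j
        out[f"{x}_star"] = out[x] - means  # within-cluster deviation
    return out

# Toy data: two students in each of two schools (clusters)
df = pd.DataFrame({
    "school": [1, 1, 2, 2],
    "ses":    [0.2, 0.6, 1.0, 1.4],
})
aug = add_demeaned_and_means(df, "school", ["ses"])
```

Both `ses_star` and `ses_bar` would then enter the single generalized linear model as predictors alongside the treatment indicator.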
B. Simulation Results for BART and TMLE with Cluster Dummies
In this section, we present additional simulations with BART and TMLE under Designs 1 and 4. For Design 1, we included cluster dummies (denoted as id) as additional predictors in BART and TMLE. For Design 4, we included two different sets of cluster indicators in BART and TMLE: (i) factor-1-by-factor-2 dummies (denoted as f12id) and (ii) factor-1 dummies and factor-2 dummies (denoted as f1id and f2id, respectively).
Figures 6 and 7 summarize the results under Design 1 and Design 4, respectively. From Fig. 6, we observed that including an additional set of cluster dummies did not always reduce bias and RMSE, and both BART and TMLE with cluster dummies (i.e., BART+id, TMLE+id) still yielded higher bias and RMSE for \(\varvec{\beta }\) and \(\tau \) than our proposed estimators. Similarly, as shown in Fig. 7, including either set of cluster indicators in BART or TMLE did not always remove additional biases from the effect estimates, and the biases remained worse than those of our proposed estimators.
C. Convergence Properties of Proposed Methods Under Design 1
Figure 8 shows the RMSE of our proposed estimators under Design 1 as we increase the cluster size and vary the outcome and propensity score model specifications. Overall, the RMSEs of all three estimators decrease as the cluster size increases so long as either the outcome model or the propensity score model is correct, numerically illustrating the double robustness of our proposed estimators.
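For reference, this double robustness is the familiar property of augmented inverse-probability-weighted (AIPW) estimators (e.g., Bang & Robins, 2005). In generic notation (illustrative, not the exact estimator of the main text), with outcome regressions \(\hat{h}_1, \hat{h}_0\) and propensity score \(\hat{g}\), the ATE estimate takes the form

```latex
\hat{\tau} = \frac{1}{n}\sum_{i,j}\Bigg[
  \hat{h}_1(\mathbf{X}_{ij}) - \hat{h}_0(\mathbf{X}_{ij})
  + \frac{T_{ij}\,\big(Y_{ij}-\hat{h}_1(\mathbf{X}_{ij})\big)}{\hat{g}(\mathbf{X}_{ij})}
  - \frac{(1-T_{ij})\,\big(Y_{ij}-\hat{h}_0(\mathbf{X}_{ij})\big)}{1-\hat{g}(\mathbf{X}_{ij})}
\Bigg],
```

which remains consistent if either the outcome regressions or the propensity score model is consistently estimated, since the augmentation terms have mean zero whenever one of the two nuisance models is correct.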
D. Simulation Results for Designs 2, 3, and 4
In the following subsections, we provide Tables 6, 7, 8, 9, 10, 11, 12 and 13. Tables 6, 7, 8, 9, 10 and 11 show the performance of our estimators under different choices of estimator for the outcome model and the propensity score.
Tables 12 and 13 replicate Table 2 for Design 2 and Design 4, respectively, in order to evaluate the robustness of our methods when the propensity score model, the outcome model, or both are mis-specified. Overall, as long as either the treatment model or the outcome model was correctly specified, the estimates of \(\varvec{\beta }\) and \(\tau \) from our proposed estimators had almost no bias. In contrast, when both the treatment and outcome models were mis-specified, all the estimators were biased. We also observed that mis-specifying the outcome model led to estimates with higher variance than mis-specifying the treatment model.
1.1 D.1. Design 2: Two-level Data, Cross-level Interaction, and Large Cluster Size
1.2 D.2. Design 3: Two-level Data, No Cross-level Interaction, and Small Cluster Size
1.3 D.3. Design 4: Cross-classified Data, No Cross-level Interaction, and Large Cluster Size
E. Covariate Balance
We checked covariate balance using the absolute standardized mean differences and variance ratios between the treatment and control groups. As a rule of thumb, covariate balance is considered acceptable if the mean difference for each covariate is less than 0.1 standard deviations and the variance ratio lies between 4/5 and 5/4 (Rubin, 2001; Shadish, Clark, & Steiner, 2008; Steiner, Cook, Shadish, & Clark, 2010). We observed initial imbalance in the covariates. After applying marginal mean weighting through stratification (MMW-S) (Hong & Hong, 2009) with the propensity scores obtained from the PR estimator, we achieved acceptable covariate balance between the treated and untreated groups (Fig. 9).
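The rule of thumb above is easy to compute directly. The sketch below (in Python with hypothetical toy data; the analysis in the paper was done in R) returns the absolute standardized mean difference, using the pooled standard deviation, and the variance ratio for a single covariate:

```python
import numpy as np

def balance_stats(x, t):
    """Absolute standardized mean difference and variance ratio
    between treated (t == 1) and control (t == 0) for covariate x."""
    x, t = np.asarray(x, float), np.asarray(t)
    xt, xc = x[t == 1], x[t == 0]
    # Pooled SD across groups, a common choice for standardized differences
    pooled_sd = np.sqrt((xt.var(ddof=1) + xc.var(ddof=1)) / 2)
    smd = abs(xt.mean() - xc.mean()) / pooled_sd
    vr = xt.var(ddof=1) / xc.var(ddof=1)
    return smd, vr

# Hypothetical covariate values for 4 treated and 4 control units
x = np.array([1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5])
t = np.array([1, 1, 1, 1, 0, 0, 0, 0])
smd, vr = balance_stats(x, t)
# Apply the rule of thumb: SMD < 0.1 and 4/5 < variance ratio < 5/4
balanced = (smd < 0.1) and (4 / 5 < vr < 5 / 4)
```

In practice, these statistics would be computed for every covariate, weighted by the MMW-S weights, before and after weighting.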
Cite this article
Suk, Y., Kang, H. Robust Machine Learning for Treatment Effects in Multilevel Observational Studies Under Cluster-level Unmeasured Confounding. Psychometrika 87, 310–343 (2022). https://doi.org/10.1007/s11336-021-09805-x