Robust Machine Learning for Treatment Effects in Multilevel Observational Studies Under Cluster-level Unmeasured Confounding

Suk, Youmi; Kang, Hyunseung

doi:10.1007/s11336-021-09805-x

Theory and Methods
Published: 15 October 2021

Volume 87, pages 310–343, (2022)
Psychometrika

Abstract

Recently, machine learning (ML) methods have been used in causal inference to estimate treatment effects in order to reduce concerns about model mis-specification. However, many ML methods require that all confounders be measured to consistently estimate treatment effects. In this paper, we propose a family of ML methods that estimate treatment effects in the presence of cluster-level unmeasured confounders, a type of unmeasured confounder that is shared within each cluster and is common in multilevel observational studies. We show through simulation studies that our proposed methods are robust to biases from unmeasured cluster-level confounders in a variety of multilevel observational studies. We also examine the effect of taking an algebra course on math achievement scores from the Early Childhood Longitudinal Study, a multilevel observational educational study, using our methods. The proposed methods are available in the CURobustML R package.

References

Arkhangelsky, D., & Imbens, G. (2019). The role of the propensity score in fixed effect models. arXiv. Retrieved from arxiv: 1807.02099. https://doi.org/10.3386/w24814
Arpino, B., & Cannas, M. (2016). Propensity score matching with clustered data. An application to the estimation of the impact of caesarean section on the Apgar score. Statistics in Medicine, 35(12), 2074–2091. https://doi.org/10.1002/sim.6880
Arpino, B., & Mealli, F. (2011). The specification of the propensity score in multilevel observational studies. Computational Statistics & Data Analysis, 55(4), 1770–1780. https://doi.org/10.1016/j.csda.2010.11.008
Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27), 7353–7360. https://doi.org/10.1073/pnas.1510489113
Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46, 399–424. https://doi.org/10.1080/00273171.2011.568786
Bang, H., & Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4), 962–973. https://doi.org/10.1111/j.1541-0420.2005.00377.x
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01.
Carvalho, C., Feller, A., Murray, J., Woody, S., & Yeager, D. (2019). Assessing treatment effect variation in observational studies: Results from a data challenge. Observational Studies, 5, 21–35. https://doi.org/10.1353/obs.2019.0000
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. https://doi.org/10.1111/ectj.12097
Ding, P., Feller, A., & Miratrix, L. (2019). Decomposing treatment effect variation. Journal of the American Statistical Association, 114(525), 304–317. https://doi.org/10.1080/01621459.2017.1407322
Dorie, V., & Hill, J. (2019). bartCause: Causal inference using Bayesian additive regression trees [Computer software manual]. Retrieved from https://github.com/vdorie/bartCause (R package version 1.0-0)
Dorie, V., Hill, J., Shalit, U., Scott, M., & Cervone, D. (2019). Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, 34(1), 43–68. https://doi.org/10.1214/18-STS667
Evdokimov, K. (2010). Identification and estimation of a nonparametric panel data model with unobserved heterogeneity. Working paper, Princeton University.
Firebaugh, G., Warner, C., & Massoglia, M. (2013). Fixed effects, random effects, and hybrid models for causal analysis. In S. L. Morgan (Ed.), Handbook of causal analysis for social research (pp. 113–132). Springer. https://doi.org/10.1007/978-94-007-6094-3_7.
Glynn, A. N., & Quinn, K. M. (2010). An introduction to the augmented inverse propensity weighted estimator. Political Analysis, 18(1), 36–56. https://doi.org/10.1093/pan/mpp036
Gruber, S., & van der Laan, M. J. (2012). tmle: An R package for targeted maximum likelihood estimation. Journal of Statistical Software, 51(13), 1–35. Retrieved from http://www.jstatsoft.org/v51/i13/. https://doi.org/10.18637/jss.v051.i13
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, pp. 1029–1054. https://doi.org/10.2307/1912775
He, Z. (2018). Inverse conditional probability weighting with clustered data in causal inference. arXiv. Retrieved from arxiv: 1808.01647
Henderson, D. J., Carroll, R. J., & Li, Q. (2008). Nonparametric estimation and testing of fixed effects panel data models. Journal of Econometrics, 144(1), 257–275. https://doi.org/10.1016/j.jeconom.2008.01.005
Hernan, M. A., & Robins, J. M. (2020). Causal inference: What if. Boca Raton: Chapman & Hall/CRC.
Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1), 217–240. https://doi.org/10.1198/jcgs.2010.08162
Holloway, J. H. (2004). Closing the minority achievement gap in math. Educational Leadership, 61(5), 84.
Hong, G., & Hong, Y. (2009). Reading instruction time and homogeneous grouping in kindergarten: An application of marginal mean weighting through stratification. Educational Evaluation and Policy Analysis, 31(1), 54–81. https://doi.org/10.3102/0162373708328259
Hong, G., & Raudenbush, S. W. (2006). Evaluating kindergarten retention policy: A case study of causal inference for multilevel observational data. Journal of the American Statistical Association, 101, 901–910. https://doi.org/10.1198/016214506000000447
Hong, G., & Raudenbush, S. W. (2013). Heterogeneous agents, social interactions, and causal inference. In S. L. Morgan (Ed.), Handbook of causal analysis for social research (pp. 331–352). Springer. https://doi.org/10.1007/978-94-007-6094-3_16
Imai, K., & Kim, I. S. (2019). When should we use unit fixed effects regression models for causal inference with longitudinal data? American Journal of Political Science, 63(2), 467–490. https://doi.org/10.1111/ajps.12417
Kim, J.-S., & Frees, E. W. (2006). Omitted variables in multilevel models. Psychometrika, 71(4), 659. https://doi.org/10.1007/s11336-005-1283-0
Kim, J.-S., & Frees, E. W. (2007). Multilevel modeling with correlated effects. Psychometrika, 72(4), 505–533. https://doi.org/10.1007/s11336-007-9008-1
Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10), 4156–4165. https://doi.org/10.1073/pnas.1804597116
LeDell, E., Gill, N., Aiello, S., Fu, A., Candel, A., Click, C., . . . Malohlava, M. (2020). h2o: R interface for the ’h2o’ scalable machine learning platform [Computer software manual]. Retrieved from https://github.com/h2oai/h2o-3 (R package version 3.30.1.1)
Lee, Y., Nguyen, T. Q., & Stuart, E. A. (2019). Partially pooled propensity score models for average treatment effect estimation with multilevel data. arXiv Retrieved from arxiv:1910.05600
Li, F., Zaslavsky, A. M., & Landrum, M. B. (2013). Propensity score weighting with multilevel data. Statistics in Medicine, 32(19), 3373–3387. https://doi.org/10.1002/sim.5786
Li, Y., Lee, Y., Port, F. K., & Robinson, B. M. (2020). The impact of unmeasured within-and between-cluster confounding on the bias of effect estimators of a continuous exposure. Statistical Methods in Medical Research, 29(8), 2119–2139. https://doi.org/10.1177/0962280219883323
Lin, Z., Li, Q., & Sun, Y. (2014). A consistent nonparametric test of parametric regression functional form in fixed effects panel data models. Journal of Econometrics, 178, 167–179. https://doi.org/10.1016/j.jeconom.2013.08.014
McCaffrey, D. F., Ridgeway, G., & Morral, A. R. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods, 9(4), 403–425. https://doi.org/10.1037/1082-989X.9.4.403
Meyers, J. L., & Beretvas, S. N. (2006). The impact of inappropriate modeling of cross-classified data structures. Multivariate Behavioral Research, 41(4), 473–497. https://doi.org/10.1207/s15327906mbr4104_3
Mullen, K. M., & van Stokkum, I. H. M. (2012). nnls: The Lawson-Hanson algorithm for non-negative least squares (NNLS) [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=nnls (R package version 1.4)
Neyman, J. S. (1923). On the application of probability theory to agricultural experiments: Essay on principles. Section 9 (with discussion). Statistical Science, 4, 465–480.
Noguera, P. A., & Wing, J. Y. (2008). Unfinished business: Closing the racial achievement gap in our schools. Wiley.
Polley, E. C., & van der Laan, M. J. (2010). Super learner in prediction. U.C. Berkeley Division of Biostatistics Working Paper Series. Paper 226. https://doi.org/10.1007/978-1-4419-9782-1_3
R Core Team. (2020). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved from https://www.R-project.org/
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (Vol. 1). Sage.
Rickles, J. H. (2013). Examining heterogeneity in the effect of taking algebra in eighth grade. The Journal of Educational Research, 106(4), 251–268. https://doi.org/10.1080/00220671.2012.692731
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55. https://doi.org/10.1093/biomet/70.1.41
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688–701. https://doi.org/10.1037/h0037350
Rubin, D. B. (1986). Comment: Which ifs have causal answers. Journal of the American Statistical Association, 81(396), 961–962. https://doi.org/10.2307/2289065
Rubin, D. B. (2001). Using propensity scores to help design observational studies: Application to the tobacco litigation. Health Services and Outcomes Research Methodology, 2(3–4), 169–188.
Schafer, J. L., & Kang, J. (2008). Average causal effects from nonrandomized studies: A practical guide and simulated example. Psychological Methods, 13(4), 279–313. https://doi.org/10.1037/a0014268
Semenova, V., & Chernozhukov, V. (2020). Estimation and inference about conditional average treatment effect and other structural functions. arXiv Retrieved from arxiv: 1702.06240
Shadish, W. R., Clark, M. H., & Steiner, P. M. (2008). Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random and nonrandom assignments. Journal of the American Statistical Association, 103(484), 1334–1344. https://doi.org/10.1198/016214508000000733
Steiner, P. M., & Cook, D. (2013). Matching and propensity scores. In T. Little (Ed.), The Oxford handbook of quantitative methods (pp. 236–258). New York, NY: Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199934874.013.0013
Steiner, P. M., Cook, T. D., Shadish, W. R., & Clark, M. H. (2010). The importance of covariate selection in controlling for selection bias in observational studies. Psychological Methods, 15(3), 250. https://doi.org/10.1037/a0018719
Su, X., Tsai, C.-L., Wang, H., Nickerson, D. M., & Li, B. (2009). Subgroup analysis via recursive partitioning. Journal of Machine Learning Research, 10(2), 141–158. https://doi.org/10.2139/ssrn.1341380
Suk, Y., Kang, H., & Kim, J.-S. (2020). Random forests approach for causal inference with clustered observational data. Multivariate Behavioral Research. https://doi.org/10.1080/00273171.2020.1808437
Sun, Y., Carroll, R. J., & Li, D. (2009). Semiparametric estimation of fixed-effects panel data varying coefficient models. In Q. Li & J. S. Racine (Eds.), Nonparametric econometric methods (pp. 101–130). Emerald Group Publishing Limited. https://doi.org/10.1108/S0731-9053(2009)0000025006
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67. https://doi.org/10.18637/jss.v045.i03.
van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1). https://doi.org/10.2202/1544-6115.1309.
van der Laan, M. J., & Rose, S. (2011). Targeted learning: Causal inference for observational and experimental data. Springer.
Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228–1242. https://doi.org/10.1080/01621459.2017.1319839
Walston, J., & McCarroll, J. C. (2010). Eighth-grade algebra: Findings from the eighth-grade round of the Early Childhood Longitudinal Study, Kindergarten Class of 1998–99 (ECLS-K). Statistics in brief. NCES 2010–016. National Center for Education Statistics.
Wenglinsky, H. (2004). Closing the racial achievement gap: The role of reforming instructional practices. Education Policy Analysis Archives, 12, 64. https://doi.org/10.14507/epaa.v12n64.2004.
Westreich, D., Lessler, J., & Funk, M. J. (2010). Propensity score estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. Journal of Clinical Epidemiology, 63(8), 826–833. https://doi.org/10.1016/j.jclinepi.2009.11.020
White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399. https://doi.org/10.1002/sim.4067
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. MIT Press.
Wooldridge, J. M. (2012). Introductory econometrics: A modern approach. South-Western Cengage Learning.
Yang, S. (2018). Propensity score weighting for causal inference with clustered data. Journal of Causal Inference, 6(2). https://doi.org/10.1515/jci-2017-0027.
Zetterqvist, J., & Sjölander, A. (2015). Doubly robust estimation with the R package drgee. Epidemiologic Methods, 4(1), 69–86. https://doi.org/10.1515/em-2014-0021
Zetterqvist, J., Vansteelandt, S., Pawitan, Y., & Sjölander, A. (2016). Doubly robust methods for handling confounding by cluster. Biostatistics, 17(2), 264–276. https://doi.org/10.1093/biostatistics/kxv041

Author information

Authors and Affiliations

School of Data Science, University of Virginia, 31 Bonnycastle Dr, Charlottesville, VA, 22903, USA
Youmi Suk
Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA
Hyunseung Kang

Corresponding author

Correspondence to Youmi Suk.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (r 20 KB)

Supplementary material 2 (r 6 KB)

Supplementary material 3 (r 37 KB)

Supplementary material 4 (csv 446 KB)

Supplementary material 5 (r 5 KB)

Appendices

A. Ensemble Learning Algorithms

We provide additional details on the learning algorithms used in our three approaches. The first and third approaches, denoted as PR and DDPR in the main paper, use the same learning algorithms. These algorithms, denoted as Algorithms 2 and 3, are detailed below.

For the second approach, denoted as DD in the main paper, we estimate \(g_x^*\) and \(h_x^*\) by using Algorithm 4 and Algorithm 5, respectively. Here, we implement a single generalized linear model because we use a heuristic suggested in the main text that includes both the demeaned variables \({\mathbf {X}}_{ij}^*\) and variable means \(\overline{{\mathbf {X}}}_{j}\).
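The two sets of predictors named in this heuristic, the demeaned variables \({\mathbf {X}}_{ij}^*\) and the cluster means \(\overline{{\mathbf {X}}}_{j}\), can be illustrated with a minimal sketch. It assumes the usual within-cluster centering \({\mathbf {X}}_{ij}^* = {\mathbf {X}}_{ij} - \overline{{\mathbf {X}}}_{j}\); the function name `demean_by_cluster` and the toy data are hypothetical and are not taken from the CURobustML package.

```python
from collections import defaultdict


def demean_by_cluster(values, cluster_ids):
    """Split one covariate into its within-cluster deviation and its cluster mean.

    Returns two lists aligned with the input: the demeaned values
    (X_ij minus the mean of cluster j) and the cluster mean repeated per unit.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, j in zip(values, cluster_ids):
        sums[j] += x
        counts[j] += 1
    means = {j: sums[j] / counts[j] for j in sums}

    cluster_mean = [means[j] for j in cluster_ids]
    demeaned = [x - m for x, m in zip(values, cluster_mean)]
    return demeaned, cluster_mean


# Toy data (made up): one covariate for five students in two schools.
x = [1.0, 3.0, 10.0, 14.0, 12.0]
school = ["a", "a", "b", "b", "b"]

x_star, x_bar = demean_by_cluster(x, school)
print(x_star)  # within-school deviations
print(x_bar)   # school means, one per student
```

Both columns would then enter the single generalized linear model as predictors; the model fit itself is omitted here.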

B. Simulation Results for BART and TMLE with Cluster Dummies

In this section, we report additional simulations with BART and TMLE under Designs 1 and 4. For Design 1, we included cluster dummies (denoted as id) as additional predictors in BART and TMLE. For Design 4, we included two different sets of cluster indicators in BART and TMLE: (i) factor-1-by-factor-2 dummies (denoted as f12id) and (ii) factor-1 dummies and factor-2 dummies (denoted as f1id and f2id, respectively).
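As a rough illustration of the two indicator sets for cross-classified data, the sketch below builds (i) one dummy per factor-1-by-factor-2 cell and (ii) separate dummies for each factor. The helper `one_hot`, the variable names, the toy labels, and the choice to drop a reference level are all assumptions made for this example; the text does not state how the dummies were coded.

```python
def one_hot(labels, drop_first=True):
    """Return (levels, rows) where each row holds 0/1 indicators for the levels.

    drop_first=True omits the first (sorted) level as a reference category,
    which is an assumption of this sketch.
    """
    levels = sorted(set(labels))
    if drop_first:
        levels = levels[1:]
    rows = [[1 if lab == lev else 0 for lev in levels] for lab in labels]
    return levels, rows


# Toy cross-classified memberships (made up): four units.
factor1 = ["s1", "s1", "s2", "s2"]
factor2 = ["n1", "n2", "n1", "n2"]

# (i) One set of dummies for every factor-1-by-factor-2 cell.
cells = [f"{a}:{b}" for a, b in zip(factor1, factor2)]
cell_levels, f12_dummies = one_hot(cells)

# (ii) Separate dummies for factor 1 and for factor 2, placed side by side.
f1_levels, f1_dummies = one_hot(factor1)
f2_levels, f2_dummies = one_hot(factor2)
separate_dummies = [r1 + r2 for r1, r2 in zip(f1_dummies, f2_dummies)]

print(cell_levels, f12_dummies)
print(f1_levels + f2_levels, separate_dummies)
```

Set (i) grows with the number of occupied cells, while set (ii) grows only with the number of levels of each factor, so the two sets can differ considerably in size.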

Figures 6 and 7 summarize the results under Design 1 and Design 4, respectively. From Fig. 6, we observed that including an additional set of cluster dummies did not always reduce bias and RMSE for \(\varvec{\beta }\), and both BART and TMLE with cluster dummies (i.e., BART+id, TMLE+id) still yielded higher bias and RMSE for \(\varvec{\beta }\) and \(\tau \) than our proposed estimators. Similarly, as shown in Fig. 7, including either set of cluster indicators in BART or TMLE did not always remove the bias in the effect estimates, and the remaining biases were still larger than those of our proposed estimators.

C. Convergence Properties of Proposed Methods Under Design 1

Figure 8 shows the RMSE of our proposed estimators under Design 1 when we increase the cluster size and vary the outcome and the propensity score model specifications. Overall, we see that the RMSE of all three estimators decreases as the cluster size increases so long as the outcome model or the propensity score model is correct, which numerically suggests the double robustness property of our proposed estimators.
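For readers less familiar with the two performance measures, the sketch below computes bias and RMSE in the standard way from a set of simulation replicates. The function name `bias_and_rmse` and the replicate values are made up for illustration and are not results from the paper.

```python
import math


def bias_and_rmse(estimates, truth):
    """Bias and root mean squared error of replicate estimates around a known truth."""
    n = len(estimates)
    bias = sum(est - truth for est in estimates) / n
    rmse = math.sqrt(sum((est - truth) ** 2 for est in estimates) / n)
    return bias, rmse


# Made-up replicate estimates of a treatment effect whose true value is 2.0.
bias, rmse = bias_and_rmse([1.0, 3.0], truth=2.0)
print(bias, rmse)
```

An estimator can have zero bias and still a large RMSE, as in this toy case, because RMSE also reflects the spread of the estimates across replicates.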

D. Simulation Results for Designs 2, 3, and 4

In the following subsections, we provide Tables 6, 7, 8, 9, 10, 11, 12 and 13. Tables 6, 7, 8, 9, 10 and 11 show the performance of our estimators with different estimators for the outcome model and the propensity score.

Tables 7 and 13 replicate Table 2 for Design 2 and Design 4 in order to evaluate the robustness of our methods when the propensity score model, the outcome model, or both are mis-specified. Overall, as long as we correctly specified either the treatment model or the outcome model, the effect estimates of \(\varvec{\beta }\) and \(\tau \) from our proposed estimators had almost no bias. In contrast, when both the treatment and outcome models were mis-specified, all the estimators were biased. Also, we observed that mis-specifying the outcome model led to estimates with higher variance compared to mis-specifying the treatment model.
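The idea that one correct model is enough can be illustrated with the generic augmented inverse propensity weighted (AIPW) estimator of the average treatment effect (see Glynn & Quinn, 2010, in the references). This is a minimal sketch of that textbook estimator only; it is not the PR, DD, or DDPR estimator of the paper and it does not address cluster-level unmeasured confounding. The function name `aipw_ate` and all numbers are hypothetical.

```python
def aipw_ate(y, t, e_hat, m1_hat, m0_hat):
    """Generic AIPW estimate of the average treatment effect.

    y: outcomes, t: 0/1 treatment indicators,
    e_hat: estimated propensity scores (strictly between 0 and 1),
    m1_hat / m0_hat: predicted outcomes under treatment / control.
    """
    total = 0.0
    for yi, ti, ei, m1, m0 in zip(y, t, e_hat, m1_hat, m0_hat):
        total += (m1 - m0
                  + ti * (yi - m1) / ei
                  - (1 - ti) * (yi - m0) / (1 - ei))
    return total / len(y)


# Toy data (made up). The outcome predictions are exactly right for the
# observed arm, while the propensity scores are deliberately arbitrary.
y = [3.0, 1.0, 4.0, 2.0]
t = [1, 0, 1, 0]
m1_hat = [3.0, 3.0, 4.0, 4.0]
m0_hat = [1.0, 1.0, 2.0, 2.0]
e_wrong = [0.9, 0.2, 0.7, 0.4]

ate = aipw_ate(y, t, e_wrong, m1_hat, m0_hat)
print(ate)
```

Because the outcome residuals are zero in this toy case, the weighting terms vanish and the estimate equals the average of `m1_hat - m0_hat` no matter which propensity scores are supplied. The mirror-image case, a correct propensity score with a wrong outcome model, holds in expectation rather than exactly.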

D.1. Design 2: Two-level Data, Cross-level Interaction, and Large Cluster Size

Table 6 Performance of proposed methods under different estimators of the propensity score model and the outcome model in Design 2.

Table 7 Performance of estimators under different specifications of the propensity score model and the outcome model in Design 2.

D.2. Design 3: Two-level Data, No Cross-level Interaction, and Small Cluster Size

Table 8 Performance of proposed methods under different estimators of the propensity score model and the outcome model in Design 3: 100 clusters with cluster sizes of 30.

Table 9 Performance of proposed methods under different estimators of the propensity score model and the outcome model in Design 3: 100 clusters with cluster sizes of 20.

Table 10 Performance of proposed methods under different estimators of the propensity score model and the outcome model in Design 3: 100 clusters with cluster sizes of 10.

Table 11 Performance of proposed methods under different estimators of the propensity score model and the outcome model in Design 3: 100 clusters with cluster sizes of 5.

D.3. Design 4: Cross-classified Data, No Cross-level Interaction, and Large Cluster Size

Table 12 Performance of proposed methods under different estimators of the propensity score model and the outcome model in Design 4.

Table 13 Performance of estimators under different specifications of the propensity score model and the outcome model in Design 4.

E. Covariate Balance

We checked covariate balance using the absolute standardized mean differences and variance ratios between the treatment group and the control group. As a rule of thumb, covariates are considered well balanced if the mean difference of each covariate is less than 0.1 standard deviations and the variance ratio lies between 4/5 and 5/4 (Rubin, 2001; Shadish, Clark, & Steiner, 2008; Steiner, Cook, Shadish, & Clark, 2010). We observed initial imbalance in the covariates. After applying marginal mean weighting through stratification (MMW-S) (Hong & Hong, 2009) with the propensity scores obtained from the PR estimator, we achieved acceptable covariate balance between the treated and untreated groups (Fig. 9).
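The rule of thumb can be sketched for a single covariate as follows. This is an unweighted illustration only: it assumes the mean difference is standardized by the square root of the average of the two group variances, treats the 4/5 and 5/4 bounds as inclusive, and leaves out the MMW-S weights. The function names and the data are hypothetical.

```python
import math
from statistics import mean, variance


def balance_stats(treated, control):
    """Absolute standardized mean difference and variance ratio for one covariate."""
    v_t, v_c = variance(treated), variance(control)
    pooled_sd = math.sqrt((v_t + v_c) / 2)  # assumed standardizer
    asmd = abs(mean(treated) - mean(control)) / pooled_sd
    return asmd, v_t / v_c


def is_balanced(asmd, var_ratio):
    """Apply the 0.1 SD and [4/5, 5/4] rule of thumb (bounds assumed inclusive)."""
    return asmd < 0.1 and 4 / 5 <= var_ratio <= 5 / 4


# Made-up covariate values for the two groups.
treated = [3.0, 5.0, 7.0]
control = [2.0, 4.0, 6.0]

asmd, ratio = balance_stats(treated, control)
print(asmd, ratio, is_balanced(asmd, ratio))
```

In this toy case the variances match but the means differ by half a standard deviation, so the covariate fails the check. With weights, the same two statistics would be computed from weighted means and variances.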

About this article

Cite this article

Suk, Y., Kang, H. Robust Machine Learning for Treatment Effects in Multilevel Observational Studies Under Cluster-level Unmeasured Confounding. Psychometrika 87, 310–343 (2022). https://doi.org/10.1007/s11336-021-09805-x

Received: 16 September 2020
Revised: 31 July 2021
Published: 15 October 2021
Issue Date: March 2022
DOI: https://doi.org/10.1007/s11336-021-09805-x
