
Robust Machine Learning for Treatment Effects in Multilevel Observational Studies Under Cluster-level Unmeasured Confounding

Psychometrika · Theory and Methods

Abstract

Recently, machine learning (ML) methods have been used in causal inference to estimate treatment effects while reducing concerns about model mis-specification. However, many ML methods require that all confounders be measured in order to estimate treatment effects consistently. In this paper, we propose a family of ML methods that estimate treatment effects in the presence of cluster-level unmeasured confounders, a type of unmeasured confounder that is shared within each cluster and is common in multilevel observational studies. We show through simulation studies that our proposed methods are robust to bias from unmeasured cluster-level confounders in a variety of multilevel observational studies. We also use our methods to examine the effect of taking an algebra course on math achievement scores in the Early Childhood Longitudinal Study, a multilevel observational educational study. The proposed methods are available in the CURobustML R package.




Notes

  1. https://github.com/youmisuk/robustML; https://github.com/youmisuk/CURobustML

  2. http://nces.ed.gov/ecls/kindergarten.asp


Author information


Correspondence to Youmi Suk.


Appendices

A. Ensemble Learning Algorithms

We provide additional details on the learning algorithms used in our three approaches. The first and third approaches, denoted as PR and DDPR in the main paper, use the same learning algorithms, which are given below as Algorithms 2 and 3.

[Algorithm 2 (figure b) and Algorithm 3 (figure c) appear as images in the original and are not reproduced here.]

For the second approach, denoted as DD in the main paper, we estimate \(g_x^*\) and \(h_x^*\) by using Algorithm 4 and Algorithm 5, respectively. Here, we implement a single generalized linear model because we use a heuristic suggested in the main text that includes both the demeaned variables \({\mathbf {X}}_{ij}^*\) and variable means \(\overline{{\mathbf {X}}}_{j}\).

[Algorithm 4 (figure d) and Algorithm 5 (figure e) appear as images in the original and are not reproduced here.]
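As a minimal sketch of the demeaning step behind this heuristic, the Python snippet below splits a single covariate into its cluster mean \(\overline{X}_{j}\) and its within-cluster deviation \(X_{ij}^* = X_{ij} - \overline{X}_{j}\). The function name `demean_by_cluster` and the toy data are hypothetical and are not taken from the CURobustML package.

```python
from collections import defaultdict


def demean_by_cluster(values, cluster_ids):
    """Return (demeaned, cluster_mean) lists aligned with the input rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, j in zip(values, cluster_ids):
        sums[j] += x
        counts[j] += 1
    means = {j: sums[j] / counts[j] for j in sums}
    # Cluster mean repeated for every unit in the cluster.
    cluster_mean = [means[j] for j in cluster_ids]
    # Within-cluster deviation of each unit from its cluster mean.
    demeaned = [x - m for x, m in zip(values, cluster_mean)]
    return demeaned, cluster_mean


# Toy data (hypothetical): four students in two schools.
x = [1.0, 3.0, 10.0, 14.0]
school = ["a", "a", "b", "b"]
x_star, x_bar = demean_by_cluster(x, school)
```

Both `x_star` and `x_bar` would then enter the generalized linear model as predictors, in line with the heuristic described above.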

B. Simulation Results for BART and TMLE with Cluster Dummies

In this section, we performed additional simulations with BART and TMLE under Designs 1 and 4. For Design 1, we included cluster dummies (denoted as id) as additional predictors in BART and TMLE. For Design 4, we included two different sets of cluster indicators in BART and TMLE: (i) factor-1-by-factor-2 dummies (denoted as f12id) and (ii) factor-1 dummies and factor-2 dummies (denoted as f1id and f2id, respectively).
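To make the two sets of cluster indicators concrete, the following Python sketch builds them from cluster labels. It is only an illustration under assumed toy labels. The helper `one_hot` is hypothetical and does not reproduce the code used in the simulations.

```python
def one_hot(labels):
    """Return (levels, rows) with one 0/1 indicator column per distinct label."""
    levels = sorted(set(labels))
    index = {level: k for k, level in enumerate(levels)}
    rows = []
    for label in labels:
        row = [0] * len(levels)
        row[index[label]] = 1
        rows.append(row)
    return levels, rows


# Design 1 style: a single cluster identifier (id).
cluster = ["s1", "s1", "s2", "s3"]
id_levels, id_dummies = one_hot(cluster)

# Design 4 style: two crossed factors (toy labels, hypothetical).
factor1 = ["n1", "n1", "n2", "n2"]
factor2 = ["s1", "s2", "s1", "s2"]

# (i) f12id: one dummy per factor-1-by-factor-2 cell.
f12_levels, f12id = one_hot(list(zip(factor1, factor2)))

# (ii) f1id and f2id: separate dummies for each factor, placed side by side.
f1_levels, f1id = one_hot(factor1)
f2_levels, f2id = one_hot(factor2)
separate = [a + b for a, b in zip(f1id, f2id)]
```

Option (i) creates one column per occupied cell of the cross-classification, whereas option (ii) creates one column per level of each factor, so (i) can grow much faster with the number of levels.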

Figures 6 and 7 summarize the results under Design 1 and Design 4, respectively. From Fig. 6, we observed that including an additional set of cluster dummies did not always reduce bias and RMSE for \(\varvec{\beta }\), and both BART and TMLE with cluster dummies (i.e., BART+id, TMLE+id) still yielded higher bias and RMSE for \(\varvec{\beta }\) and \(\tau \) than our proposed estimators. Similarly, as shown in Fig. 7, including either set of cluster indicators in BART or TMLE did not always remove bias from the effect estimates, and the resulting biases were still larger than those of our proposed estimators.

Fig. 6 Performance of Bayesian additive regression trees (BART) and targeted maximum likelihood estimation (TMLE) with and without cluster dummies under Design 1. id represents cluster dummies. The dashed black line indicates the true effect size of 2 or 3.

Fig. 7 Performance of Bayesian additive regression trees (BART) and targeted maximum likelihood estimation (TMLE) with and without cluster dummies under Design 4. f12id represents factor-1-by-factor-2 cluster dummies. f1id and f2id represent factor-1 dummies and factor-2 dummies, respectively. The label BART\(+\)f12id represents BART with cluster dummies f12id included as predictors. The dashed black line indicates the true effect size of 2 or 3.

C. Convergence Properties of Proposed Methods Under Design 1

Figure 8 shows the RMSE of our proposed estimators under Design 1 when we increase the cluster size and vary the outcome and propensity score model specifications. Overall, the RMSE of all three estimators decreases as the cluster size increases, so long as either the outcome model or the propensity score model is correctly specified, which numerically suggests the double robustness property of our proposed estimators.

Fig. 8 RMSE of proposed methods with increasing cluster sizes under Design 1.
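For readers who want to reproduce summaries of this kind, the sketch below shows how bias and RMSE are conventionally computed from estimates collected over simulation replicates. The replicate values are made up for illustration, and the true effect of 2 is borrowed from the figure captions. Neither reproduces the paper's results.

```python
import math


def bias(estimates, truth):
    """Average estimate across replicates minus the true effect."""
    return sum(estimates) / len(estimates) - truth


def rmse(estimates, truth):
    """Square root of the mean squared deviation from the true effect."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


# Hypothetical estimates of tau from four simulation replicates.
tau_hat = [1.5, 2.5, 2.0, 3.0]
tau_true = 2.0

tau_bias = bias(tau_hat, tau_true)
tau_rmse = rmse(tau_hat, tau_true)
```

Because RMSE combines squared bias and variance, an estimator can show little bias yet still have a large RMSE when its estimates vary widely across replicates.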

D. Simulation Results for Designs 2, 3, and 4

In the following subsections, we provide Tables 6–13. Tables 6 and 8–12 show the performance of our estimators with different estimators for the outcome model and the propensity score.

Tables 7 and 13 replicate Table 2 for Design 2 and Design 4 in order to evaluate the robustness of our methods when the propensity score model, the outcome model, or both are mis-specified. Overall, as long as we correctly specified either the treatment model or the outcome model, the effect estimates of \(\varvec{\beta }\) and \(\tau \) from our proposed estimators had almost no bias. In contrast, when both the treatment and outcome models were mis-specified, all the estimators were biased. We also observed that mis-specifying the outcome model led to estimates with higher variance than mis-specifying the treatment model.

D.1. Design 2: Two-level Data, Cross-level Interaction, and Large Cluster Size

Table 6 Performance of proposed methods under different estimators of the propensity score model and the outcome model in Design 2.
Table 7 Performance of estimators under different specifications of the propensity score model and the outcome model in Design 2.

D.2. Design 3: Two-level Data, No Cross-level Interaction, and Small Cluster Size

Table 8 Performance of proposed methods under different estimators of the propensity score model and the outcome model in Design 3: 100 clusters with cluster sizes of 30.
Table 9 Performance of proposed methods under different estimators of the propensity score model and the outcome model in Design 3: 100 clusters with cluster sizes of 20.
Table 10 Performance of proposed methods under different estimators of the propensity score model and the outcome model in Design 3: 100 clusters with cluster sizes of 10.
Table 11 Performance of proposed methods under different estimators of the propensity score model and the outcome model in Design 3: 100 clusters with cluster sizes of 5.

D.3. Design 4: Cross-classified Data, No Cross-level Interaction, and Large Cluster Size

Table 12 Performance of proposed methods under different estimators of the propensity score model and the outcome model in Design 4.
Table 13 Performance of estimators under different specifications of the propensity score model and the outcome model in Design 4.

Fig. 9 Covariate balance before and after propensity score adjustment (standardized mean differences on the x-axis and variance ratios on the y-axis).

E. Covariate Balance

We checked covariate balance using the absolute standardized mean differences and variance ratios between the treatment group and the control group. As a rule of thumb, covariates are considered well balanced if the mean difference of each covariate is less than 0.1 standard deviations and the variance ratio lies between 4/5 and 5/4 (Rubin, 2001; Shadish, Clark, & Steiner, 2008; Steiner, Cook, Shadish, & Clark, 2010). We observed initial imbalance in the covariates. After applying marginal mean weighting through stratification (MMW-S) (Hong & Hong, 2009) with the propensity scores obtained from the PR estimator, we achieved acceptable covariate balance between the treated and untreated groups (Fig. 9).
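The rule of thumb above can be sketched in a few lines of Python. This is an unweighted illustration only: it assumes the standardized mean difference is scaled by the pooled standard deviation \(\sqrt{(s_t^2 + s_c^2)/2}\), which is one common convention and may differ from the one used in the paper, and it does not implement the MMW-S weights. The function names and data are hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def sample_var(xs):
    """Unbiased sample variance (divides by n - 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def balance_stats(treated, control):
    """Return (absolute standardized mean difference, variance ratio)."""
    v_t, v_c = sample_var(treated), sample_var(control)
    # Assumption: standardize by the pooled standard deviation of both groups.
    pooled_sd = math.sqrt((v_t + v_c) / 2)
    smd = abs(mean(treated) - mean(control)) / pooled_sd
    return smd, v_t / v_c


def is_balanced(smd, var_ratio, smd_cut=0.1, vr_low=4 / 5, vr_high=5 / 4):
    """Apply the 0.1 SD and [4/5, 5/4] variance-ratio rule of thumb."""
    return smd < smd_cut and vr_low <= var_ratio <= vr_high


# Hypothetical covariate values for treated and control units.
treated = [2.0, 4.0, 6.0]
control = [1.0, 3.0, 5.0]
smd, var_ratio = balance_stats(treated, control)
balanced = is_balanced(smd, var_ratio)
```

In this toy example the variances match but the means differ by half a pooled standard deviation, so the covariate would be flagged as imbalanced before any adjustment.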


Cite this article

Suk, Y., Kang, H. Robust Machine Learning for Treatment Effects in Multilevel Observational Studies Under Cluster-level Unmeasured Confounding. Psychometrika 87, 310–343 (2022). https://doi.org/10.1007/s11336-021-09805-x


  • DOI: https://doi.org/10.1007/s11336-021-09805-x
