# A data-adaptive strategy for inverse weighted estimation of causal effects

## Abstract

In most nonrandomized observational studies, differences between treatment groups may arise not only from the treatment itself but also from the effects of confounders. Therefore, causal inference regarding the treatment effect is not as straightforward as in a randomized trial. To adjust for confounding due to measured covariates, the average treatment effect is often estimated using propensity scores. Typically, propensity scores are estimated by logistic regression; more recent suggestions have been to employ nonparametric classification algorithms from machine learning. In this article, we propose a weighted estimator combining parametric and nonparametric models. Some theoretical results regarding the consistency of the procedure are given. Simulation studies are used to assess the performance of the newly proposed methods relative to existing methods, and a data analysis example from the Surveillance, Epidemiology and End Results database is presented.

### Keywords

Boosting algorithms · Causal inference · Logistic regression · Observational data · Random forests

## 1 Introduction

In the recent clinical literature, a common framework for assessing the causal effect of a treatment on a response is based on the potential outcomes framework advocated by Rubin (1974) and Rosenbaum and Rubin (1983). The authors of the latter paper proposed the concept of the propensity score, defined as the probability of receiving the treatment given the covariates. They further demonstrated that, conditional on the propensity score, the observed outcomes from an observational study can be viewed as coming from a randomized study.

There are a variety of approaches to adjusting for the propensity score, nicely summarized in an overview by Lunceford and Davidian (2004). Examples of propensity score based approaches are inverse probability weighting (IPW), matching, subclassification and double-robust estimation. In this article, we focus on the use of IPW to estimate the treatment effect, which we define in Sect. 2. While IPW has been a popular approach to the estimation of causal effects, Kang and Schafer (2007) argued against such methods because causal effects estimated by IPW are sensitive to observations with large weights. However, most of these observations/weights are informative to the analysis, so they cannot simply be discarded: for example, if a treated subject has a low propensity score, the observed outcome of this subject is highly informative about the missing potential outcome for those in the control (untreated) group. An open question is how to deal with extreme weights in IPW estimation procedures.

Another theme addressed in the article is the choice of modeling procedure for propensity scores. It is possible that different methods for estimating the propensity score may lead to different estimates of the treatment effect. In the statistical literature, propensity scores have been typically estimated by logistic regression. Recently, several studies employed machine learning methods as an alternative to logistic regression (LR) for modeling propensity scores (Setoguchi et al. 2008; Lee et al. 2010; McCaffrey et al. 2004).

The layout of the paper is as follows. In Sect. 2, we review the potential outcomes framework. In Sect. 3, we describe two classes of methods for estimating propensity scores, parametric methods and nonparametric ones. We then propose a model-averaging approach that combines parametric and nonparametric estimators in Sect. 4, where we also present some consistency results. In Sect. 5, we present a simulation study and show that the newly proposed method is superior in terms of reducing the bias and variance of the causal effect estimates. We also demonstrate that the proposed procedure provides a statistically principled approach to downweighting extreme observations in IPW estimation procedures. In Sect. 6, we illustrate our method by comparing treatments for cholangiocarcinoma, a cancer of the bile ducts, using data collected through the Surveillance, Epidemiology and End Results database.

## 2 A review of the potential outcomes framework

Since the true values of the propensity scores are unknown, it is necessary to estimate them in the first stage. The estimation of propensity scores sometimes involves a high-dimensional vector of covariates. Traditionally, this is done by LR. Recently, machine learning methods, such as classification trees or generalized boosted regression, have been proposed for estimating propensity scores. Many simulation studies have shown that the estimation method employed in the first stage affects the finite-sample properties of the estimated treatment effect in the second stage. For example, Lee et al. (2010) show that when there is moderate misspecification of the LR model, ensemble machine learning methods (random forests and generalized boosted regression) yield smaller bias and variance and more consistent 95 % confidence interval coverage.

## 3 Propensity score modeling

In much of the literature, the estimation of the propensity score is done by LR, which can be fit in almost any statistical software. However, LR is not without drawbacks. When specifying a parametric form for \(e(X)\), including only main effects is usually not adequate, yet it is also challenging to determine which interaction and nonlinear terms should be included, especially when the vector of covariates is high-dimensional. In addition, LR is not resistant to outliers (Kang and Schafer 2007; Pregibon 1982). In particular, Kang and Schafer (2007) show that when the LR model is misspecified, IPW leads to large bias.
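As an illustration of the standard first-stage fit discussed above, the following sketch estimates propensity scores by a main-effects LR model and forms the IPW weights. The data are simulated for illustration only, and the use of scikit-learn is our assumption; the paper does not prescribe software.

```python
# Sketch: main-effects logistic-regression propensity scores and IPW weights.
# X and Z below are simulated purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 500, 4
X = rng.normal(size=(n, p))
# hypothetical true assignment model with main effects only
logit = 0.5 * X[:, 0] - 0.5 * X[:, 1]
Z = rng.binomial(1, 1 / (1 + np.exp(-logit)))

lr = LogisticRegression().fit(X, Z)
e_hat = lr.predict_proba(X)[:, 1]        # estimated propensity scores e(X)
w = Z / e_hat + (1 - Z) / (1 - e_hat)    # inverse probability weights
```

Note that subjects with \(\hat{e}(\mathbf{X})\) near 0 or 1 receive very large weights, which is the sensitivity issue raised by Kang and Schafer (2007).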

As alternatives to LR, we will consider machine learning procedures such as K-nearest neighbors (KNN), support vector machines (SVM), classification and regression trees (CART) and its various extensions, such as pruned CART, bagged CART, random forests and boosting.

Except for SVM and KNN, all of the above-mentioned algorithms are based on the construction of tree classifiers. In general, a tree classifier works as follows: beginning with a training data set \((\mathbf{X}_i,Z_i), i=1,\ldots ,n\), the algorithm repeatedly splits nodes based on one of the covariates in \(\mathbf{X}\) until some stopping criterion is met (for example, when a terminal node contains training data from only one class). Each terminal node is then assigned the majority class of the \(Z_i\) that fall in that node. When a testing data point with covariate vector \(\mathbf{X}\) is introduced, it is run from the top of the tree down to one of the terminal nodes, and the prediction is the class label of that terminal node. Compared to parametric algorithms, tree-based algorithms have several advantages. First, there is no need to assume a parametric model for a tree: in constructing a tree, the algorithm only needs to specify the criterion for splitting a node and when to stop splitting (Breiman et al. 1984). By splitting on different covariates at different nodes, the algorithm automatically includes interaction terms in the model. Second, because the algorithm is nonparametric, a tree classifier can pick out important covariates (in a stepwise manner) even when \(\mathbf{X}\) is high-dimensional or most of the covariates are highly correlated (McCaffrey et al. 2004). Moreover, a standard tree classifier is usually very fast to fit and robust to outliers (Breiman et al. 1984).
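The routing of a test point to a terminal node described above can be sketched as follows; the simulated data (which deliberately contain an interaction term) and the choice of scikit-learn are illustrative assumptions.

```python
# Sketch of a tree classifier on training pairs (X_i, Z_i): each test point is
# routed to a terminal node and labeled by that node's majority class.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
Z = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)  # includes an interaction

tree = DecisionTreeClassifier(random_state=0).fit(X, Z)
X_new = rng.normal(size=(10, 5))
labels = tree.predict(X_new)   # majority class of each reached terminal node
leaves = tree.apply(X_new)     # index of the terminal node each point reaches
```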

One of the biggest issues with a standard tree classifier is its tendency to overfit. That is, the constructed tree is usually too adaptive to the training data and hence yields high prediction errors on testing data. To address the over-fitting problem, pruned CART was proposed (Breiman et al. 1984). In pruned CART, a tree is fully grown and then pruned back until some stopping criterion is met, for example, when the cross-validation error rate of the pruned tree reaches its minimum. Compared to a standard tree classifier, the pruned tree is smaller in size and yields lower prediction errors.
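One concrete way to implement the grow-then-prune strategy above is cost-complexity pruning with the pruning parameter chosen by cross-validated error, as in this sketch (simulated data; scikit-learn's `ccp_alpha` mechanism is our assumed implementation, not the paper's):

```python
# Sketch: cost-complexity pruning of a fully grown CART, choosing the pruning
# parameter by cross-validated accuracy (one common stopping rule).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
Z = (X[:, 0] - X[:, 1] > rng.normal(size=400)).astype(int)

full = DecisionTreeClassifier(random_state=0).fit(X, Z)
alphas = full.cost_complexity_pruning_path(X, Z).ccp_alphas
# pick the pruning strength with the best 5-fold cross-validated accuracy
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X, Z, cv=5).mean() for a in alphas]
best = DecisionTreeClassifier(
    random_state=0, ccp_alpha=alphas[int(np.argmax(scores))]).fit(X, Z)
```

The pruned tree `best` is never larger than the fully grown tree `full`.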

Another class of tree-based algorithms is random forests, first introduced by Breiman (2001). It belongs to the category of so-called ensemble methods: instead of generating one classification tree, it generates many trees. At each node of a tree, a random subset of the covariates is selected, and the node is split based on the best split among the selected covariates. For a testing data point with covariate vector \(\mathbf{X}\), each tree votes for one of the classes, and the prediction is made by majority vote among the trees. In the first stage of causal inference, if we apply the random forests algorithm, the propensity score can be estimated as the proportion of trees that vote for class 1. Biau et al. (2008) proved the consistency of the random forests estimator in terms of predicting the class label, and commented that random forests are among the most accurate general-purpose classifiers available.
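The vote-proportion estimate of the propensity score described above can be sketched as follows. Note that scikit-learn's `predict_proba` averages the trees' leaf class frequencies, a smoothed version of the strict vote proportion; the simulated data are illustrative.

```python
# Sketch: random-forest propensity scores, taken as the (smoothed) proportion
# of trees favoring treatment (class 1).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 6))
Z = (X[:, 0] + 0.5 * X[:, 1] > rng.normal(size=400)).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, Z)
# averaged tree probabilities play the role of the vote proportion -> e(X)
e_hat = rf.predict_proba(X)[:, 1]
```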

Bagged CART (Breiman 1996) also belongs to the category of ensemble tree classifiers. In this algorithm, multiple bootstrap samples of the original training data are drawn with replacement, and each bootstrap sample produces one classification tree. The bootstrap sample size is usually taken to be the same as that of the original data set. For a testing data point with covariate vector \(\mathbf{X}\), the propensity score can be estimated in the same way as in random forests.
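A minimal sketch of bagged CART, assuming scikit-learn's `BaggingClassifier` (whose default base learner is a decision tree, and whose default bootstrap sample size equals the original sample size, matching the description above):

```python
# Sketch: bagged CART; each of 100 trees is grown on a bootstrap sample of the
# same size as the data, and propensity scores are estimated as in random
# forests. Data are simulated for illustration.
import numpy as np
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
Z = (X[:, 0] > rng.normal(size=300)).astype(int)

bag = BaggingClassifier(n_estimators=100, random_state=0).fit(X, Z)
e_hat = bag.predict_proba(X)[:, 1]   # proportion of trees voting class 1
```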

One of the simplest nonparametric classification algorithms is KNN. It works as follows: for a testing data point, find the \(K\) nearest points in the training set in terms of some distance measure, e.g., Euclidean distance. The testing data point is then assigned the majority class among the selected \(K\) data points. In the first stage of causal inference, the propensity score for a testing data point can be estimated as the proportion of its \(K\) nearest neighbors that belong to class 1.
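The neighbor-proportion estimate just described can be sketched as follows (simulated data; with uniform neighbor weights, `predict_proba` returns exactly the fraction of the \(K\) neighbors in each class):

```python
# Sketch: K-nearest-neighbor propensity scores -- the proportion of a point's
# K nearest training neighbors (Euclidean distance) that received treatment.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
Z = (X[:, 0] > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=10).fit(X, Z)
e_hat = knn.predict_proba(X)[:, 1]   # fraction of the 10 neighbors with Z = 1
```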

## 4 A model-averaging approach

### 4.1 The proposed method: combining logistic regression with nonparametric machine learning methods

As described in the previous section, there are many models for estimating propensity scores, both parametric and nonparametric. It is understood that there is no uniformly “best” algorithm for all data sets, and one can always employ a model selection criterion to pick the best one for a particular data set. However, doing so ignores the randomness and uncertainty in the model selection procedure. In the literature, model-combining/model-averaging techniques are often used to account for this uncertainty; examples are Hoeting et al. (1999), Yang (2001), Yuan and Yang (2005) and Yuan and Ghosh (2008).

After comparing various nonparametric techniques, we recommend that \(e_2(\mathbf{X})\) be estimated by the random forests model or GBM. The reason can be seen clearly from our simulation results, presented in Sect. 5.
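To make the model-averaging idea concrete, the following sketch forms a convex combination \(\hat{e}(\mathbf{X},\lambda )=\lambda \hat{e}_1(\mathbf{X})+(1-\lambda )\hat{e}_2(\mathbf{X})\) of an LR and a random-forest propensity estimate, with \(\lambda \) chosen by held-out Bernoulli log-likelihood. The exact form of estimator (7) and its rule for choosing \(\lambda \) are given in the text; this grid-search-on-a-holdout version is only an illustrative assumption.

```python
# Minimal sketch of a convex combination of a parametric (LR) and a
# nonparametric (RF) propensity estimate, with the mixing weight lam chosen by
# held-out Bernoulli log-likelihood. Data are simulated for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(600, 5))
Z = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + X[:, 1] * X[:, 2]))))

X_tr, X_val, Z_tr, Z_val = train_test_split(X, Z, test_size=0.5, random_state=0)
e1 = LogisticRegression().fit(X_tr, Z_tr).predict_proba(X_val)[:, 1]
e2 = RandomForestClassifier(random_state=0).fit(X_tr, Z_tr).predict_proba(X_val)[:, 1]

def loglik(e, z, eps=1e-6):
    e = np.clip(e, eps, 1 - eps)          # guard against log(0)
    return np.mean(z * np.log(e) + (1 - z) * np.log(1 - e))

lams = np.linspace(0.0, 1.0, 101)
lam_hat = lams[np.argmax([loglik(l * e1 + (1 - l) * e2, Z_val) for l in lams])]
e_comb = lam_hat * e1 + (1 - lam_hat) * e2   # combined propensity score
```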

### 4.2 Consistency of the proposed estimator

In this section, we prove some results regarding the consistency of the proposed estimator in (7) when \(e_1(\mathbf{X})\) is estimated by LR and \(e_2(\mathbf{X})\) is estimated by GBM. We first show the consistency of GBM (i.e., of \(\hat{e}_2(\mathbf{X})\)); most of the proof follows Zhang and Yu (2005). We then show that the LR estimator \(\hat{e}_1(\mathbf{X})\) remains consistent even when the parametric model is misspecified; the assumptions and proof are given by White (1982). Finally, we show the consistency of the proposed estimator, which is a weighted average of the two.

**Lemma 4.1**

GBM is one form of greedy boosting defined in Algorithm 2.1. of Zhang and Yu (2005).

*Proof*

**Lemma 4.2**

(a) The function A(f) in Lemma 4.1, (9) is convex and differentiable. (b) A(f) is second-order differentiable and for GBM, the second-order derivative satisfies \(A^{''}_{f,g}(0)\le 1\) where \(A_{f,g}(h)=A(f+hg)\). (c) \(\phi (f, z)\) in Lemma 4.1 satisfies the Lipschitz condition.

*Proof*

**Proposition 4.1**

- (A1)
There exists a unique \(f^{*}\) such that \(A(f^{*})=\inf _{f\in span(S)} A(f)\).

- (A2)
For any sequence \(f_m, A(f_m)\overset{p}{\rightarrow } A(f^{*})\) implies \(f_m \overset{p}{\rightarrow } f^{*}\).

- (A3)
Consider two sequences of sample-independent numbers \(k_n\) and \(\beta _n\) such that \(\lim _{n\rightarrow \infty } k_n=\infty \) and \(\lim _{n\rightarrow \infty } \beta _n R_n(S)=0\), where \(R_n(S)=E_{D_1^n} R_n(S,D_1^n)\) is the expected Rademacher complexity for GBM. We assume the algorithm (GBM) stops at step \(\hat{k}\) such that \(\hat{k} \le k_n\) and \(||\hat{f}_{\hat{k}}||_1 \le \beta _n\).

*Proof*

**Proposition 4.2**

- (B1)
Conditional on \(\mathbf{X}_i\)\((i=1,\ldots ,n)\), the \(Z_i\) have a joint distribution \(G\) on a parameter space \(\Omega \), with a Radon–Nikodým density \(g = dG/d\nu \) with respect to a dominating measure \(\nu \).

- (B2)
The parameter \(\beta \) lies in a compact subset \(B \subset R^p\). The logistic likelihood
$$\begin{aligned} f(z,\beta |\mathbf{x}) = e_{\beta } (\mathbf{x})^{z}[1-e_{\beta } (\mathbf{x})]^{(1-z)} \end{aligned}$$
is measurable in \(z\) for every \(\beta \) in \(B\) and continuous in \(\beta \) for every \(z\).
- (B3)
\(E\{\log g(Z_i)\}\) exists and \( | \log f(z,\beta |\mathbf{x})| \le m(z)\) for all \(\beta \in B\), where \(m\) is integrable with respect to \(G\).

- (B4)
Define \(I(g,f|\beta ) \equiv \int \log g(u|\mathbf{x})dG(u) - \int \log f(u,\beta |\mathbf{x}) dG(u)\) as the Kullback–Leibler distance between \(g\) and \(f\). Assume that \(I(g,f|\beta )\) has a unique minimum at \(\beta _0\).

**Theorem 4.1**

Under the conditions listed in Propositions 4.1 and 4.2, the proposed estimator in (7) is consistent if \(f^{*}(\mathbf{X}),f_{\beta _0}(\mathbf{X}) \in span(S)\cap span(S')\).

*Proof*

Furthermore, if the true log odds of the propensity score can be approximated arbitrarily closely by functions lying in the intersection of \(span(S)\) and \(span(S')\), the proposed estimator converges to the true propensity score. Consequently, \(\widehat{ACE}\) in (4) and \(\widehat{ACET}\) in (5) are consistent estimators of ACE and ACET.\(\square \)

### 4.3 The sandwich variance estimator

## 5 Simulation studies

### 5.1 Methodology comparison

To examine the performance of our proposed method, we conducted extensive simulation studies. In the first set of simulations, we compared the proposed combined estimators with different parametric and nonparametric algorithms for estimating the propensity score. We followed a slightly modified version of the simulation structure in Lee et al. (2010) and Setoguchi et al. (2008). Denote the vector of covariates by \(\mathbf{X}\), an 11-dimensional vector with \(X_0\) being the intercept term, \((X_1, X_2, X_3, X_4)\) being confounders, \((X_5, X_6, X_7)\) related only to the treatment assignment, and \((X_8, X_9, X_{10})\) related only to the potential outcomes. In the simulation, \(X_1\)–\(X_{10}\) are first generated from \(MVN(0, \varSigma )\), where \(\varSigma \) is a non-identity covariance matrix. Then, \(X_1, X_3, X_6, X_8, X_9\) are dichotomized into 0–1 variables. The treatment indicator \(Z\) is generated from a Bernoulli distribution with success probability \(p\) a function of the covariates. We use the same seven LR models (Scenarios (A)–(G)) as in Lee et al. (2010) to generate the treatment indicator. While the model in Scenario (A) has main effects only, the model in Scenario (G) has main effects, ten two-way interaction terms and three quadratic terms.

In the first stage, we compared both parametric and nonparametric estimation algorithms: two parametric choices, LR and LDA, and seven nonparametric choices, CART, pruned CART, bagged CART, RF, GBM, SVM and KNN. We include all ten covariates as predictors and consider only main effects when fitting LR, since in practice researchers do not know the true propensity score model and it is natural for them to assume a linear and additive one. In the second stage, we estimate both ACE and ACET by IPW; the formulas are given in (4) and (5).
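The second-stage estimators can be sketched as follows. Formulas (4) and (5) are not reproduced here; this sketch uses the standard normalized (Hájek-type) IPW forms, which may differ from the paper's exact expressions in the normalization, and the simulated data use the true propensity score only to verify that the weighting logic recovers the effect.

```python
# Sketch of normalized IPW estimators of ACE and ACET given propensity
# scores e_hat; the weighting logic matches the standard IPW construction.
import numpy as np

def ipw_ace(y, z, e_hat):
    w1, w0 = z / e_hat, (1 - z) / (1 - e_hat)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

def ipw_acet(y, z, e_hat):
    w0 = (1 - z) * e_hat / (1 - e_hat)   # reweights controls to the treated
    return y[z == 1].mean() - np.sum(w0 * y) / np.sum(w0)

rng = np.random.default_rng(7)
n = 2000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))               # true propensity score
z = rng.binomial(1, e)
y = 2.0 * z + x + rng.normal(size=n)   # true ACE = ACET = 2
ace = ipw_ace(y, z, e)                 # close to 2 when the true e(X) is used
```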

### 5.2 Results

Simulation results for average causal effect (ACE)

Method | A | B | C | D | E | F | G |
---|---|---|---|---|---|---|---|
**ASAM (\(SAM_{max}\))** | | | | | | | |
Logistic | 0.02 (0.05) | 0.02 (0.05) | 0.02 (0.04) | 0.03 (0.07) | 0.03 (0.06) | 0.03 (0.06) | 0.03 (0.07) |
LDA | 0.02 (0.05) | 0.02 (0.05) | 0.02 (0.04) | 0.03 (0.07) | 0.03 (0.06) | 0.03 (0.06) | 0.03 (0.07) |
CART | 0.15 (0.31) | 0.15 (0.31) | 0.13 (0.30) | 0.17 (0.36) | 0.17 (0.35) | 0.15 (0.34) | 0.14 (0.33) |
PRUNE | 0.17 (0.33) | 0.17 (0.33) | 0.14 (0.31) | 0.18 (0.37) | 0.18 (0.37) | 0.16 (0.35) | 0.15 (0.35) |
BAG | 0.14 (0.25) | 0.14 (0.25) | 0.13 (0.25) | 0.17 (0.28) | 0.17 (0.28) | 0.15 (0.26) | 0.14 (0.26) |
RF | 0.06 (0.13) | 0.06 (0.14) | 0.06 (0.14) | 0.07 (0.16) | 0.08 (0.16) | 0.06 (0.14) | 0.06 (0.15) |
GBM | 0.08 (0.15) | 0.09 (0.16) | 0.08 (0.15) | 0.10 (0.17) | 0.10 (0.17) | 0.09 (0.16) | 0.10 (0.17) |
SVM | 0.11 (0.26) | 0.11 (0.25) | 0.07 (0.16) | 0.12 (0.24) | 0.12 (0.23) | 0.12 (0.21) | 0.09 (0.17) |
KNN, k = 3 | 0.25 (0.52) | 0.24 (0.52) | 0.20 (0.35) | 0.27 (0.49) | 0.27 (0.49) | 0.27 (0.44) | 0.22 (0.40) |
KNN, k = 10 | 0.08 (0.19) | 0.08 (0.19) | 0.07 (0.16) | 0.10 (0.22) | 0.10 (0.22) | 0.10 (0.22) | 0.08 (0.20) |
DAMS1 | 0.05 (0.10) | 0.05 (0.11) | 0.06 (0.10) | 0.06 (0.12) | 0.07 (0.12) | 0.06 (0.10) | 0.06 (0.11) |
DAMS2 | 0.06 (0.12) | 0.07 (0.13) | 0.08 (0.14) | 0.08 (0.13) | 0.08 (0.14) | 0.07 (0.12) | 0.09 (0.14) |
**Absolute bias (%)** | | | | | | | |
Logistic | 6.36 | 6.14 | 5.16 | 7.17 | 6.94 | 7.41 | 8.58 |
LDA | 6.34 | 6.19 | 5.22 | 7.21 | 6.95 | 7.52 | 8.65 |
CART | 18.11 | 15.45 | 13.43 | 16.76 | 15.71 | 19.65 | 16.70 |
PRUNE | 22.58 | 18.85 | 14.07 | 18.92 | 17.69 | 20.32 | 18.29 |
BAG | 14.06 | 12.98 | 11.55 | 14.87 | 14.02 | 15.06 | 12.93 |
RF | 7.00 | 7.39 | 7.38 | 7.40 | 7.60 | 7.58 | 8.18 |
GBM | 9.58 | 9.96 | 10.07 | 10.99 | 11.56 | 10.68 | 12.19 |
SVM | 17.25 | 16.42 | 11.83 | 17.13 | 17.71 | 17.46 | 13.76 |
KNN, k = 3 | 38.00 | 34.62 | 29.78 | 40.65 | 35.70 | 40.31 | 31.13 |
KNN, k = 10 | 9.15 | 8.97 | 8.38 | 9.23 | 9.40 | 9.04 | 9.15 |
DAMS1 | 6.78 | 7.04 | 6.92 | 7.22 | 7.91 | 7.31 | 8.09 |
DAMS2 | 8.34 | 9.12 | 10.55 | 9.56 | 10.70 | 9.09 | 11.08 |
**Standard error** | | | | | | | |
Logistic | 0.061 | 0.060 | 0.057 | 0.065 | 0.063 | 0.063 | 0.060 |
LDA | 0.061 | 0.060 | 0.057 | 0.065 | 0.063 | 0.063 | 0.060 |
CART | 0.056 | 0.056 | 0.059 | 0.057 | 0.057 | 0.058 | 0.060 |
PRUNE | 0.055 | 0.055 | 0.058 | 0.056 | 0.056 | 0.057 | 0.059 |
BAG | 0.053 | 0.053 | 0.055 | 0.053 | 0.053 | 0.054 | 0.054 |
RF | 0.059 | 0.060 | 0.062 | 0.062 | 0.061 | 0.061 | 0.061 |
GBM | 0.055 | 0.055 | 0.056 | 0.056 | 0.056 | 0.056 | 0.056 |
SVM | 0.053 | 0.053 | 0.054 | 0.055 | 0.054 | 0.055 | 0.055 |
KNN, k = 3 | 0.058 | 0.057 | 0.057 | 0.058 | 0.057 | 0.058 | 0.058 |
KNN, k = 10 | 0.060 | 0.060 | 0.059 | 0.061 | 0.060 | 0.061 | 0.059 |
DAMS1 | 0.057 | 0.056 | 0.054 | 0.058 | 0.057 | 0.058 | 0.055 |
DAMS2 | 0.056 | 0.055 | 0.053 | 0.057 | 0.056 | 0.056 | 0.054 |
**95 % CI coverage** | | | | | | | |
Logistic | 100 | 100 | 100 | 99.9 | 100 | 100 | 100 |
LDA | 100 | 100 | 100 | 99.9 | 100 | 100 | 99.9 |
CART | 75.9 | 83.6 | 90.9 | 81.3 | 84.4 | 73.3 | 83.2 |
PRUNE | 62.9 | 73.2 | 89.3 | 74.7 | 78.6 | 70.4 | 78.4 |
BAG | 90.7 | 93.1 | 95.6 | 90.1 | 90.9 | 89.2 | 91.4 |
RF | 99.8 | 99.9 | 99.9 | 99.9 | 99.9 | 99.7 | 99.7 |
GBM | 98.9 | 98.9 | 98.9 | 98.2 | 97.8 | 98.5 | 97.1 |
SVM | 87.2 | 86.1 | 95.4 | 87.1 | 86.0 | 86.6 | 92.0 |
KNN, k = 3 | 22.8 | 34.5 | 44.5 | 19.1 | 27.7 | 19.8 | 44.1 |
KNN, k = 10 | 98.5 | 98.9 | 98.7 | 98.6 | 98.8 | 99.1 | 99.2 |
DAMS1 | 100 | 99.9 | 99.9 | 99.7 | 99.6 | 99.8 | 99.7 |
DAMS2 | 99.7 | 99.4 | 98.8 | 99.4 | 99.0 | 99.6 | 97.1 |

Simulation results for average causal effect among the treated (ACET)

Method | A | B | C | D | E | F | G |
---|---|---|---|---|---|---|---|
**ASAM (\(SAM_{max}\))** | | | | | | | |
Logistic | 0.04 (0.09) | 0.04 (0.10) | 0.05 (0.16) | 0.06 (0.13) | 0.06 (0.15) | 0.06 (0.15) | 0.08 (0.23) |
LDA | 0.04 (0.10) | 0.04 (0.11) | 0.05 (0.16) | 0.06 (0.13) | 0.06 (0.15) | 0.07 (0.15) | 0.08 (0.23) |
CART | 0.16 (0.32) | 0.15 (0.32) | 0.14 (0.32) | 0.17 (0.36) | 0.16 (0.35) | 0.15 (0.35) | 0.14 (0.34) |
PRUNE | 0.17 (0.34) | 0.16 (0.33) | 0.15 (0.32) | 0.18 (0.37) | 0.17 (0.37) | 0.16 (0.36) | 0.15 (0.36) |
BAG | 0.13 (0.25) | 0.13 (0.24) | 0.12 (0.26) | 0.14 (0.28) | 0.14 (0.27) | 0.12 (0.25) | 0.11 (0.27) |
RF | 0.07 (0.16) | 0.07 (0.15) | 0.07 (0.17) | 0.08 (0.18) | 0.08 (0.18) | 0.07 (0.17) | 0.07 (0.17) |
GBM | 0.07 (0.15) | 0.07 (0.14) | 0.07 (0.17) | 0.07 (0.15) | 0.07 (0.15) | 0.07 (0.14) | 0.07 (0.16) |
SVM | 0.09 (0.23) | 0.08 (0.21) | 0.06 (0.19) | 0.09 (0.20) | 0.09 (0.20) | 0.09 (0.19) | 0.07 (0.17) |
KNN, k = 3 | 0.21 (0.44) | 0.19 (0.41) | 0.16 (0.32) | 0.22 (0.42) | 0.20 (0.40) | 0.22 (0.40) | 0.17 (0.36) |
KNN, k = 10 | 0.09 (0.22) | 0.09 (0.21) | 0.08 (0.21) | 0.11 (0.24) | 0.10 (0.23) | 0.10 (0.25) | 0.09 (0.22) |
DAMS1 | 0.05 (0.11) | 0.05 (0.11) | 0.05 (0.16) | 0.05 (0.12) | 0.05 (0.12) | 0.05 (0.12) | 0.05 (0.13) |
DAMS2 | 0.06 (0.12) | 0.05 (0.12) | 0.06 (0.17) | 0.06 (0.13) | 0.06 (0.13) | 0.05 (0.12) | 0.06 (0.14) |
**Absolute bias (%)** | | | | | | | |
Logistic | 9.35 | 9.67 | 11.23 | 12.58 | 14.89 | 15.25 | 22.27 |
LDA | 9.29 | 9.69 | 11.34 | 12.61 | 14.98 | 15.18 | 22.33 |
CART | 19.08 | 14.70 | 16.29 | 16.94 | 14.94 | 20.85 | 18.59 |
PRUNE | 23.49 | 17.66 | 16.82 | 18.90 | 16.50 | 21.20 | 19.22 |
BAG | 11.77 | 9.38 | 10.69 | 10.20 | 8.94 | 10.22 | 10.40 |
RF | 8.82 | 8.02 | 10.48 | 9.47 | 8.55 | 9.07 | 10.78 |
GBM | 9.98 | 8.61 | 8.53 | 9.37 | 8.11 | 8.50 | 8.60 |
SVM | 14.81 | 12.32 | 9.17 | 12.33 | 11.18 | 11.57 | 10.59 |
KNN, k = 3 | 31.24 | 22.71 | 18.26 | 29.52 | 21.16 | 27.94 | 16.91 |
KNN, k = 10 | 11.88 | 9.75 | 10.09 | 10.92 | 9.63 | 10.58 | 11.30 |
DAMS1 | 8.12 | 6.93 | 6.85 | 7.92 | 7.53 | 8.14 | 8.44 |
DAMS2 | 9.34 | 7.88 | 7.10 | 8.67 | 7.51 | 7.86 | 7.24 |
**Standard error** | | | | | | | |
Logistic | 0.068 | 0.068 | 0.061 | 0.076 | 0.075 | 0.074 | 0.069 |
LDA | 0.068 | 0.067 | 0.061 | 0.076 | 0.074 | 0.074 | 0.069 |
CART | 0.060 | 0.060 | 0.068 | 0.062 | 0.062 | 0.062 | 0.067 |
PRUNE | 0.058 | 0.059 | 0.067 | 0.060 | 0.061 | 0.061 | 0.066 |
BAG | 0.057 | 0.056 | 0.062 | 0.058 | 0.057 | 0.058 | 0.060 |
RF | 0.065 | 0.063 | 0.069 | 0.067 | 0.064 | 0.066 | 0.067 |
GBM | 0.061 | 0.060 | 0.062 | 0.064 | 0.062 | 0.063 | 0.063 |
SVM | 0.057 | 0.058 | 0.059 | 0.060 | 0.060 | 0.060 | 0.061 |
KNN, k = 3 | 0.063 | 0.062 | 0.063 | 0.064 | 0.063 | 0.064 | 0.065 |
KNN, k = 10 | 0.066 | 0.065 | 0.064 | 0.067 | 0.065 | 0.067 | 0.065 |
DAMS1 | 0.062 | 0.061 | 0.058 | 0.065 | 0.063 | 0.065 | 0.060 |
DAMS2 | 0.061 | 0.059 | 0.057 | 0.064 | 0.062 | 0.063 | 0.059 |
**95 % CI coverage** | | | | | | | |
Logistic | 99.7 | 99.6 | 99.1 | 99.3 | 97.8 | 98.3 | 89.0 |
LDA | 99.6 | 99.6 | 99.2 | 99.2 | 97.9 | 97.8 | 88.9 |
CART | 77.9 | 89.6 | 90.0 | 84.9 | 89.4 | 75.6 | 84.4 |
PRUNE | 64.4 | 80.3 | 88.6 | 78.6 | 84.2 | 74.4 | 81.6 |
BAG | 95.5 | 98.9 | 97.6 | 96.3 | 98.6 | 97.0 | 97.9 |
RF | 99.5 | 99.5 | 99.4 | 99.6 | 99.8 | 99.5 | 98.6 |
GBM | 98.1 | 99.3 | 99.7 | 99.2 | 99.8 | 98.9 | 99.2 |
SVM | 90.9 | 96.4 | 98.5 | 96.3 | 96.2 | 95.7 | 97.2 |
KNN, k = 3 | 49.1 | 72.6 | 84.8 | 55.0 | 77.3 | 61.5 | 87.7 |
KNN, k = 10 | 96.1 | 98.6 | 98.8 | 98.2 | 99.4 | 98.8 | 97.9 |
DAMS1 | 99.8 | 99.9 | 99.9 | 100 | 99.9 | 99.8 | 99.5 |
DAMS2 | 98.7 | 99.5 | 99.9 | 99.5 | 99.8 | 99.2 | 99.9 |

There are several conclusions from the simulation results. (In the tables, DAMS1 denotes the combination of LR with random forests by the proposed method, and DAMS2 the combination of LR with GBM.) First of all, parametric methods tend to yield lower bias but higher variance than nonparametric methods; in general, both parametric methods (LR and LDA) perform reasonably well. Second, among all the nonparametric methods that we tested, random forests and GBM tend to perform best in terms of bias, variance and coverage probabilities. These two conclusions support our proposed method: because of the trade-off between bias and variance across parametric and nonparametric methods, it is reasonable to combine a parametric method with a nonparametric one, and because of the superior performance of random forests and GBM over the other nonparametric algorithms, we choose them as the nonparametric component in the newly proposed estimator.

In terms of estimating ACE, the newly proposed methods (DAMS1 and DAMS2) have smaller standard errors than LR, and as the propensity score model becomes more complicated, DAMS1 tends to give the smallest bias. For estimating ACET, DAMS1 yields smaller bias and variance than LR and random forests in each of the seven scenarios; the same conclusion applies to DAMS2 relative to LR and GBM. Meanwhile, DAMS2 outperforms DAMS1 when the true propensity score model becomes more complicated.

In observational studies, balance statistics (ASAM and \(SAM_{max}\)) are important indicators of the performance of different propensity score models. The underlying idea is that by achieving balance in the covariates (potential confounders), the bias in the treatment effect estimate due to measured covariates can be reduced (Harder et al. 2010). From the simulation results, we find that in estimating ACE, the largest ASAM value for logistic regression is only 0.03 and the largest \(SAM_{max}\) is 0.07. Since covariate balance is achieved by LR, there is not much to gain by combining parametric with nonparametric models, though the variances of the estimates are reduced. In estimating ACET, the imbalance under LR can be larger: in Scenario (G), for example, the \(SAM_{max}\) value for logistic regression is 0.22. In this case, by combining different models, the bias and variance of the causal effect estimates are greatly reduced. In addition, for Scenarios D–G, DAMS1 and DAMS2 achieve better balance than LR.

We further performed a small simulation study to test our proposed method according to the cross-validation criterion. Based on the same data generation procedure as in the previous simulation study, we simulate 100 data sets with sample size \(n=1000\). The Monte-Carlo cross-validation is performed \(V=25\) times with 50 % of the data used in the training set each time. The reference model is set to be the LR model. The cross-validation values are calculated and shown in Table 3.

Monte-Carlo cross-validation (\(C_v\)) for average causal effect among the treated (ACET)

Method | A | B | C | D | E | F | G |
---|---|---|---|---|---|---|---|
Logistic | 0.0104 | 0.0085 | 0.0077 | 0.0126 | 0.0117 | 0.0111 | 0.0270 |
LDA | 0.0103 | 0.0085 | 0.0077 | 0.0125 | 0.0114 | 0.0111 | 0.0270 |
CART | 0.0149 | 0.0132 | 0.0200 | 0.0196 | 0.0176 | 0.0210 | 0.0275 |
PRUNE | 0.0187 | 0.0181 | 0.0232 | 0.0254 | 0.0222 | 0.0237 | 0.0276 |
BAG | 0.0095 | 0.0090 | 0.0116 | 0.0130 | 0.0121 | 0.0126 | 0.0183 |
RF | 0.0093 | 0.0083 | 0.0086 | 0.0117 | 0.0110 | 0.0114 | 0.0187 |
GBM | 0.0086 | 0.0084 | 0.0091 | 0.0122 | 0.0113 | 0.0120 | 0.0176 |
SVM | 0.0118 | 0.0133 | 0.0107 | 0.0179 | 0.0184 | 0.0187 | 0.0194 |
KNN, k = 3 | 0.0240 | 0.0225 | 0.0213 | 0.0357 | 0.0288 | 0.0389 | 0.0259 |
KNN, k = 10 | 0.0107 | 0.0099 | 0.0088 | 0.0140 | 0.0124 | 0.0137 | 0.0192 |
DAMS1 | 0.0084 | 0.0073 | 0.0059 | 0.0108 | 0.0099 | 0.0099 | 0.0163 |
DAMS2 | 0.0083 | 0.0077 | 0.0070 | 0.0113 | 0.0104 | 0.0107 | 0.0168 |

### 5.3 A further simulation study: trimming large weights

Previous literature has shown that treatment effect estimates obtained by IPW are greatly influenced by subjects who receive the treatment but have \(\hat{e}(\mathbf{X})\approx 0\) and those who receive the control but have \(\hat{e}(\mathbf{X})\approx 1\) (in both cases, the weights are extremely large). Kang and Schafer (2007) argued that even a mild lack of fit of the propensity scores in these two regions will lead to large bias, and it has been shown that the LR model often fails to estimate well in these regions (Pregibon 1982). Our method works well in the simulation study because it avoids the two extreme cases and shrinks outlying weights to more sensible values: if \(Z=1\) and \(\hat{e}_1(\mathbf{X})\approx 0\), then \(\hat{e}(\mathbf{X}, \hat{\lambda })\approx \hat{e}_2(\mathbf{X})\) in the proposed estimator; likewise, if \(Z=0\) and \(\hat{e}_1(\mathbf{X})\approx 1\), then \(\hat{e}(\mathbf{X}, \hat{\lambda })\approx \hat{e}_2(\mathbf{X})\).
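A small numeric illustration of this shrinkage: when a treated subject has a near-zero LR propensity score but a moderate nonparametric one, the combined score pulls the IPW weight down from an extreme value. The values of \(e_1\), \(e_2\) and the fixed \(\lambda \) below are hypothetical; the data-adaptive \(\hat{\lambda }\) in the proposed estimator can move the combination even further toward \(\hat{e}_2\).

```python
# Hypothetical treated subject: LR gives e1 ~ 0, the nonparametric model gives
# a moderate e2. The combined score shrinks the IPW weight 1/e from an extreme
# value (about 1000) to a sensible one (about 13).
e1, e2, lam = 0.001, 0.15, 0.5
e_comb = lam * e1 + (1 - lam) * e2
w_lr, w_comb = 1 / e1, 1 / e_comb   # IPW weights for a treated subject
```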

To further illustrate this point, we revisited a simulation study by Kang and Schafer (2007). The purpose of that study is to estimate a population mean in the presence of missing data. The authors argue that propensity score based methods perform badly for this task partly because of the extreme weights that arise when the propensity score model is misspecified. We re-ran the simulation to check whether our proposed method can improve the performance of propensity score based methods.

Simulation results for double-robust estimators of \(\mu \)

Sample size | PS model | Method | Bias | SD | RMSE | MAE |
---|---|---|---|---|---|---|
\(n=200\) | Fit with \(x_{ij}\) | LR | 0.023 | 2.556 | 2.555 | 1.700 |
 | | DAMS1 | 0.024 | 2.556 | 2.555 | 1.705 |
 | | DAMS2 | 0.022 | 2.556 | 2.554 | 1.706 |
 | | CBPS | 0.023 | 2.555 | 2.554 | 1.704 |
 | Fit with \(m_{ij}\) | LR | −4.264 | 7.772 | 8.862 | 3.395 |
 | | DAMS1 | −1.532 | 3.484 | 3.804 | 2.406 |
 | | DAMS2 | −1.036 | 3.191 | 3.353 | 2.249 |
 | | CBPS | −2.606 | 3.389 | 4.274 | 2.841 |
\(n=500\) | Fit with \(x_{ij}\) | LR | −0.038 | 1.588 | 1.588 | 1.136 |
 | | DAMS1 | −0.037 | 1.588 | 1.587 | 1.144 |
 | | DAMS2 | −0.038 | 1.588 | 1.588 | 1.145 |
 | | CBPS | −0.038 | 1.588 | 1.588 | 1.137 |
 | Fit with \(m_{ij}\) | LR | −6.012 | 8.553 | 10.451 | 4.031 |
 | | DAMS1 | −1.722 | 2.153 | 2.757 | 1.812 |
 | | DAMS2 | −1.388 | 1.985 | 2.421 | 1.654 |
 | | CBPS | −3.208 | 2.310 | 3.952 | 3.157 |
\(n=1000\) | Fit with \(x_{ij}\) | LR | 0.054 | 1.126 | 1.127 | 0.753 |
 | | DAMS1 | 0.054 | 1.126 | 1.127 | 0.752 |
 | | DAMS2 | 0.054 | 1.127 | 1.127 | 0.751 |
 | | CBPS | 0.054 | 1.126 | 1.127 | 0.751 |
 | Fit with \(m_{ij}\) | LR | −7.426 | 10.396 | 12.772 | 4.892 |
 | | DAMS1 | −1.880 | 1.723 | 2.550 | 1.786 |
 | | DAMS2 | −1.539 | 1.373 | 2.062 | 1.586 |
 | | CBPS | −3.517 | 1.719 | 3.914 | 3.438 |

For double-robust estimators based on the true covariates \(x_{ij}\) (see Table 4), the performance of LR and our proposed methods is very similar. The LR model we fit here is the true underlying propensity score model, so this implies that when the LR model is correct, combining it with a nonparametric model (RF or GBM) does not harm performance. Based on one simulated data set, we find that the correlation between the propensity scores estimated by logistic regression and by GBM is 0.9, and the average weights placed on logistic regression and GBM are 0.52 and 0.48, respectively. When the misspecified covariates \(m_{ij}\) are observed, the estimates by logistic regression are highly biased and have large variances, partially due to “occasional highly erratic estimates produced by a few enormous weights” (Kang and Schafer 2007). In this case, our proposed methods greatly reduce the variance and yield less bias: compared to LR, they reduce the RMSEs by 57.1 % (\(n=200\), DAMS1), 62.2 % (\(n=200\), DAMS2), 73.6 % (\(n=500\), DAMS1), 76.8 % (\(n=500\), DAMS2), 80.0 % (\(n=1000\), DAMS1) and 83.9 % (\(n=1000\), DAMS2). When the covariates are misspecified, CBPS improves the performance of LR but has larger bias than DAMS1 and DAMS2 because it achieves balance in the misspecified covariates.

## 6 Data analysis example

We next fit a sequence of survival analysis models to check whether treatment, baseline characteristics or tumor characteristics have a statistically significant effect on the survival time of patients with IHC. As described in Online Resource (Table 3), Age, Treatment, Race/ethnicity, Grade, Stage, and Year of diagnosis are all significant univariate predictors of survival from IHC. We then fit a multivariate Cox proportional hazards model with all the significant predictors found in the univariate analyses. Since Grade has too many missing values (68.67 %), we excluded this variable from the multivariate model; Stage has a 29.56 % missing rate, but we kept it in the model. Based on the multivariate PH analysis in Online Resource (Table 3), we take Age, Race/ethnicity, Stage, and Year of diagnosis to be potential confounders in this study.

This *ad hoc* modification can be problematic because these patients carry important information. In Table 5, we denote this method as \(\text {RF}^1\). In addition, the random forests model yields some extremely large weights (see Fig. 2). We therefore employ an alternative approach to this problem: we shrink the top 5 % of the weights (including the infinite weights) to the 95th percentile, denoted \(\text {RF}^2\) in Table 5. As can be seen from both analyses, \(\text {RF}^1\) has the largest standard error of \(\hat{\gamma }\) (\(e^{\hat{\gamma }}\) is the estimated hazard ratio of the treatment) and the largest ASAM value among all the approaches. In analysis 1, the covariate balance under \(\text {RF}^1\) is even worse than without weighting (0.177 vs. 0.123). Shrinking the large weights to the 95th percentile (\(\text {RF}^2\)) greatly improves the performance, but it remains worse than DAMS1, the weighted average of LR and random forests. Compared to the multivariate model and the random forests models, the IPW-adjusted models based on LR and GBM have smaller variances, and DAMS2 achieves the best performance in terms of both variance and covariate balance. In conclusion, the use of radiation therapy is found to significantly improve survival by both the multivariate model and the propensity score-adjusted models. This finding is consistent with what has been reported in Shinohara et al. (2008).

Table 5 Survival analysis results

| Model | \(\hat{\gamma }\) (SE) | HR (95 % CI) | ASAM |
|---|---|---|---|
| **Radiation with surgery versus surgery only** | | | |
| Multivariate model | \(-\)0.3236 (0.0865) | 0.72 (0.61–0.86) | 0.123 |
| IPW adjusted model by LR | \(-\)0.2165 (0.0853) | 0.81 (0.68–0.95) | 0.015 |
| IPW adjusted model by \(\text {RF}^1\) | \(-\)0.1246 (0.1255) | 0.88 (0.69–1.13) | 0.177 |
| IPW adjusted model by \(\text {RF}^2\) | \(-\)0.2401 (0.0897) | 0.79 (0.66–0.94) | 0.044 |
| IPW adjusted model by GBM | \(-\)0.2045 (0.0818) | 0.82 (0.69–0.96) | 0.021 |
| IPW adjusted model by DAMS1 | \(-\)0.2353 (0.0882) | 0.78 (0.66–0.94) | 0.020 |
| IPW adjusted model by DAMS2 | \(-\)0.2098 (0.0814) | 0.81 (0.69–0.95) | 0.014 |
| **Radiation only versus no treatment** | | | |
| Multivariate model | \(-\)0.5189 (0.0747) | 0.60 (0.51–0.69) | 0.152 |
| IPW adjusted model by LR | \(-\)0.4646 (0.0666) | 0.63 (0.55–0.72) | 0.035 |
| IPW adjusted model by \(\text {RF}^1\) | \(-\)0.5745 (0.1079) | 0.56 (0.46–0.70) | 0.116 |
| IPW adjusted model by \(\text {RF}^2\) | \(-\)0.5536 (0.0744) | 0.57 (0.50–0.67) | 0.073 |
| IPW adjusted model by GBM | \(-\)0.4620 (0.0655) | 0.63 (0.55–0.72) | 0.027 |
| IPW adjusted model by DAMS1 | \(-\)0.4862 (0.0682) | 0.61 (0.54–0.70) | 0.027 |
| IPW adjusted model by DAMS2 | \(-\)0.4677 (0.0651) | 0.63 (0.55–0.71) | 0.024 |

For the above results to be valid, one needs to assume that all confounders in the study are measured. In practice, sensitivity analyses are often conducted to investigate how unmeasured covariates could affect the inferred causal effect. A sensitivity analysis using the approaches of Lin et al. (1998) and Mitra and Heitjan (2007) was performed, with results given in Online Resource (Table 4); it demonstrates that the results obtained from our proposed method are robust to an unmeasured confounder omitted from the model. In Online Resource, we also show the performance of the proposed method through a simulation study for survival data.

## 7 Discussion

In this article, we have developed a class of weighted estimators with desirable properties for estimating the average causal effect. The approach combines a traditional parametric model with more recently developed nonparametric machine learning models for estimating propensity scores: we propose a weighted average of LR and a machine learning algorithm (random forests or GBM), with the weights properly chosen. The proposed methodology is similar in spirit to the super learner (van der Laan et al. 2007) and targeted learning (van der Laan and Rose 2011). The first simulation study shows that the newly proposed method reduces the variance of the estimates, as well as the bias in most cases. The second simulation study demonstrates that when the LR model is misspecified, our proposed method can shrink large weights to produce less biased and less variable estimates. When the machine learning algorithm fails because of infinite weights, both the second simulation study and the data analysis example show that our proposed method still works properly.
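The structure of the estimator can be sketched as follows; the data-adaptive choice of the mixing weight \(\lambda\) is given by (7) and not reproduced here, so `lam` below is a placeholder:

```python
import numpy as np

def ipw_ate(y, t, p):
    """Horvitz-Thompson IPW estimate of the average causal effect."""
    y, t, p = map(np.asarray, (y, t, p))
    return np.mean(t * y / p - (1 - t) * y / (1 - p))

def combine_scores(p_lr, p_ml, lam):
    """Convex combination of parametric (LR) and nonparametric (RF/GBM) propensity scores."""
    return lam * np.asarray(p_lr) + (1 - lam) * np.asarray(p_ml)

# toy check: constant treatment effect of 2 under a known propensity score of 0.5
t = np.array([1, 0, 1, 0])
y = np.array([3.0, 1.0, 3.0, 1.0])
p = combine_scores([0.5] * 4, [0.5] * 4, lam=0.5)
print(ipw_ate(y, t, p))  # 2.0
```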

In causal inference problems, propensity scores are nuisance parameters; what we are really interested in is estimating the causal treatment effect. One way to evaluate the newly proposed approach for modeling propensity scores is to see, in simulations, how close the estimates are to the true propensity scores. However, Lunceford and Davidian (2004) showed that conditioning on the estimated propensity score rather than the true propensity score can yield a smaller variance of the estimated ACE or ACET. Therefore, we should focus instead on the quality of the causal estimates. An alternative way to evaluate a propensity score model is to focus on covariate balance: by achieving balance, the bias in the treatment effect estimate due to measured covariates can be reduced (Harder et al. 2010). Recently, some of the literature has focused on estimating propensity scores by achieving balance in the covariates; examples are McCaffrey et al. (2004), Hainmueller (2012) and Imai and Ratkovic (2014). Following the same idea, future work could develop a formula for \(\lambda \) in (7) that optimizes the balance in the covariates.
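A balance metric of the kind reported as ASAM in Table 5 can be sketched as follows. This follows one common convention (weighted group means, standardized by the treated-group standard deviation); the paper's exact definition may differ:

```python
import numpy as np

def asam(X, t, w):
    """Average standardized absolute mean difference across covariate columns,
    using weighted group means (e.g. IPW weights) and treated-group SDs."""
    X, w = np.asarray(X, float), np.asarray(w, float)
    t = np.asarray(t, bool)
    diffs = []
    for j in range(X.shape[1]):
        x = X[:, j]
        m1 = np.average(x[t], weights=w[t])    # weighted treated-group mean
        m0 = np.average(x[~t], weights=w[~t])  # weighted control-group mean
        s = x[t].std(ddof=1)                   # standardize by treated-group SD
        diffs.append(abs(m1 - m0) / s)
    return float(np.mean(diffs))

# perfectly balanced toy covariate: ASAM is 0
X = np.array([[1.0], [2.0], [3.0], [1.0], [2.0], [3.0]])
t = np.array([1, 1, 1, 0, 0, 0])
w = np.ones(6)
print(asam(X, t, w))  # 0.0
```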

It should be noted that the proposed procedure is tailored to causal effects estimated using IPW. In particular, as Kang and Schafer (2007) showed, IPW procedures can suffer from poor performance due to model misspecification, which manifests itself in observations with extreme weights. In this regard, the proposed methodology can be viewed as developing “robust” weights that incorporate all observations while keeping the weights from becoming too extreme. While practitioners of causal inference know that observations with extreme weights need to be downweighted when estimating causal effects, many of the proposed solutions have been *ad hoc*, whereas our procedure is more principled.

Although the IHC study in our data analysis example has four treatments, we divided them into two groups and focused on causal inference for binary treatments. It would also be desirable to explore how to improve causal inference in the setting of multi-level treatments, extending the work of Imai and van Dyk (2004), Tchernis et al. (2005) and McCaffrey et al. (2013).

## Acknowledgments

The authors thank Brian Lee for making his code available. The work of Zhu and Ghosh was supported by the National Institute on Drug Abuse Grant P50 DA010075-16 and NCI Grant CA 129102. The work of Mukherjee was supported by NSF Grant DMS-1007494 and NIH/NCI Grant CA156608. The content of this manuscript is solely the responsibility of the author(s) and does not necessarily represent the official views of the National Institute on Drug Abuse or the National Institutes of Health. Mitra would like to acknowledge Eric Shinohara, MD for making the cholangiocarcinoma data available to us.

## References

- Biau, G., Devroye, L., Lugosi, G.: Consistency of random forests and other averaging classifiers. J. Mach. Learn. Res. 9, 2015–2033 (2008)
- Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
- Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
- Breiman, L., Friedman, J., Stone, C., Olshen, R.: Classification and Regression Trees. Chapman & Hall/CRC, Boca Raton (1984)
- Brookhart, M.A., van der Laan, M.J.: A semiparametric model selection criterion with applications to the marginal structural model. Comput. Stat. Data Anal. 50(2), 475–498 (2006)
- Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
- Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997)
- Hainmueller, J.: Entropy balancing for causal effects: a multivariate reweighting method to produce balanced samples in observational studies. Political Anal. 20(1), 25–46 (2012)
- Harder, V.S., Stuart, E.A., Anthony, J.C.: Propensity score techniques and the assessment of measured covariate balance to test causal associations in psychological research. Psychol. Methods 15(3), 234–249 (2010)
- Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2009)
- Hoeting, J., Madigan, D., Raftery, A., Volinsky, C.: Bayesian model averaging: a tutorial. Stat. Sci. 14(4), 382–401 (1999)
- Imai, K., Ratkovic, M.: Covariate balancing propensity score. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 76(1), 243–263 (2014)
- Imai, K., van Dyk, D.: Causal inference with general treatment regimes. J. Am. Stat. Assoc. 99(467), 854–866 (2004)
- Kang, J.D.Y., Schafer, J.L.: Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat. Sci. 22(4), 523–539 (2007)
- Kouassi, D.A., Singh, J.: A semiparametric approach to hazard estimation with randomly censored observations. J. Am. Stat. Assoc. 92(440), 1351–1355 (1997)
- Lee, B.K., Lessler, J., Stuart, E.A.: Improving propensity score weighting using machine learning. Stat. Med. 29(3), 337–346 (2010)
- Lin, D., Psaty, B., Kronmal, R.: Assessing the sensitivity of regression results to unmeasured confounders in observational studies. Biometrics 54(3), 948–963 (1998)
- Lunceford, J.K., Davidian, M.: Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat. Med. 23(19), 2937–2960 (2004)
- Mays, J.E., Birch, J.B., Starnes, B.A.: Model robust regression: combining parametric, nonparametric, and semiparametric methods. J. Nonparametric Stat. 13(2), 245–277 (2001)
- McCaffrey, D.F., Griffin, B.A., Almirall, D., Slaughter, M.E., Ramchand, R., Burgette, L.F.: A tutorial on propensity score estimation for multiple treatments using generalized boosted models. Stat. Med. 32(19), 3388–3414 (2013)
- McCaffrey, D.F., Ridgeway, G., Morral, A.R.: Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychol. Methods 9(4), 403–425 (2004)
- Mitra, N., Heitjan, D.F.: Sensitivity of the hazard ratio to nonignorable treatment assignment in an observational study. Stat. Med. 26(6), 1398–1414 (2007)
- Nottingham, Q.J., Birch, J.B.: A semiparametric approach to analysing dose-response data. Stat. Med. 19(3), 389–404 (2000)
- Olkin, I., Spiegelman, C.H.: A semiparametric approach to density estimation. J. Am. Stat. Assoc. 82(399), 858–865 (1987)
- Pregibon, D.: Resistant fits for some commonly used logistic models with medical applications. Biometrics 38, 485–498 (1982)
- Ridgeway, G.: The state of boosting. Comput. Sci. Stat. 31, 172–181 (1999)
- Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41–55 (1983)
- Rubin, D.B.: Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 66(5), 688–701 (1974)
- Setoguchi, S., Schneeweiss, S., Brookhart, M.A., Glynn, R.J., Cook, E.F.: Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiol. Drug Saf. 17(6), 546–555 (2008)
- Shinohara, E.T., Mitra, N., Guo, M., Metz, J.M.: Radiation therapy is associated with improved survival in the adjuvant and definitive treatment of intrahepatic cholangiocarcinoma. Int. J. Radiat. Oncol. Biol. Phys. 72(5), 1495–1501 (2008)
- Stefanski, L.A., Boos, D.D.: The calculus of M-estimation. Am. Stat. 56(1), 29–38 (2002)
- Tchernis, R., Horvitz-Lennon, M., Normand, S.L.T.: On the use of discrete choice models for causal inference. Stat. Med. 24(14), 2197–2212 (2005)
- van der Laan, M.J., Polley, E.C., Hubbard, A.E.: Super learner. Stat. Appl. Genet. Mol. Biol. 6(1), 1–21 (2007)
- van der Laan, M.J., Rose, S.: Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, New York (2011)
- White, H.: Maximum likelihood estimation of misspecified models. Econometrica 50(1), 1–25 (1982)
- Yang, Y.: Adaptive regression by mixing. J. Am. Stat. Assoc. 96(454), 574–588 (2001)
- Yuan, Z., Ghosh, D.: Combining multiple biomarker models in logistic regression. Biometrics 64(2), 431–439 (2008)
- Yuan, Z., Yang, Y.: Combining linear regression models. J. Am. Stat. Assoc. 100(472), 1202–1214 (2005)
- Zhang, T., Yu, B.: Boosting with early stopping: convergence and consistency. Ann. Stat. 33(4), 1538–1579 (2005)