A data-adaptive strategy for inverse weighted estimation of causal effects

  • Yeying Zhu
  • Debashis Ghosh
  • Nandita Mitra
  • Bhramar Mukherjee

Abstract

In most nonrandomized observational studies, differences between treatment groups may arise not only due to the treatment but also because of the effect of confounders. Therefore, causal inference regarding the treatment effect is not as straightforward as in a randomized trial. To adjust for confounding due to measured covariates, the average treatment effect is often estimated by using propensity scores. Typically, propensity scores are estimated by logistic regression. More recent suggestions have been to employ nonparametric classification algorithms from machine learning. In this article, we propose a weighted estimator combining parametric and nonparametric models. Some theoretical results regarding consistency of the procedure are given. Simulation studies are used to assess the performance of the newly proposed methods relative to existing methods, and a data analysis example from the Surveillance, Epidemiology and End Results database is presented.

Keywords

Boosting algorithms · Causal inference · Logistic regression · Observational data · Random forests

1 Introduction

In the recent clinical literature, a common framework for assessing the causal effect of a treatment on a response is based on the potential outcomes framework advocated by Rubin (1974) and Rosenbaum and Rubin (1983). The authors of the latter paper proposed the concept of the propensity score, which is defined as the probability of receiving the treatment given the covariates. They further demonstrated that, conditional on the propensity score, the observed outcomes from an observational study can be viewed as coming from a randomized study.

There are a variety of approaches to adjusting for the propensity score, summarized nicely in an overview by Lunceford and Davidian (2004). Examples of propensity score-based approaches are inverse probability weighting (IPW), matching, subclassification and double-robust estimation. In this article, we focus on the use of IPW to estimate the treatment effect, which we define in Sect. 2. While this has been a popular approach to the estimation of causal effects, Kang and Schafer (2007) argued against the use of such methods because causal effects estimated using IPW are sensitive to observations with large weights. However, most of these observations/weights are informative to the analysis, so they cannot be completely discarded. For example, if a treated subject has a low propensity score, the observed outcome of this subject is highly informative about the missing potential outcome for those in the control (untreated) group. An open question is how to deal with extreme weights in IPW estimation procedures.

Another theme addressed in the article is the choice of modeling procedure for propensity scores. It is possible that different methods for estimating the propensity score may lead to different estimates of the treatment effect. In the statistical literature, propensity scores have been typically estimated by logistic regression. Recently, several studies employed machine learning methods as an alternative to logistic regression (LR) for modeling propensity scores (Setoguchi et al. 2008; Lee et al. 2010; McCaffrey et al. 2004).

In this work, we propose combining parametric and nonparametric estimators of propensity scores for inverse weighted estimation of causal effects. Unlike previous work in the area, we are able to prove some theoretical properties of these estimators. These estimators have the overall effect of shrinking extreme weights in the IPW estimation procedure, which leads to better finite-sample performance of the average causal effect estimators; this will be seen through a sequence of simulation studies in Sect. 5. Similar ideas for hybrid or weighted estimation have appeared in other research areas. For example, Olkin and Spiegelman (1987) proposed a semiparametric approach to density estimation which combines the parametric maximum likelihood estimator and the nonparametric kernel estimator. Kouassi and Singh (1997) developed a semiparametric estimator for the hazard function with randomly censored survival data; in their example, they combined the estimator from a Weibull parametric model with a kernel hazard estimator. Nottingham and Birch (2000) extended the semiparametric approach to quantal dose–response data, combining a quantal parametric model with a local linear regression estimator. Finally, inspired by a regression example where a parametric model is insufficient to fit the entire data set, Mays et al. (2001) introduced several semiparametric approaches to improve the fit, one of which combines the regression fit of ordinary least squares and the fit of local linear regression. In all the above-mentioned approaches, the proposed estimators can be written in a unified manner. If the parametric estimator of the target estimand is denoted \(y^{P}\) and the corresponding nonparametric estimator \(y^{NP}\), the semiparametric estimator can be written as:
$$\begin{aligned} y^{SP}=\lambda y^{P}+(1-\lambda ) y^{NP}, \end{aligned}$$
(1)
where \(\lambda \) is a smoothing parameter estimated from the observed data. While our proposed estimator also combines parametric and nonparametric estimates, our approach is fundamentally different from the previous literature. First, the previous approaches are all one-stage methods, while we focus on a two-stage procedure. That is, we apply the combined estimator at stage one and use it to improve IPW estimation of the treatment effect at stage two; it is the properties of this second-stage estimator that we wish to understand. Second, in the previous literature, the smoothing parameter \(\lambda \) in (1) can be regarded as the weight placed on the parametric estimator, which is estimated by minimizing/maximizing a certain objective function, such as the MSE or PRESS of \(y^{SP}\). This idea is not directly applicable to the modeling of propensity scores because propensity scores are nuisance parameters, and a prediction model with improved MSE for propensity scores does not necessarily lead to better causal inference in the second stage (Lunceford and Davidian 2004; Setoguchi et al. 2008). In our proposed methods, weights are data-adaptive and locally calculated. Essentially, we are able to shrink extreme weights to more reasonable values so that the bias and variance of the inverse weighted estimation of causal effects can be reduced. Third, while the nonparametric component in the above-mentioned literature is estimated by a kernel estimator or a local linear/polynomial estimator, the nonparametric component in our approach is estimated by a machine learning algorithm, such as a tree-based classifier.

The layout of the paper is as follows. In Sect. 2, we review the potential outcomes framework. In Sect. 3, we describe two classes of methods for estimating propensity scores, parametric methods and nonparametric ones. We then propose a model-averaging approach that combines parametric and nonparametric estimators in Sect. 4. There, we also present some consistency results. In Sect. 5, we present a simulation study and show that the newly proposed method is superior in terms of reducing bias and variance of the causal effect estimates. We also show how the proposed procedure provides a statistically principled approach to downweighting extreme observations in IPW estimation procedures. In Sect. 6, we illustrate our method by comparing treatments for cholangiocarcinomas, a cancer of the bile ducts, using data collected through the Surveillance, Epidemiology and End Results database.

2 A review of the potential outcomes framework

Let \(Y\) denote the response of interest and \(\mathbf{X}\) be a \(p\)-dimensional vector of covariates. Let \(Z\) be a binary indicator of treatment exposure. We assume that \(Z\) takes the values \(\{0,1\}\): \(Z=1\) if treated, \(Z=0\) if control. Let the observed data be represented as \((Y_i,\mathbf{X}_i,Z_i), i=1,\ldots ,n\), a random sample from \((Y,\mathbf{X},Z)\). We further define \(\{Y_i(0),Y_i(1)\}\) to be the potential outcomes for subject \(i\) under control and treatment, respectively. What we observe is \(Y_i =Y_i(Z_i)\) \((i=1,\ldots ,n)\), which implies that \(Y(0)\) and \(Y(1)\) cannot be observed simultaneously, i.e. one of them is missing. Two possible parameters of interest are the average causal effect:
$$\begin{aligned} ACE =E[Y(1) - Y(0)], \end{aligned}$$
(2)
and the average causal effect among the treated:
$$\begin{aligned} ACET = E[Y(1)-Y(0)|Z=1]. \end{aligned}$$
(3)
ACET is of particular interest when the study population consists of those who actually receive the treatment. For example, suppose a researcher in a smoking cessation counseling program tries to persuade smokers to quit, and the research question is: for those who actually smoke, what is the difference in expected life expectancy had they not smoked? In this example, the researcher is interested in estimating ACET.
Based on the assumption that \(Z \perp \{Y(0),Y(1)\} \,|\, e(\mathbf{X})\), where \(e(\mathbf{X})\equiv P(Z=1|\mathbf{X})\) is the propensity score, causal inference is a two-stage modeling process. In the first stage, the propensity score is estimated as a function of covariates. In the second stage, ACE in (2) or ACET in (3) is estimated as a function of the treatment indicator (sometimes with other covariates), adjusted by the propensity score. In this article, we focus on IPW estimation. The estimators for the ACE and ACET are given by
$$\begin{aligned} \widehat{ACE} =\frac{\sum _{i=1}^n Y_i Z_i/\hat{e}(\mathbf{X}_i)}{\sum _{i=1}^n Z_i/\hat{e}(\mathbf{X}_i)}- \frac{\sum _{i=1}^n Y_i(1- Z_i)/(1-\hat{e}(\mathbf{X}_i))}{\sum _{i=1}^n (1-Z_i)/(1-\hat{e}(\mathbf{X}_i))} \end{aligned}$$
(4)
and
$$\begin{aligned} \widehat{ACET} =\frac{\sum _{i=1}^n Y_i Z_i}{\sum _{i=1}^n Z_i} - \frac{\sum _{i=1}^n Y_i(1- Z_i)\hat{e}(\mathbf{X}_i)/(1-\hat{e}(\mathbf{X}_i))}{\sum _{i=1}^n (1-Z_i)\hat{e}(\mathbf{X}_i)/(1- \hat{e}(\mathbf{X}_i))}, \end{aligned}$$
(5)
where \(\hat{e}(\mathbf{X}_i)\) is the estimated propensity score for subject \(i\). We will also refer to the first stage and second stage throughout the paper.
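For concreteness, the IPW estimators (4) and (5) can be computed directly from the observed data and the estimated propensity scores. The following minimal sketch (in Python with numpy; the function names are ours) mirrors the two formulas term by term.

```python
import numpy as np

def ipw_ace(y, z, e_hat):
    """Hajek-type IPW estimate of the ACE, Eq. (4)."""
    w1 = z / e_hat                # weights for the treated
    w0 = (1 - z) / (1 - e_hat)   # weights for the controls
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

def ipw_acet(y, z, e_hat):
    """IPW estimate of the ACET, Eq. (5): controls are reweighted
    by the odds e_hat / (1 - e_hat)."""
    odds = e_hat / (1 - e_hat)
    w0 = (1 - z) * odds
    return np.sum(z * y) / np.sum(z) - np.sum(w0 * y) / np.sum(w0)
```

With constant propensity scores of 0.5, both estimators reduce to a simple difference of group means, which provides a quick sanity check.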

Since the true values of propensity scores are unknown, it is necessary to estimate them in the first stage. The estimation of propensity scores sometimes involves a high-dimensional vector of covariates. Traditionally, this is done by LR. Recently, machine learning methods have been proposed to estimate propensity scores, such as classification trees or generalized boosted regression. Many simulation studies have shown that the estimation method employed in the first stage affects the finite-sample properties of the estimated treatment effect in the second stage. For example, Lee et al. (2010) showed that when there is moderate misspecification of the LR model, ensemble machine learning methods (random forests and generalized boosted regression) yield smaller bias and variance and more consistent 95 % confidence interval coverage.

3 Propensity score modeling

In the literature, the propensity score is usually estimated by LR, which is available in almost any statistical software. However, LR is not without drawbacks. In specifying a parametric form for \(e(\mathbf{X})\), including only main effects is usually not adequate, but it is also challenging to determine which interaction and nonlinear terms should be included, especially when the vector of covariates is high-dimensional. In addition, LR is not resistant to outliers (Kang and Schafer 2007; Pregibon 1982). In particular, Kang and Schafer (2007) showed that when the LR model is misspecified, IPW leads to large bias.
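For reference, a main-effects LR fit of the propensity score can be sketched as follows: an illustrative Newton-Raphson implementation in numpy, not a production routine, and `fit_logistic_ps` is our own name.

```python
import numpy as np

def fit_logistic_ps(X, z, n_iter=25):
    """Main-effects logistic regression fitted by Newton-Raphson;
    returns the estimated propensity scores (no regularization, no checks)."""
    Xd = np.column_stack([np.ones(len(z)), X])       # add an intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))         # current fitted probabilities
        grad = Xd.T @ (z - p)                        # score vector
        hess = Xd.T @ (Xd * (p * (1 - p))[:, None])  # observed information
        beta += np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-Xd @ beta))
```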

In the simulation study in Sect. 5, we also show the performance of another parametric approach for estimating propensity scores: linear discriminant analysis (LDA). In LDA, we assume
$$\begin{aligned} e(\mathbf{X})\equiv e(\mathbf{X},\mathbf{\mu _0},\mathbf{\mu _1}, \varSigma )=\frac{f_1(\mathbf{X})\pi _1}{f_0(\mathbf{X})\pi _0+f_1(\mathbf{X})\pi _1}, \end{aligned}$$
where \(f_i(\mathbf{X})\) is the class-conditional probability density function for class \(i\), assumed to follow a multivariate normal distribution:
$$\begin{aligned} f_i(\mathbf{X})=\frac{1}{(2\pi )^{p/2}|\varSigma |^{1/2}} e^{- (\mathbf{X}-\mu _i)^{T} \varSigma ^{-1}(\mathbf{X}-\mu _i)/2}, \quad i=0,1. \end{aligned}$$
In practice, the mean vectors \(\mu _0, \mu _1\) and the common covariance matrix \(\varSigma \), as well as the prior class probabilities \(\pi _0, \pi _1\) are estimated from training data using sample statistics. Unlike logistic regression, LDA requires assuming that \(\mathbf{X}\) follows a multivariate normal distribution. However, Hastie et al. (2009) claim that in most situations, the two models give very similar results, which is also the case in our simulation study.
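The LDA propensity estimate amounts to plugging sample means, a pooled covariance matrix, and class proportions into the ratio of Gaussian densities above; the normalizing constants cancel in the ratio. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def lda_ps(X, z):
    """Propensity scores from LDA: sample means, pooled covariance, and
    class proportions plugged into the Gaussian class-conditional densities."""
    X0, X1 = X[z == 0], X[z == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    n0, n1 = len(X0), len(X1)
    # pooled (common) covariance estimate
    S = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (n0 + n1 - 2)
    Sinv = np.linalg.inv(S)
    pi0, pi1 = n0 / len(z), n1 / len(z)

    def logdens(Xc, mu):
        # log density up to a constant: -(x - mu)^T Sinv (x - mu) / 2
        d = Xc - mu
        return -0.5 * np.einsum('ij,jk,ik->i', d, Sinv, d)

    a = pi1 * np.exp(logdens(X, mu1))
    b = pi0 * np.exp(logdens(X, mu0))
    return a / (a + b)
```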

As alternatives to LR, we will consider machine learning procedures such as K-nearest neighbors (KNN), support vector machines (SVM), classification and regression trees (CART) and its various extensions, such as pruned CART, bagged CART, random forests and boosting.

Except for SVM and KNN, all of the above-mentioned algorithms are based on the construction of tree classifiers. In general, a tree classifier works as follows: beginning with a training data set \((\mathbf{X}_i,Z_i), i=1,\ldots ,n\), a tree classifier repeatedly splits nodes based on one of the covariates in \(\mathbf{X}\), until some stopping criterion is met (for example, the terminal node contains training data from only one class). Each terminal node is then assigned a class label by the majority of the \(Z_i\) that fall in that terminal node. Once a testing data point with a covariate vector \(\mathbf{X}\) is introduced, it is passed down from the top of the tree until it reaches one of the terminal nodes, and the prediction is the class label of that terminal node. Compared to parametric algorithms, tree-based algorithms have several advantages. First of all, there is no need to assume any parametric model for a tree: in constructing a tree, the algorithm only needs to determine the criterion for splitting a node and when to stop splitting (Breiman et al. 1984). By splitting a tree based on different covariates at different nodes, the algorithm automatically includes interaction terms in the model. Second, because the algorithm is nonparametric, the tree classifier can pick important covariates (in a stepwise manner) even when \(\mathbf{X}\) is high dimensional or most of the covariates are highly correlated (McCaffrey et al. 2004). Moreover, a standard tree classifier is usually very fast to fit and robust to outliers (Breiman et al. 1984).

One of the biggest issues of a standard tree classifier is its tendency to overfit. That is, the constructed tree is usually too adaptive to the training data, and hence yields high prediction errors for testing data. To solve the over-fitting problem, pruned CART was proposed (Breiman et al. 1984). In pruned CART, a tree is fully grown and then pruned back until some stopping criteria are met. For example, the cross-validation error rate of the pruned tree reaches the minimum. Compared to a standard tree classifier, the pruned tree is smaller in size and yields lower prediction errors.

Another class of tree-based algorithms is called random forests, which was first introduced by Breiman (2001). It belongs to the category of so-called ensemble methods: instead of generating one classification tree, it generates many trees. At each node of a tree, a random subset of the covariates is selected and the node is split based on the best split among the selected covariates. For a testing data point with a covariate vector \(\mathbf{X}\), each tree votes for one of the classes and the prediction is made by the majority vote among the trees. In the first stage of causal inference, if we apply the random forests algorithm, the propensity score can be estimated as the proportion of trees that vote for class 1. Biau et al. (2008) proved the consistency of the random forests estimator in terms of predicting the class label. In that paper, they also commented that random forests are among the most accurate general-purpose classifiers available.

Bagged CART (Breiman 1996) also belongs to the category of ensemble tree classifiers. In this algorithm, bootstrap samples of the original training data are repeatedly drawn with replacement, and each bootstrap sample produces one classification tree. The bootstrap sample size is usually taken to be the same as that of the original data set. For a testing data point with a covariate vector \(\mathbf{X}\), the propensity score can be estimated in the same way as in random forests.
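The bagging scheme can be sketched as follows; for brevity, the illustrative code below uses single-split decision stumps in place of fully grown CART trees, an assumption made purely to keep the example short, and the propensity score is the average of the leaf class proportions across bootstrap fits.

```python
import numpy as np

def fit_stump(X, z):
    """Best single split (feature, threshold) by misclassification count;
    a one-node stand-in for a full CART tree, used to keep the sketch short."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], np.linspace(0.1, 0.9, 9)):
            mask = X[:, j] <= t
            if mask.all() or not mask.any():
                continue
            pl, pr = z[mask].mean(), z[~mask].mean()   # leaf class-1 proportions
            err = mask.sum() * min(pl, 1 - pl) + (~mask).sum() * min(pr, 1 - pr)
            if err < best_err:
                best_err, best = err, (j, t, pl, pr)
    return best

def bagged_ps(X, z, n_boot=50, seed=0):
    """Bagged propensity scores: average the leaf proportions of stumps fitted
    to bootstrap resamples drawn with replacement, each of the original size."""
    rng = np.random.default_rng(seed)
    n = len(z)
    scores = np.zeros(n)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)               # one bootstrap sample
        j, t, pl, pr = fit_stump(X[idx], z[idx])
        scores += np.where(X[:, j] <= t, pl, pr)
    return scores / n_boot
```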

Boosting is another class of algorithms that represents ensemble classifiers. Instead of taking a bootstrap sample from the training data with equal probabilities each time, the AdaBoost algorithm (Freund and Schapire 1997) suggests giving more weight to observations that were misclassified more often by the previous trees. Each tree is a weak classifier and the final classifier is a weighted average of all the trees. The generalized boosted model (GBM) (Ridgeway 1999) is an extension of boosting which can directly produce estimates of propensity scores. In GBMs, let \( g(\mathbf{X})=\text {log} [e(\mathbf{X})/(1-e(\mathbf{X}))]\); the log-likelihood function is:
$$\begin{aligned} l (g)=\sum _{i=1}^n Z_i g(\mathbf{X}_i)-\text {log}\{1+\text {exp}[g(\mathbf{X}_i)]\}. \end{aligned}$$
(6)
To maximize \(l(g)\) in (6), \(g(\mathbf{X})\) is updated at each iteration with \(g(\mathbf{X})+h(\mathbf{X})\), where \(h(\mathbf{X})\) is the fitted value from a regression tree which models \(\gamma _i=Z_i-1/\{1+\text {exp}[-g(\mathbf{X}_i)]\}\), the gradient of (6), i.e. the direction of its largest increase. McCaffrey et al. (2004) provide a detailed algorithm for estimating propensity scores using GBM.
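A stripped-down version of this boosting update might look as follows. This is only a sketch of the idea, not the full algorithm of McCaffrey et al. (2004): it uses least-squares stumps fitted to the residuals \(\gamma_i\) together with a shrinkage factor of our own choosing.

```python
import numpy as np

def fit_reg_stump(X, r):
    """Least-squares stump for residuals r: best (feature, threshold, leaf means)."""
    best_sse, best = np.inf, None
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], np.linspace(0.1, 0.9, 9)):
            mask = X[:, j] <= t
            if mask.all() or not mask.any():
                continue
            ml, mr = r[mask].mean(), r[~mask].mean()
            sse = np.sum((r[mask] - ml) ** 2) + np.sum((r[~mask] - mr) ** 2)
            if sse < best_sse:
                best_sse, best = sse, (j, t, ml, mr)
    return best

def gbm_ps(X, z, n_iter=100, nu=0.1):
    """Boost the log odds g: at each step, fit a stump to the gradient
    residuals gamma_i = z_i - 1/(1 + exp(-g(x_i))) and take a shrunken
    step of size nu in that direction."""
    g = np.zeros(len(z))
    for _ in range(n_iter):
        gamma = z - 1.0 / (1.0 + np.exp(-g))
        j, t, ml, mr = fit_reg_stump(X, gamma)
        g += nu * np.where(X[:, j] <= t, ml, mr)
    return 1.0 / (1.0 + np.exp(-g))   # estimated propensity scores
```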
An SVM is also a nonparametric classification method. The objective of SVM is to find the optimal hyperplane that maximizes the margin between two classes (Cortes and Vapnik 1995). If we recode \(Z\) as \(-1\) and \(1\) (\(\tilde{Z} =2Z-1\)), the optimization problem is equivalent to the maximization of the following objective function:
$$\begin{aligned} L_{D}=\sum _{i=1}^n \alpha _i-\frac{1}{2}\sum _{i=1}^{n}\sum _{i'=1}^{n} \alpha _i \alpha _{i'} \tilde{Z}_i \tilde{Z}_{i'} K(\mathbf{X}_{i}, \mathbf{X}_{i'}), \end{aligned}$$
subject to \(\alpha _i\ge 0\) for \(i=1,\dots ,n\) and \(\sum _{i=1}^n \alpha _i \tilde{Z}_i=0\). Here \(K\) is the kernel function that implicitly maps the original data to a higher-dimensional space; the simplest choice is the linear kernel \( K(\mathbf{X}_{i}, \mathbf{X}_{i'})=\mathbf{X}_{i}\cdot \mathbf{X}_{i'}\). The above formulation applies when misclassification in the training data is not allowed. Cortes and Vapnik (1995) extended SVM to allow misclassification in the training data, and the objective function becomes:
$$\begin{aligned} L_{D}=\sum _{i=1}^n \alpha _i-\frac{1}{2}\sum _{i=1}^{n}\sum _{i'=1}^{n} \alpha _i \alpha _{i'} \tilde{Z}_i \tilde{Z}_{i'} \left( K(\mathbf{X}_{i}, \mathbf{X}_{i'})+\frac{1}{C}\delta _{ii'}\right) , \end{aligned}$$
where \(\delta _{ii'}=1\) if \(i=i'\) and 0 otherwise, and \(C > 0\) is a user-selected constant allowing for misclassifications. SVM does not estimate propensity scores directly; instead, they can be obtained by fitting a logistic regression model to \((\mathbf{X}_i,\hat{Z}_i), i=1,\dots ,n\), where \(\hat{Z}_i\) is the class label estimated by SVM.

One of the simplest nonparametric classification algorithms is KNN. It works as follows: for a testing data point, find the \(K\) nearest points in the training set according to some distance measure, e.g. Euclidean distance; the testing data point is then assigned the majority class among these \(K\) points. In the first stage of causal inference, the propensity score for a testing data point can be estimated as the proportion of its \(K\) nearest neighbors that belong to class 1.
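The KNN propensity estimate for a single test point takes only a few lines (the function name is ours):

```python
import numpy as np

def knn_ps(X, z, x_new, k=10):
    """KNN propensity estimate for one test point: the proportion of the
    k nearest training points (Euclidean distance) that are treated."""
    d = np.sqrt(((X - x_new) ** 2).sum(axis=1))   # distances to training points
    nearest = np.argsort(d)[:k]                   # indices of the k closest
    return z[nearest].mean()
```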

4 A model-averaging approach

4.1 The proposed method: combining logistic regression with nonparametric machine learning methods

As described in the previous section, there are many models for estimating propensity scores, from both parametric and nonparametric perspectives. It is understood that there is no uniformly “best” algorithm for all data sets, and we can always employ some model selection criterion to select the best one for a particular data set. However, doing so ignores the randomness and uncertainty in the model selection procedure. In the literature, model-combining/model-averaging techniques are often used to account for this uncertainty. Examples are Hoeting et al. (1999), Yang (2001), Yuan and Yang (2005) and Yuan and Ghosh (2008).

We now develop ways to combine parametric models with nonparametric models. Let \(\hat{e}_1(\mathbf{X})\) be the estimate of the propensity score from a LR model and \(\hat{e}_2(\mathbf{X})\) be the estimate of the propensity score from a nonparametric algorithm, such as a random forests model or generalized boosted model (GBM). We denote the proposed estimator as:
$$\begin{aligned} \hat{e}(\mathbf{X}, \lambda )=\lambda \hat{e}_1(\mathbf{X})+(1-\lambda )\hat{e}_2(\mathbf{X}), \end{aligned}$$
(7)
where \(\hat{e}(\mathbf{X}, \lambda )\) is a weighted average of the logistic regression estimator and the nonparametric estimator, with weights \(\lambda \) and \(1-\lambda \). Intuitively, if \(\hat{e}_1(\mathbf{X})\) is more accurate, we would like to give more weight to logistic regression, and if \(\hat{e}_2(\mathbf{X})\) is more accurate, we would like to give more weight to the nonparametric model. Therefore, given \((Y_i,\mathbf{X}_i,Z_i), i=1,\ldots ,n\), we propose the following data-based weights:
$$\begin{aligned} \hat{\lambda }_i=\frac{\hat{e}_1 (\mathbf{X}_i)^{Z_i}[1-\hat{e}_1 (\mathbf{X}_i)]^{(1-Z_i)}}{\hat{e}_1 (\mathbf{X}_i)^{Z_i}[1-\hat{e}_1 (\mathbf{X}_i)]^{(1-Z_i)}+\hat{e}_2 (\mathbf{X}_i)^{Z_i}[1-\hat{e}_2 (\mathbf{X}_i)]^{(1-Z_i)}}, \end{aligned}$$
(8)
for \(i=1, 2, \dots , n\). That is, we use the estimated Bernoulli likelihood at each sample point as its weight. Essentially, the choice of weights in (8) reflects the following principle: if \(Z_i=1\), \(\hat{e}(\mathbf{X}_i, \hat{\lambda }_i)\) gives more weight to the larger of \(\hat{e}_1(\mathbf{X}_i)\) and \(\hat{e}_2(\mathbf{X}_i)\); on the other hand, if \(Z_i=0\), it gives more weight to the smaller of the two. Since the weights are data-adaptive and subjects with the same \(Z\) and \(\mathbf{X}\) receive the same \(\hat{e}(\mathbf{X}, \hat{\lambda })\), we call the proposed estimator a data-adaptive matching score (DAMS).
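Once \(\hat{e}_1\) and \(\hat{e}_2\) are available, the DAMS combination in (7)-(8) reduces to a few array operations; a minimal sketch (the function name is ours):

```python
import numpy as np

def dams(e1, e2, z):
    """Data-adaptive matching score, Eqs. (7)-(8): weight each model by
    its estimated Bernoulli likelihood at the sample point."""
    l1 = np.where(z == 1, e1, 1 - e1)   # likelihood under the parametric fit
    l2 = np.where(z == 1, e2, 1 - e2)   # likelihood under the nonparametric fit
    lam = l1 / (l1 + l2)                # Eq. (8)
    return lam * e1 + (1 - lam) * e2    # Eq. (7)
```

For a treated subject with \(\hat{e}_1=0.8\) and \(\hat{e}_2=0.6\), the weight on the parametric fit is \(0.8/1.4\) and the combined score is \(5/7\approx 0.714\), closer to the larger of the two, as the principle above dictates.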

After comparing various nonparametric techniques, we recommend that \(e_2(\mathbf{X})\) be estimated by the random forests model or GBM. The reason can be seen clearly from our simulation results, which will be presented in Sect. 5.

4.2 Consistency of the proposed estimator

In this section, we prove some results regarding consistency of the proposed estimator in (7) when \(e_1(\mathbf{X})\) is estimated by LR and \(e_2(\mathbf{X})\) is estimated by GBM. We first show the consistency of GBM (i.e., \(\hat{e}_2(\mathbf{X})\)). Most of the proof follows Zhang and Yu (2005). Then, we show that the LR estimator (\(\hat{e}_1(\mathbf{X})\)) is still consistent even when there is a misspecification of the parametric model. The assumptions and proof are given by White (1982). In the end, we show the consistency of the proposed estimator which is a weighted average of the two.

Lemma 4.1

GBM is one form of greedy boosting defined in Algorithm 2.1 of Zhang and Yu (2005).

Proof

Following the previous notation, let \(Z=0\) or 1, \(Z^{*}=2Z-1\), and \(e(\mathbf{X})=P(Z=1|\mathbf{X})=P(Z^{*}=1|\mathbf{X})\). As described by McCaffrey et al. (2004), GBM aims to minimize
$$\begin{aligned} -E l(Z, e(\mathbf{X}))&= -E\{ Z \text {log}\ e(\mathbf{X})+(1-Z) \text {log} (1-e(\mathbf{X}))|\mathbf{X}\}\\&= E\{\text {log}(1+e^{- f(\mathbf{X})Z^{*}})|\mathbf{X}\}, \end{aligned}$$
where \(f(\mathbf{X}) = \log \{ e(\mathbf{X})/{(1-e(\mathbf{X}))} \}\). Define \(\phi (f,z)=\text {log}(1+e^{-fz})\) and
$$\begin{aligned} A(f)=E_{X,Z^{*}}\phi (f(\mathbf{X}),Z^{*}). \end{aligned}$$
(9)
Equivalently, GBM aims to find a solution to the following optimization problem
$$\begin{aligned} \text {inf}_{\begin{array}{c} f\in span(S) \end{array}} A(f) \end{aligned}$$
where \(span(S)=\{\sum _{j=1}^{m} w^jf^j: f^j\in S, w^j\in R, m \in Z^{+}\}\) is a linear function space. For GBM, \(f\) is fitted by an additive model with basis functions consisting of regression trees. Therefore, \(S=\{I_{(-\infty ,a_1]\times \dots \times (-\infty ,a_p]}:a_1,\dots ,a_p\in R\}\), the set of indicators of rectangular regions in the feature space. More specifically, given \(f_{k}\) at the \(k\)th step, the algorithm aims to find \(\alpha \in R\) and \(h(\mathbf{X})\in S\) such that \(f_{k+1}=f_k+\alpha h(\mathbf{X})\) approximately minimizes \(A(f)\). Based on the definition of Algorithm 2.1 of Zhang and Yu (2005), GBM is one form of greedy boosting.\(\square \)

Lemma 4.2

(a) The function \(A(f)\) defined in (9) is convex and differentiable. (b) \(A(f)\) is second-order differentiable and, for GBM, the second-order derivative satisfies \(A^{''}_{f,g}(0)\le 1\), where \(A_{f,g}(h)=A(f+hg)\). (c) \(\phi (f, z)\) in Lemma 4.1 satisfies the Lipschitz condition.

Proof

Using the definition of functional derivative, we have
$$\begin{aligned} A^{'}_{f,g}(0)=\lim _{h\rightarrow 0}\frac{1}{h}[A(f+hg)-A(f)]=-E_{X,Z^{*}}\frac{e^{-f(\mathbf{X})Z^{*}}Z^{*}}{1+e^{-f(\mathbf{X})Z^{*}}}g(\mathbf{X}) \end{aligned}$$
and
$$\begin{aligned} A^{''}_{f,g}(0)&= \lim _{h\rightarrow 0} \frac{1}{h^2} [A(f+hg)-2A(f)+A(f-hg)]\\&= E_{X,Z^{*}} \frac{(Z^{*})^2}{(1+e^{f(\mathbf{X})Z^{*}})(1+e^{-f(\mathbf{X})Z^{*}})}g^2(\mathbf{X}) \end{aligned}$$
For GBM, we have \(|g|=|({Z^{*}+1})/{2}-e(\mathbf{X})|\le 2\), \((Z^{*})^2=1\), and \((1+e^{f(\mathbf{X})Z^{*}})(1+e^{-f(\mathbf{X})Z^{*}})=2+e^{f(\mathbf{X})Z^{*}}+e^{-f(\mathbf{X})Z^{*}}\ge 4\). Therefore, \(0\le A^{''}_{f,g}(0)\le 1\); since the second-order derivative is nonnegative, \(A(f)\) is convex. Let \(f^{'}=fz\). Then, through some simple calculation, we can show that \(\phi (f,z)=\phi (f')\) satisfies the Lipschitz condition.
Given the dataset \(D_1^n = \{(X_1,Z_1^{*}),\dots , (X_n,Z_n^{*})\}\), we follow the definitions in Zhang and Yu (2005): let \(\hat{A}(f)=n^{-1} \sum _{i=1}^n\phi (f(\mathbf{X}_i), Z_i^{*})\) be the empirical risk. Define \(R_n(G,D_1^n)\) as the Rademacher complexity of \(G\):
$$\begin{aligned} R_n(G,D_1^n)=E_{\sigma } \sup _{g\in G} \frac{1}{n} \sum _{i=1}^{n} \sigma _i g(X_i, Z_i^{*}), \end{aligned}$$
where \(\sigma _i=\pm 1\) with probability \(\frac{1}{2}\).

Proposition 4.1

Assume the following:
  1. (A1)

    There exists a unique \(f^{*}\) such that \(A(f^{*})=\text {inf}_{\begin{array}{c} f\in span(S) \end{array}} A(f)\).

     
  2. (A2)

    For any sequence \(f_m, A(f_m)\overset{p}{\rightarrow } A(f^{*})\) implies \(f_m \overset{p}{\rightarrow } f^{*}\).

     
  3. (A3)

Consider two sample-independent sequences of numbers \(k_n\) and \(\beta _n\) such that \(\lim _{n\rightarrow \infty } k_n=\infty \) and \(\lim _{n\rightarrow \infty } \beta _n R_n(S)=0\), where \(R_n(S)=E_{D_1^n} R_n(S,D_1^n)\) is the expected Rademacher complexity for GBM. We assume the algorithm (GBM) stops at step \(\hat{k}\) such that \(\hat{k} \le k_n\) and \(||\hat{f}_{\hat{k}}||_1 \le \beta _n\).

     
Then, we have
$$\begin{aligned} \hat{e}_{\hat{k}}(\mathbf{X}) \overset{p}{\rightarrow } e^{*}(\mathbf{X}), \quad as \ n\rightarrow \infty , \end{aligned}$$
(10)
where \(e^{*}(\mathbf{X})=({1+\text {exp}\{-f^{*}\}})^{-1}\).

Proof

Based on Lemmas 4.1 and 4.2, GBM satisfies Assumptions 3.1 and 3.2 in Zhang and Yu (2005). In addition, since for tree-based classifiers we have \(R_n(S)\le \tilde{C}(d/n)^{1/2} \rightarrow 0\), where \(d\) is the Vapnik–Chervonenkis dimension and \(\tilde{C}\) is a constant, we can always find \(k_n\) and \(\beta _n\), both of order \(o(n^{\frac{1}{2}})\), such that \(\lim _{n\rightarrow \infty } k_n=\infty \) and \(\lim _{n\rightarrow \infty } \beta _n R_n(S)=0\). According to Theorem 3.1 in Zhang and Yu (2005), as long as we stop GBM at step \(\hat{k}\) such that \(\hat{k}\le k_n\) and \(||\hat{f}_{\hat{k}}||_1\le \beta _n\), we have
$$\begin{aligned} \lim _{n\rightarrow \infty } E_{D_1^n} A(\hat{f}_{\hat{k}})= A(f^{*}) \end{aligned}$$
(11)
From (11), we have the \(L_1\) convergence:
$$\begin{aligned} \lim _{n\rightarrow \infty } E_{D_1^n}|A(\hat{f}_{\hat{k}})-A(f^{*})|=0. \end{aligned}$$
Consequently, we have that \(A(\hat{f}_{\hat{k}})\overset{p}{\rightarrow } A(f^{*}), \quad as \ n\rightarrow \infty \). This implies that \(\hat{f}_{\hat{k}}(\mathbf{X}) \overset{p}{\rightarrow } f^{*}(\mathbf{X}), \quad as \ n\rightarrow \infty \), which implies \(\hat{e}_{\hat{k}}(\mathbf{X}) \overset{p}{\rightarrow } e^{*}(\mathbf{X}), \quad as \ n\rightarrow \infty \), the desired result.
Next, we show the consistency of the LR estimator of propensity scores. The consistency holds whether or not the parametric model is the true underlying distribution. Define \(e_\beta (\mathbf{X})=(1+\text {exp}\{-\mathbf{X}^T\beta \})^{-1}, \beta \in R^p\), and \(f_\beta (\mathbf{X})=\text {log} \{e_{\beta }(\mathbf{X})/(1-e_{\beta }(\mathbf{X}))\}\). The method of maximum likelihood aims to solve
$$\begin{aligned} \text {inf}_{\begin{array}{c} f\in span(S') \end{array}} A(f) \end{aligned}$$
where \(S'=\{\mathbf{X}^T \beta : \beta \in R^p\}\).\(\square \)

Proposition 4.2

Assume the following assumptions from White (1982):
  1. (B1)

    Conditional on \(\mathbf{X}_i\)\((i=1,\ldots ,n)\), the \(Z_i\) have a joint distribution \(G\) on a parameter space \(\Omega \), with a Radon–Nikodým density \(g = dG/d\nu \) with respect to a dominating measure \(\nu \).

     
  2. (B2)
    The parameter \(\beta \) is in a compact subset \(B \subset R^p\). The logistic likelihood
    $$\begin{aligned} f(z,\beta |\mathbf{x}) = e_{\beta } (\mathbf{x})^{z}[1-e_{\beta } (\mathbf{x})]^{(1-z)}, \end{aligned}$$
    is measurable in \(z\) for every \(\beta \) in \(B\). Furthermore, it is continuous in \(\beta \) for every \(z\).
     
  3. (B3)

    \(E\{\log g(Z_i)\}\) exists and \( | \log f(z,\beta |\mathbf{x})| \le m(z)\) for all \(\beta \in B\), where \(m\) is integrable with respect to \(G\).

     
  4. (B4)

    Define \(I(g,f|\beta ) \equiv \int \log g(u|\mathbf{x})dG(u) - \int \log f(u,\beta |\mathbf{x}) dG(u)\) as the Kullback–Leibler distance between \(g\) and \(f\). Assume that \(I(g,f|\beta )\) has a unique minimum at \(\beta _0\).

     
Denote \(\hat{\beta }\) as the maximum likelihood estimator of \(\beta \) in LR and \(\hat{e}_\beta (\mathbf{X})=(1+\text {exp}\{-\mathbf{X}^T\hat{\beta }\})^{-1}\). Under the Assumptions (B1)–(B4), White (1982) shows that
$$\begin{aligned} \hat{e}_\beta (\mathbf{X})\overset{p}{\rightarrow } e_{\beta _0}(\mathbf{X}),\quad as \ n\rightarrow \infty \end{aligned}$$
where \(e_{\beta _0}(\mathbf{X})=(1+\text {exp}\{-\mathbf{X}^T\beta _0\})^{-1}\) and \(\beta _0\) is the so-called least false parameter, in the sense that it minimizes the Kullback–Leibler distance between the true model and the parametric model.

Theorem 4.1

Under the conditions listed in Proposition 4.1 and 4.2, the proposed estimator in (7) is consistent if \(f^{*}(\mathbf{X}),f_{\beta _0}(\mathbf{X}) \in span(S)\cap span(S')\).

Proof

If \(f^{*}(\mathbf{X}),f_{\beta _0}(\mathbf{X}) \in span(S)\cap span(S')\), then both \(\hat{e}_{\hat{k}}(\mathbf{X})\) and \(\hat{e}_{\beta } (\mathbf{X})\) converge to \(e^{*}(\mathbf{X})=e_{\beta _0}(\mathbf{X})\) as \(n\rightarrow \infty \). Following the notation in Propositions 4.1 and 4.2, the proposed estimator in (7) can be rewritten as:
$$\begin{aligned} \hat{e}(\mathbf{X}, \hat{\lambda })=\hat{\lambda }\hat{e}_{\hat{k}}(\mathbf{X})+(1-\hat{\lambda }) \hat{e}_{\beta }(\mathbf{X}), \end{aligned}$$
where \(\hat{\lambda }=\frac{\hat{e}_{\hat{k}}(\mathbf{X}) ^{Z}[1-\hat{e}_{\hat{k}}(\mathbf{X})]^{(1-Z)}}{\hat{e}_{\hat{k}}(\mathbf{X}) ^{Z}[1-\hat{e}_{\hat{k}}(\mathbf{X})]^{(1-Z)}+\hat{e}_{\beta } (\mathbf{X})^{Z}[1-\hat{e}_{\beta }(\mathbf{X})]^{(1-Z)}}\). When \(Z=1\),
$$\begin{aligned} \hat{e}(\mathbf{X}, \hat{\lambda })=\frac{\hat{e}_{\hat{k}}^2(\mathbf{X})+ \hat{e}_{\beta }^2 (\mathbf{X})}{\hat{e}_{\hat{k}}(\mathbf{X})+ \hat{e}_{\beta }(\mathbf{X})}\overset{p}{\rightarrow } e^{*}(\mathbf{X}) \end{aligned}$$
as \(n\rightarrow \infty \). When \(Z=0\),
$$\begin{aligned} \hat{e}(\mathbf{X}, \hat{\lambda })=\frac{(1-\hat{e}_{\hat{k}}(\mathbf{X}))\hat{e}_{\hat{k}}(\mathbf{X})+(1-\hat{e}_{\beta }(\mathbf{X}))\hat{e}_{\beta }(\mathbf{X})}{2-\hat{e}_{\hat{k}}(\mathbf{X})-\hat{e}_{\beta }(\mathbf{X})}\overset{p}{\rightarrow } e^{*}(\mathbf{X}) \end{aligned}$$
as \(n\rightarrow \infty \). Therefore, the proposed estimator \(\hat{e}(\mathbf{X}, \hat{\lambda })\) is consistent if \(e_1(\mathbf{X})\) is estimated by LR and \(e_2(\mathbf{X})\) is estimated by GBM.

Furthermore, if the true log odds of the propensity score can be approximated arbitrarily closely by functions lying in the intersection of \(span(S)\) and \(span(S')\), the proposed estimator converges to the true propensity score. Consequently, \(\widehat{ACE}\) in (4) and \(\widehat{ACET}\) in (5) are consistent estimators of ACE and ACET.\(\square \)
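The \(\hat{\lambda }\)-weighting used in the proof can be sketched in a few lines. This is a hedged sketch: `e_np` and `e_lr` stand for generic nonparametric and logistic-regression propensity score estimates, not the paper's exact implementation.

```python
import numpy as np

def combined_ps(e_np, e_lr, z):
    """Likelihood-weighted combination of a nonparametric propensity score
    estimate (e_np) and a logistic-regression estimate (e_lr); lambda follows
    the expression in the proof (a sketch of estimator (7))."""
    like_np = np.where(z == 1, e_np, 1.0 - e_np)  # each model's fitted likelihood
    like_lr = np.where(z == 1, e_lr, 1.0 - e_lr)
    lam = like_np / (like_np + like_lr)           # weight on the nonparametric score
    return lam * e_np + (1.0 - lam) * e_lr

# If the two estimates agree, the combination equals them; when one model
# assigns a treated subject a near-zero score, that model receives near-zero
# weight, so the extreme value is shrunk toward the other estimate.
e_np = np.array([0.30, 0.001])
e_lr = np.array([0.30, 0.250])
z = np.array([1, 1])
out = combined_ps(e_np, e_lr, z)
print(out)  # second entry pulled toward 0.25
```

For treated subjects this reproduces the closed form \((\hat{e}_{\hat{k}}^2+\hat{e}_{\beta }^2)/(\hat{e}_{\hat{k}}+\hat{e}_{\beta })\) displayed in the proof.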

4.3 The sandwich variance estimator

Since both \(\widehat{ACE}\) in (4) and \(\widehat{ACET}\) in (5) can be written as solutions to estimating equations, we use the sandwich formula from the theory of M-estimation (Stefanski and Boos 2002) to obtain the variance of the estimated treatment effects. Let \(\hat{e}_i\) be the proposed estimator for subject \(i\). Denoting \(\hat{\mu }_{1,ACE}=\frac{\sum _{i=1}^n Y_i Z_i/\hat{e}_i}{\sum _{i=1}^n Z_i/\hat{e}_i}\) as an estimator of \(E[Y(1)]\) and \(\hat{\mu }_{0,ACE} = \frac{\sum _{i=1}^n Y_i(1- Z_i)/(1-\hat{e}_i)}{\sum _{i=1}^n (1-Z_i)/(1-\hat{e}_i)}\) as an estimator of \(E[Y(0)]\), we have
$$\begin{aligned} \widehat{Var}(\widehat{ACE})=\sum _{i=1}^n \frac{Z_i(Y_i-\hat{\mu }_{1,ACE})^2}{\hat{e}_i^2} /\left( \sum _{i=1}^n \frac{Z_i}{\hat{e}_i}\right) ^2+\sum _{i=1}^n \frac{(1-Z_i)(Y_i-\hat{\mu }_{0,ACE})^2}{(1-\hat{e}_i)^2} /\left( \sum _{i=1}^n \frac{1-Z_i}{1-\hat{e}_i}\right) ^2. \end{aligned}$$
Denoting \(\hat{\mu }_{1,ACET}=\frac{\sum _{i=1}^n Y_i Z_i}{\sum _{i=1}^n Z_i}\) as an estimator of \(E[Y(1)|Z=1]\) and \(\hat{\mu }_{0,ACET}=\frac{\sum _{i=1}^n Y_i(1- Z_i)\hat{e}_i/(1-\hat{e}_i)}{\sum _{i=1}^n (1-Z_i)\hat{e}_i/(1-\hat{e}_i)}\) as an estimator of \(E[Y(0)|Z=1]\), we have
$$\begin{aligned} \widehat{Var}(\widehat{ACET})=\sum _{i=1}^n \frac{Z_i(Y_i-\hat{\mu }_{1,ACET})^2}{(\sum _{i=1}^n Z_i)^2}+\sum _{i=1}^n \frac{(1-Z_i)(Y_i-\hat{\mu }_{0,ACET})^2\hat{e}_i^2}{(1-\hat{e}_i)^2} /\left( \sum _{i=1}^n \frac{(1-Z_i)\hat{e}_i}{1-\hat{e}_i}\right) ^2. \end{aligned}$$
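The two point estimators and their sandwich variances translate directly into code. The sketch below is a minimal numpy rendering of the IPW estimators and the two variance expressions above, checked on a toy randomized design (the data-generating step is illustrative only):

```python
import numpy as np

def ipw_ace(y, z, e):
    """IPW estimate of ACE and the sandwich variance displayed above (a sketch)."""
    w1, w0 = z / e, (1 - z) / (1 - e)
    mu1 = np.sum(w1 * y) / np.sum(w1)
    mu0 = np.sum(w0 * y) / np.sum(w0)
    var = (np.sum(w1 ** 2 * (y - mu1) ** 2) / np.sum(w1) ** 2
           + np.sum(w0 ** 2 * (y - mu0) ** 2) / np.sum(w0) ** 2)
    return mu1 - mu0, var

def ipw_acet(y, z, e):
    """IPW estimate of ACET: controls are reweighted by e/(1-e)."""
    mu1 = np.sum(z * y) / np.sum(z)
    w0 = (1 - z) * e / (1 - e)
    mu0 = np.sum(w0 * y) / np.sum(w0)
    var = (np.sum(z * (y - mu1) ** 2) / np.sum(z) ** 2
           + np.sum(w0 ** 2 * (y - mu0) ** 2) / np.sum(w0) ** 2)
    return mu1 - mu0, var

# Sanity check on a randomized design (true effect 1.0, e = 0.5 for everyone):
rng = np.random.default_rng(0)
n = 50_000
z = rng.binomial(1, 0.5, n)
y = 1.0 * z + rng.normal(size=n)
e = np.full(n, 0.5)
ace, v_ace = ipw_ace(y, z, e)
acet, v_acet = ipw_acet(y, z, e)
print(ace, acet)
```

Under randomization ACE and ACET coincide, so both estimates should be close to the true effect of 1.0.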

5 Simulation studies

5.1 Methodology comparison

To examine the performance of our proposed method, we conducted extensive simulation studies. In the first set of simulations, we compared the proposed combined estimators with different parametric and nonparametric algorithms for estimating the propensity score, following a slightly modified version of the simulation structure in Lee et al. (2010) and Setoguchi et al. (2008). Let \(\mathbf{X}\) denote the vector of covariates; it is an 11-dimensional vector with \(X_0\) being the intercept term, \((X_1, X_2, X_3, X_4)\) being confounders, \((X_5, X_6, X_7)\) related only to the treatment assignment, and \((X_8, X_9, X_{10})\) related only to the potential outcomes. In the simulation, \(X_1-X_{10}\) are first generated from \(MVN(0, \varSigma )\), where \(\varSigma \) is a non-identity covariance matrix. Then, \(X_1, X_3, X_6, X_8, X_9\) are dichotomized into 0–1 variables. The treatment indicator \(Z\) is generated from a Bernoulli distribution with success probability \(p\) a function of the covariates. We use the same seven LR models (Scenarios (A)–(G)) to generate the treatment indicator as in Lee et al. (2010). While the model in Scenario (A) has main effects only, the model in Scenario (G) has main effects, ten two-way interaction terms and three quadratic terms.

The response Y is then generated by:
$$\begin{aligned} Y=\alpha ^T X+\gamma Z+\epsilon , \epsilon \sim N(0,\sigma ^2), \end{aligned}$$
with \(\gamma =-0.4, \sigma =0.1\) (25 % of the effect of exposure) and
$$\begin{aligned} \alpha =(-3.85, 0.3, -0.36, -0.73, -0.2,0,0,0,0.71, -0.19, 0.26)^T. \end{aligned}$$
For our simulation study, we generated \(n=1000\) sample points each time and repeated the procedure \(N=1000\) times.
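A minimal version of this data-generating process can be sketched as follows. The covariance structure and the treatment-model coefficients below are assumptions for illustration (we do not reproduce the exact values of Lee et al. 2010 / Setoguchi et al. 2008); only the \(\alpha\) vector and the dichotomization pattern come from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n=1000, gamma=-0.4, sigma=0.1):
    """One simulated data set in the spirit of Sect. 5.1 (illustrative only)."""
    # Correlated covariates X1-X10 (assumed exchangeable correlation 0.2),
    # then dichotomize X1, X3, X6, X8, X9 as in the text
    S = 0.2 * np.ones((10, 10)) + 0.8 * np.eye(10)
    X = rng.multivariate_normal(np.zeros(10), S, size=n)
    for j in (0, 2, 5, 7, 8):               # 0-based indices of X1, X3, X6, X8, X9
        X[:, j] = (X[:, j] > 0).astype(float)
    # Treatment: a main-effects logistic model (Scenario (A) style); the
    # coefficients here are assumed for illustration
    b = np.array([0.8, -0.25, 0.6, -0.4, -0.8, -0.5, 0.7, 0.0, 0.0, 0.0])
    p = 1.0 / (1.0 + np.exp(-X @ b))
    z = rng.binomial(1, p)
    # Outcome: linear model with the alpha vector given in the text
    alpha = np.array([-3.85, 0.3, -0.36, -0.73, -0.2, 0, 0, 0, 0.71, -0.19, 0.26])
    y = alpha[0] + X @ alpha[1:] + gamma * z + rng.normal(0, sigma, n)
    return X, z, y
```

Note that \(X_5\)–\(X_7\) receive nonzero treatment-model coefficients but zero outcome coefficients (instruments), while \(X_8\)–\(X_{10}\) appear only in the outcome model, matching the covariate roles described above.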

In the first stage, we compared both parametric and nonparametric estimation algorithms: two parametric choices, LR and LDA, and seven nonparametric choices, CART, pruned CART, bagged CART, RF, GBM, SVM and KNN. We include all ten covariates as predictors and consider only main effects when fitting LR since, in practice, researchers do not know the true propensity score model and it is natural to assume a linear and additive model. In the second stage, we estimate both ACE and ACET by IPW; the formulas are given in (4) and (5).

5.2 Results

In our simulation study, we mainly look at the absolute bias (the absolute value of the bias as a percentage of \(-\)0.4), the standard error and the 95 % confidence interval coverage of the estimated causal effect from the various methods. In addition, we report the average standardized absolute mean difference (ASAM), calculated as follows: for each covariate, compute the absolute difference (\(d_j\)) in weighted means between the treatment and control groups after applying the weights; then divide \(d_j\) by the standard deviation of the covariate in the treatment group and average the \(d_j\) over all covariates (McCaffrey et al. 2004). We also report the maximum standardized absolute mean difference (\(SAM_{max}\)), i.e., the maximum of \(d_j\) over all covariates. Tables 1 and 2 show the performance metrics for estimating ACE and ACET, respectively.
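The balance statistics can be computed directly from the weights. The sketch below is a hedged rendering of the ASAM and \(SAM_{max}\) definitions above (`w` holds the inverse probability weights):

```python
import numpy as np

def balance_stats(X, z, w):
    """ASAM and SAM_max: for each covariate, the absolute difference in
    weighted means between groups, standardized by the treated-group SD,
    then averaged (ASAM) or maximized (SAM_max) over covariates — a sketch
    of the definition attributed to McCaffrey et al. (2004) in the text."""
    mt = np.average(X[z == 1], axis=0, weights=w[z == 1])
    mc = np.average(X[z == 0], axis=0, weights=w[z == 0])
    d = np.abs(mt - mc) / X[z == 1].std(axis=0, ddof=1)
    return d.mean(), d.max()

# Identical treated and control samples with unit weights balance exactly:
A = np.arange(12.0).reshape(4, 3)
X = np.vstack([A, A])
z = np.array([1, 1, 1, 1, 0, 0, 0, 0])
asam, sam_max = balance_stats(X, z, np.ones(8))
print(asam, sam_max)  # 0.0 0.0
```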
Table 1
Simulation results for average causal effect (ACE)

Method        A            B            C            D            E            F            G

ASAM (\(SAM_{max}\))
Logistic      0.02 (0.05)  0.02 (0.05)  0.02 (0.04)  0.03 (0.07)  0.03 (0.06)  0.03 (0.06)  0.03 (0.07)
LDA           0.02 (0.05)  0.02 (0.05)  0.02 (0.04)  0.03 (0.07)  0.03 (0.06)  0.03 (0.06)  0.03 (0.07)
CART          0.15 (0.31)  0.15 (0.31)  0.13 (0.30)  0.17 (0.36)  0.17 (0.35)  0.15 (0.34)  0.14 (0.33)
PRUNE         0.17 (0.33)  0.17 (0.33)  0.14 (0.31)  0.18 (0.37)  0.18 (0.37)  0.16 (0.35)  0.15 (0.35)
BAG           0.14 (0.25)  0.14 (0.25)  0.13 (0.25)  0.17 (0.28)  0.17 (0.28)  0.15 (0.26)  0.14 (0.26)
RF            0.06 (0.13)  0.06 (0.14)  0.06 (0.14)  0.07 (0.16)  0.08 (0.16)  0.06 (0.14)  0.06 (0.15)
GBM           0.08 (0.15)  0.09 (0.16)  0.08 (0.15)  0.10 (0.17)  0.10 (0.17)  0.09 (0.16)  0.10 (0.17)
SVM           0.11 (0.26)  0.11 (0.25)  0.07 (0.16)  0.12 (0.24)  0.12 (0.23)  0.12 (0.21)  0.09 (0.17)
KNN, k = 3    0.25 (0.52)  0.24 (0.52)  0.20 (0.35)  0.27 (0.49)  0.27 (0.49)  0.27 (0.44)  0.22 (0.40)
KNN, k = 10   0.08 (0.19)  0.08 (0.19)  0.07 (0.16)  0.10 (0.22)  0.10 (0.22)  0.10 (0.22)  0.08 (0.20)
DAMS1         0.05 (0.10)  0.05 (0.11)  0.06 (0.10)  0.06 (0.12)  0.07 (0.12)  0.06 (0.10)  0.06 (0.11)
DAMS2         0.06 (0.12)  0.07 (0.13)  0.08 (0.14)  0.08 (0.13)  0.08 (0.14)  0.07 (0.12)  0.09 (0.14)

Absolute bias (%)
Logistic      6.36   6.14   5.16   7.17   6.94   7.41   8.58
LDA           6.34   6.19   5.22   7.21   6.95   7.52   8.65
CART          18.11  15.45  13.43  16.76  15.71  19.65  16.70
PRUNE         22.58  18.85  14.07  18.92  17.69  20.32  18.29
BAG           14.06  12.98  11.55  14.87  14.02  15.06  12.93
RF            7.00   7.39   7.38   7.40   7.60   7.58   8.18
GBM           9.58   9.96   10.07  10.99  11.56  10.68  12.19
SVM           17.25  16.42  11.83  17.13  17.71  17.46  13.76
KNN, k = 3    38.00  34.62  29.78  40.65  35.70  40.31  31.13
KNN, k = 10   9.15   8.97   8.38   9.23   9.40   9.04   9.15
DAMS1         6.78   7.04   6.92   7.22   7.91   7.31   8.09
DAMS2         8.34   9.12   10.55  9.56   10.70  9.09   11.08

Standard error
Logistic      0.061  0.060  0.057  0.065  0.063  0.063  0.060
LDA           0.061  0.060  0.057  0.065  0.063  0.063  0.060
CART          0.056  0.056  0.059  0.057  0.057  0.058  0.060
PRUNE         0.055  0.055  0.058  0.056  0.056  0.057  0.059
BAG           0.053  0.053  0.055  0.053  0.053  0.054  0.054
RF            0.059  0.060  0.062  0.062  0.061  0.061  0.061
GBM           0.055  0.055  0.056  0.056  0.056  0.056  0.056
SVM           0.053  0.053  0.054  0.055  0.054  0.055  0.055
KNN, k = 3    0.058  0.057  0.057  0.058  0.057  0.058  0.058
KNN, k = 10   0.060  0.060  0.059  0.061  0.060  0.061  0.059
DAMS1         0.057  0.056  0.054  0.058  0.057  0.058  0.055
DAMS2         0.056  0.055  0.053  0.057  0.056  0.056  0.054

95 % CI coverage
Logistic      100    100    100    99.9   100    100    100
LDA           100    100    100    99.9   100    100    99.9
CART          75.9   83.6   90.9   81.3   84.4   73.3   83.2
PRUNE         62.9   73.2   89.3   74.7   78.6   70.4   78.4
BAG           90.7   93.1   95.6   90.1   90.9   89.2   91.4
RF            99.8   99.9   99.9   99.9   99.9   99.7   99.7
GBM           98.9   98.9   98.9   98.2   97.8   98.5   97.1
SVM           87.2   86.1   95.4   87.1   86.0   86.6   92.0
KNN, k = 3    22.8   34.5   44.5   19.1   27.7   19.8   44.1
KNN, k = 10   98.5   98.9   98.7   98.6   98.8   99.1   99.2
DAMS1         100    99.9   99.9   99.7   99.6   99.8   99.7
DAMS2         99.7   99.4   98.8   99.4   99.0   99.6   97.1

LDA linear discriminant analysis, CART classification and regression trees, PRUNE pruned CART, BAG bagged CART, RF random forests, GBM generalized boosted model, SVM support vector machines, KNN k-nearest neighbors, DAMS data-adaptive matching scores

Table 2
Simulation results for average causal effect among the treated (ACET)

Method        A            B            C            D            E            F            G

ASAM (\(SAM_{max}\))
Logistic      0.04 (0.09)  0.04 (0.10)  0.05 (0.16)  0.06 (0.13)  0.06 (0.15)  0.06 (0.15)  0.08 (0.23)
LDA           0.04 (0.10)  0.04 (0.11)  0.05 (0.16)  0.06 (0.13)  0.06 (0.15)  0.07 (0.15)  0.08 (0.23)
CART          0.16 (0.32)  0.15 (0.32)  0.14 (0.32)  0.17 (0.36)  0.16 (0.35)  0.15 (0.35)  0.14 (0.34)
PRUNE         0.17 (0.34)  0.16 (0.33)  0.15 (0.32)  0.18 (0.37)  0.17 (0.37)  0.16 (0.36)  0.15 (0.36)
BAG           0.13 (0.25)  0.13 (0.24)  0.12 (0.26)  0.14 (0.28)  0.14 (0.27)  0.12 (0.25)  0.11 (0.27)
RF            0.07 (0.16)  0.07 (0.15)  0.07 (0.17)  0.08 (0.18)  0.08 (0.18)  0.07 (0.17)  0.07 (0.17)
GBM           0.07 (0.15)  0.07 (0.14)  0.07 (0.17)  0.07 (0.15)  0.07 (0.15)  0.07 (0.14)  0.07 (0.16)
SVM           0.09 (0.23)  0.08 (0.21)  0.06 (0.19)  0.09 (0.20)  0.09 (0.20)  0.09 (0.19)  0.07 (0.17)
KNN, k = 3    0.21 (0.44)  0.19 (0.41)  0.16 (0.32)  0.22 (0.42)  0.20 (0.40)  0.22 (0.40)  0.17 (0.36)
KNN, k = 10   0.09 (0.22)  0.09 (0.21)  0.08 (0.21)  0.11 (0.24)  0.10 (0.23)  0.10 (0.25)  0.09 (0.22)
DAMS1         0.05 (0.11)  0.05 (0.11)  0.05 (0.16)  0.05 (0.12)  0.05 (0.12)  0.05 (0.12)  0.05 (0.13)
DAMS2         0.06 (0.12)  0.05 (0.12)  0.06 (0.17)  0.06 (0.13)  0.06 (0.13)  0.05 (0.12)  0.06 (0.14)

Absolute bias (%)
Logistic      9.35   9.67   11.23  12.58  14.89  15.25  22.27
LDA           9.29   9.69   11.34  12.61  14.98  15.18  22.33
CART          19.08  14.70  16.29  16.94  14.94  20.85  18.59
PRUNE         23.49  17.66  16.82  18.90  16.50  21.20  19.22
BAG           11.77  9.38   10.69  10.20  8.94   10.22  10.40
RF            8.82   8.02   10.48  9.47   8.55   9.07   10.78
GBM           9.98   8.61   8.53   9.37   8.11   8.50   8.60
SVM           14.81  12.32  9.17   12.33  11.18  11.57  10.59
KNN, k = 3    31.24  22.71  18.26  29.52  21.16  27.94  16.91
KNN, k = 10   11.88  9.75   10.09  10.92  9.63   10.58  11.30
DAMS1         8.12   6.93   6.85   7.92   7.53   8.14   8.44
DAMS2         9.34   7.88   7.10   8.67   7.51   7.86   7.24

Standard error
Logistic      0.068  0.068  0.061  0.076  0.075  0.074  0.069
LDA           0.068  0.067  0.061  0.076  0.074  0.074  0.069
CART          0.060  0.060  0.068  0.062  0.062  0.062  0.067
PRUNE         0.058  0.059  0.067  0.060  0.061  0.061  0.066
BAG           0.057  0.056  0.062  0.058  0.057  0.058  0.060
RF            0.065  0.063  0.069  0.067  0.064  0.066  0.067
GBM           0.061  0.060  0.062  0.064  0.062  0.063  0.063
SVM           0.057  0.058  0.059  0.060  0.060  0.060  0.061
KNN, k = 3    0.063  0.062  0.063  0.064  0.063  0.064  0.065
KNN, k = 10   0.066  0.065  0.064  0.067  0.065  0.067  0.065
DAMS1         0.062  0.061  0.058  0.065  0.063  0.065  0.060
DAMS2         0.061  0.059  0.057  0.064  0.062  0.063  0.059

95 % CI coverage
Logistic      99.7   99.6   99.1   99.3   97.8   98.3   89.0
LDA           99.6   99.6   99.2   99.2   97.9   97.8   88.9
CART          77.9   89.6   90.0   84.9   89.4   75.6   84.4
PRUNE         64.4   80.3   88.6   78.6   84.2   74.4   81.6
BAG           95.5   98.9   97.6   96.3   98.6   97.0   97.9
RF            99.5   99.5   99.4   99.6   99.8   99.5   98.6
GBM           98.1   99.3   99.7   99.2   99.8   98.9   99.2
SVM           90.9   96.4   98.5   96.3   96.2   95.7   97.2
KNN, k = 3    49.1   72.6   84.8   55.0   77.3   61.5   87.7
KNN, k = 10   96.1   98.6   98.8   98.2   99.4   98.8   97.9
DAMS1         99.8   99.9   99.9   100    99.9   99.8   99.5
DAMS2         98.7   99.5   99.9   99.5   99.8   99.2   99.9

There are several conclusions from the simulation results. First, parametric methods tend to yield lower bias but higher variance than nonparametric methods, and both parametric methods (LR and LDA) perform reasonably well. Among the nonparametric methods we tested, random forests and GBM tend to perform best in terms of bias, variance and coverage probability. These two observations motivate our proposed method: because of the trade-off between bias and variance across parametric and nonparametric methods, it is reasonable to combine a parametric method with a nonparametric one, and because of the superior performance of random forests and GBM over the other nonparametric algorithms, we choose them as the nonparametric component in the newly proposed estimator. In the tables, DAMS1 denotes the combination of LR with random forests by the proposed method, and DAMS2 the combination of LR with GBM.

In terms of estimating ACE, the newly proposed methods (DAMS1 and DAMS2) have smaller standard errors than LR. As the propensity score model becomes more complicated, DAMS1 tends to give the smallest bias. For estimating ACET, the newly proposed method DAMS1 yields smaller bias and variance than LR and random forests in all seven scenarios; the same conclusion applies to DAMS2 when compared with LR and GBM. Meanwhile, DAMS2 outperforms DAMS1 as the true propensity score model becomes more complicated.

In observational studies, balance statistics (ASAM and \(SAM_{max}\)) are important indicators of the performance of different propensity score models. The underlying idea is that by achieving balance in the covariates (potential confounders), the bias in the treatment effect estimate due to measured covariates can be reduced (Harder et al. 2010). From the simulation results, we find that in estimating ACE, the largest ASAM value for logistic regression is only 0.03 and the largest \(SAM_{max}\) is 0.07. Since covariate balance is already achieved by LR, there is not much to gain by combining parametric with nonparametric models, though the variances of the estimates are reduced. In estimating ACET, balance is harder to achieve: in Scenario (G), for example, the \(SAM_{max}\) value is 0.22 for logistic regression. In this case, by combining different models, the bias and variance of the causal effect estimates are greatly reduced. In addition, for Scenarios (D)–(G), DAMS1 and DAMS2 achieve better balance than LR.

In our simulation study, the two-stage causal inference also fits the model structure discussed by Brookhart and van der Laan (2006). Using their notation, denote ACE or ACET by \(\psi \) and the propensity score by \(\eta \); then \(\psi \) is the parameter of interest and \(\eta \) is the nuisance parameter. The issue of choosing an optimal model to estimate propensity scores can be restated as follows: given \(K\) candidate models for estimating \(\eta \), which model is optimal? Denote the resulting estimates of \(\psi \) (ACE or ACET) from the \(K\) candidate models by \(\hat{\psi }_1(\mathbf{X}),\ldots ,\hat{\psi }_K(\mathbf{X})\), and let \(\hat{\psi }_0(\mathbf{X})\) be an approximately unbiased but highly variable estimate of \(\psi \); the model used to estimate \(\eta \) in \(\hat{\psi }_0(\mathbf{X})\) is regarded as the reference model. To account for the trade-off between bias and variance in estimating \(\psi \), the authors proposed a cross-validation criterion for selecting the optimal estimator of the nuisance parameter among the \(K\) candidate models. Letting \(X_v^0\) be the training sample and \(X_v^1\) the testing sample in the \(v\)-th iteration of the Monte-Carlo cross-validation, the criterion function is defined as:
$$\begin{aligned} C_v(k)=\frac{1}{V}\sum _{v=1}^V (\hat{\psi }_k (X_v^0)-\hat{\psi }_0(X_v^1))^2. \nonumber \end{aligned}$$
The optimal model for estimating propensity scores is then the one yielding the smallest \(C_v\) among the \(K\) models. Brookhart and van der Laan (2006) proved that the optimal model selected by the Monte-Carlo cross-validation criterion leads to the smallest mean squared error of the parameter of interest.
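The criterion itself is simple to implement. The sketch below uses generic candidate estimators (the estimator functions and toy data are illustrative, not the paper's propensity score models); it shows that \(C_v\) penalizes a biased candidate relative to an unbiased one:

```python
import numpy as np

rng = np.random.default_rng(2)

def mc_cv(y, estimators, psi0, V=25, frac=0.5):
    """Monte-Carlo cross-validation criterion C_v(k): average squared distance
    between candidate k's estimate on a training split and the reference
    estimate psi0 on the held-out split (a sketch; each estimator maps a
    data subset to a point estimate)."""
    n = len(y)
    m = int(frac * n)
    scores = dict.fromkeys(estimators, 0.0)
    for _ in range(V):
        idx = rng.permutation(n)
        train, test = y[idx[:m]], y[idx[m:]]
        ref = psi0(test)
        for k, est in estimators.items():
            scores[k] += (est(train) - ref) ** 2 / V
    return scores

# Toy illustration: an unbiased candidate beats a biased one under C_v.
y = rng.normal(1.0, 1.0, 2000)
cv = mc_cv(y, {"unbiased": np.mean, "biased": lambda s: np.mean(s) + 0.5},
           psi0=np.mean)
print(cv)  # the biased candidate has the larger criterion value
```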

We further performed a small simulation study to evaluate our proposed method according to this cross-validation criterion. Based on the same data generation procedure as in the previous simulation study, we simulate 100 data sets with sample size \(n=1000\). The Monte-Carlo cross-validation is performed \(V=25\) times with 50 % of the data used as the training set each time. The reference model is the LR model. The resulting cross-validation values are shown in Table 3.

As can be seen from the table, in all scenarios the two proposed models yield the smallest \(C_v\) values. We also tried using 66.7 % of the data as the training set and the results are very similar.
Table 3
Monte-Carlo cross validation (\(C_v\)) for average causal effect among the treated (ACET)

Method        A       B       C       D       E       F       G
Logistic      0.0104  0.0085  0.0077  0.0126  0.0117  0.0111  0.0270
LDA           0.0103  0.0085  0.0077  0.0125  0.0114  0.0111  0.0270
CART          0.0149  0.0132  0.0200  0.0196  0.0176  0.0210  0.0275
PRUNE         0.0187  0.0181  0.0232  0.0254  0.0222  0.0237  0.0276
BAG           0.0095  0.0090  0.0116  0.0130  0.0121  0.0126  0.0183
RF            0.0093  0.0083  0.0086  0.0117  0.0110  0.0114  0.0187
GBM           0.0086  0.0084  0.0091  0.0122  0.0113  0.0120  0.0176
SVM           0.0118  0.0133  0.0107  0.0179  0.0184  0.0187  0.0194
KNN, k = 3    0.0240  0.0225  0.0213  0.0357  0.0288  0.0389  0.0259
KNN, k = 10   0.0107  0.0099  0.0088  0.0140  0.0124  0.0137  0.0192
DAMS1         0.0084  0.0073  0.0059  0.0108  0.0099  0.0099  0.0163
DAMS2         0.0083  0.0077  0.0070  0.0113  0.0104  0.0107  0.0168

5.3 A further simulation study: trimming large weights

Previous literature has shown that treatment effect estimates obtained by IPW are greatly influenced by subjects who receive the treatment but have \(\hat{e}(\mathbf{X})\approx 0\) and those who receive the control but have \(\hat{e}(\mathbf{X})\approx 1\) (in both cases, the weights are extremely large). Kang and Schafer (2007) argued that even a mild lack of fit of the propensity scores in these two regions can lead to large bias, yet the LR model often estimates poorly in exactly these regions (Pregibon 1982). Our method works well in the simulation study because it avoids these two extreme cases and shrinks outlying weights toward more sensible values. That is, if \(Z=1\) and \(\hat{e}_1(\mathbf{X})\approx 0\), then \(\hat{e}(\mathbf{X}, \hat{\lambda })\approx \hat{e}_2(\mathbf{X})\) in the proposed estimator; likewise, if \(Z=0\) and \(\hat{e}_1(\mathbf{X})\approx 1\), then \(\hat{e}(\mathbf{X}, \hat{\lambda })\approx \hat{e}_2(\mathbf{X})\).

To further illustrate this point, we revisited a simulation study by Kang and Schafer (2007), whose purpose is to estimate a population mean in the presence of missing data. The authors argue that propensity score based methods perform badly for this task partly because of the extreme weights that arise when the propensity score model is misspecified. We re-ran the simulation to check whether our proposed method can improve the performance of propensity score based methods.

The simulation setup in Kang and Schafer (2007) is as follows: the covariates \(\mathbf{X}=(X_1,X_2,X_3,X_4)\) are generated from \(MVN(0, I_{4\times 4})\) and the response Y is generated from
$$\begin{aligned} Y=210+27.4X_1+13.7X_2+13.7X_3+13.7X_4+\epsilon , \epsilon \sim N(0,1) \end{aligned}$$
and the true propensity scores are:
$$\begin{aligned} e(\mathbf{X})=P(Z=1|\mathbf{X})=\frac{1}{1+\text {exp}\{-(-X_1+0.5X_2-0.25X_3-0.1X_4)\}}, \end{aligned}$$
where \(z_i=1\) if \(y_i\) is observed and \(z_i=0\) if \(y_i\) is missing. In addition, the authors assume that, instead of observing \(x_{ij}, i=1,\dots ,n, j=1,\dots ,4\), the following covariates are observed: \(m_{i1}=\text {exp}(x_{i1}/2), m_{i2}=x_{i2}/(1+\text {exp}(x_{i2}))+10, m_{i3}=(x_{i1}x_{i3}/25+0.6)^3\), \(m_{i4}=(x_{i2}+x_{i4}+20)^2\). The objective is to estimate \(\mu =E(Y)\) based on the respondents (those with \(Z=1\)) using IPW, stratification and bias-corrected (double-robust) estimators. Here we focus on double-robust estimation; details of the calculation can be found in Kang and Schafer (2007). This approach relies on the estimation of the propensity scores as well as the outcome model. In the simulation, we fit the propensity score model by LR, random forests, GBM and the proposed methods, based on \(x_{ij}\) and \(m_{ij}\) separately. Note that the true propensity score model is the LR model of \(z_i\) on the \(x_{ij}\).
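This data-generating process is easy to reproduce; the sketch below follows the formulas as stated in the text (with the propensity score in standard logistic form):

```python
import numpy as np

rng = np.random.default_rng(3)

def kang_schafer(n):
    """One data set from the Kang and Schafer (2007) design described above:
    true covariates x, outcome y, response indicator z drawn from the true
    (logistic) propensity score, and the transformed covariates m that the
    analyst actually observes."""
    x = rng.standard_normal((n, 4))
    y = (210 + 27.4 * x[:, 0] + 13.7 * (x[:, 1] + x[:, 2] + x[:, 3])
         + rng.standard_normal(n))
    eta = -x[:, 0] + 0.5 * x[:, 1] - 0.25 * x[:, 2] - 0.1 * x[:, 3]
    e = 1.0 / (1.0 + np.exp(-eta))             # true propensity score
    z = rng.binomial(1, e)                     # z = 1: y observed
    m = np.column_stack([
        np.exp(x[:, 0] / 2),
        x[:, 1] / (1 + np.exp(x[:, 1])) + 10,  # transformation as stated in the text
        (x[:, 0] * x[:, 2] / 25 + 0.6) ** 3,
        (x[:, 1] + x[:, 3] + 20) ** 2,
    ])
    return x, m, z, y, e
```

Fitting a correctly specified LR model to \((z, x)\) recovers the true scores, while fitting to \((z, m)\) produces the misspecification studied below.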
We run the simulation over 1000 samples with different sample sizes (\(n=200, 500, 1000\)) and report the bias, standard deviation (SD), square root of the mean squared error (RMSE) and median absolute error (MAE) for \(\hat{\mu }\). The first thing we notice is that quite a few respondents have propensity scores estimated as 0 by random forests when the \(m_{ij}\) are used. As a result, their inverse probability weights are infinite, which makes it impossible to produce an estimate of \(\mu \) using random forests alone. Our proposed method still works, however, because in equation (8) of our article, if \(\hat{e}_2(\mathbf{X})=0\) and \(Z=1\), the weight placed on \(\hat{e}_2(\mathbf{X})\) is zero and \(\hat{e}(\mathbf{X}, \hat{\lambda })=\hat{e}_1(\mathbf{X})\). Therefore, in the following analysis, we compare the results of LR, DAMS1 and DAMS2. For comparison, we also show the performance of a recently proposed approach, the covariate balancing propensity score (CBPS) of Imai and Ratkovic (2014), which estimates propensity scores by achieving balance in the covariates. The results are shown in Table 4.
Table 4
Simulation results for double-robust estimators of \(\mu \)

Sample size  PS model             Method  Bias     SD      RMSE    MAE
\(n=200\)    Fit with \(x_{ij}\)  LR      0.023    2.556   2.555   1.700
                                  DAMS1   0.024    2.556   2.555   1.705
                                  DAMS2   0.022    2.556   2.554   1.706
                                  CBPS    0.023    2.555   2.554   1.704
             Fit with \(m_{ij}\)  LR      −4.264   7.772   8.862   3.395
                                  DAMS1   −1.532   3.484   3.804   2.406
                                  DAMS2   −1.036   3.191   3.353   2.249
                                  CBPS    −2.606   3.389   4.274   2.841
\(n=500\)    Fit with \(x_{ij}\)  LR      −0.038   1.588   1.588   1.136
                                  DAMS1   −0.037   1.588   1.587   1.144
                                  DAMS2   −0.038   1.588   1.588   1.145
                                  CBPS    −0.038   1.588   1.588   1.137
             Fit with \(m_{ij}\)  LR      −6.012   8.553   10.451  4.031
                                  DAMS1   −1.722   2.153   2.757   1.812
                                  DAMS2   −1.388   1.985   2.421   1.654
                                  CBPS    −3.208   2.310   3.952   3.157
\(n=1000\)   Fit with \(x_{ij}\)  LR      0.054    1.126   1.127   0.753
                                  DAMS1   0.054    1.126   1.127   0.752
                                  DAMS2   0.054    1.127   1.127   0.751
                                  CBPS    0.054    1.126   1.127   0.751
             Fit with \(m_{ij}\)  LR      −7.426   10.396  12.772  4.892
                                  DAMS1   −1.880   1.723   2.550   1.786
                                  DAMS2   −1.539   1.373   2.062   1.586
                                  CBPS    −3.517   1.719   3.914   3.438

For double-robust estimators based on the true covariates \(x_{ij}\) (see Table 4), the performances of LR and our proposed methods are very similar. The LR model we fit here is the true underlying propensity score model, which implies that when the LR model is correct, combining it with a nonparametric model (RF or GBM) does not harm performance. Based on one simulated data set, we find that the correlation between the propensity scores estimated by logistic regression and GBM is 0.9, and the average weight placed on logistic regression is 0.52 versus 0.48 for GBM. When the misspecified covariates \(m_{ij}\) are observed, the estimates by logistic regression are highly biased and highly variable, partially due to “occasional highly erratic estimates produced by a few enormous weights” (Kang and Schafer 2007). In this case, our proposed methods greatly reduce the variance and yield less bias. Compared to LR, they reduce the RMSEs by 57.1 % (\(n=200\), DAMS1), 62.2 % (\(n=200\), DAMS2), 73.6 % (\(n=500\), DAMS1), 76.8 % (\(n=500\), DAMS2), 80.0 % (\(n=1000\), DAMS1) and 83.9 % (\(n=1000\), DAMS2). When the covariates are misspecified, CBPS improves on LR but has larger bias than DAMS1 and DAMS2 because it achieves balance in the misspecified covariates.

6 Data analysis example

In this section, we apply the proposed method to a case study. The data set was obtained from a study of 3894 patients with intrahepatic cholangiocarcinomas (IHC; Shinohara et al. 2008). In this study, the response variable is the survival time after diagnosis with IHC and the treatments are: surgery and radiation, surgery only, radiation only, and no treatment. More than 50 % of the patients received no treatment because they had advanced disease, which makes surgical removal of the tumor difficult, and it is unclear whether radiation really helps. The Kaplan–Meier estimates of the survival functions by treatment group are displayed in Fig. 1. Further descriptive statistics can be found in Online Resource (Tables 1 and 2); there it was seen that the distributions of age, race, tumor stage, tumor grade and year of diagnosis differ significantly among the treatment groups.
Fig. 1

Kaplan–Meier estimators of survival functions by different treatment groups

We next fit a sequence of survival models to check whether any treatment, baseline characteristic or tumor characteristic has a statistically significant effect on the survival time of patients with IHC. As described in Online Resource (Table 3), Age, Treatment, Race/ethnicity, Grade, Stage and Year of diagnosis are all significant predictors of survival from IHC. We further fit a multivariate Cox proportional hazards model with all the significant predictors found in the univariate setting. Since Grade has too many missing values (68.67 %), we excluded this variable from the multivariate model; Stage has a 29.56 % missing rate, but we kept it in the model. Based on the multivariate PH analysis in Online Resource (Table 3), we take Age, Race/ethnicity, Stage and Year of diagnosis to be potential confounders in this study.

Since the purpose of this study is to examine whether the additional use of radiation therapy reduces mortality from IHC, we conduct two separate analyses: in the first, we compare surgery and radiation with surgery only; in the second, we compare radiation only with no treatment. We focus on the estimation of ACE. Note that in the first analysis, surgery with radiation is the treatment and surgery only is the control. For each analysis, we wish to determine whether the use of radiation therapy is associated with longer survival. Researchers usually analyze such data using the Cox proportional hazards (PH) model, which models the hazard function conditional on treatment and baseline covariates:
$$\begin{aligned} \lambda (t)= \lambda _0 (t) \text {exp} \{Z \gamma + \mathbf{X}^T\beta \}, \end{aligned}$$
(12)
where \(\lambda (t)\) is the hazard rate at time \(t\), \(\lambda _0 (t)\) is the baseline hazard rate at time \(t\), \(Z\) is the treatment indicator and \(\mathbf{X}\) is the vector of baseline covariates. Due to the presence of confounders in the study, the IPW method can be used to adjust for confounding. We therefore assign each patient a weight by the IPW scheme: for the treated, the weight is \(1/\hat{e}(\mathbf{X}, \hat{\lambda })\) and for the control, the weight is \(1/(1-\hat{e}(\mathbf{X}, \hat{\lambda }))\), where \(\hat{e}(\mathbf{X}, \hat{\lambda })\) is the proposed data-adaptive matching score. For comparison, we also fit the propensity score model using logistic regression, random forests and GBM separately.
We notice that, for eight patients in the treatment groups, the propensity scores are estimated to be 0 by the random forests model, which makes their inverse probability weights infinite. As a result, when we employ IPW by random forests to estimate the treatment effect, we have to exclude these patients from the analysis. This ad hoc modification can be problematic because these patients carry important information. In Table 5, we denote this method by \(\text {RF}^1\). In addition, the random forests model yields some extremely large weights (see Fig. 2). We therefore employ an alternative approach to deal with this problem: we shrink the top 5 % of the weights (including the infinite weights) to the 95th percentile, denoted by \(\text {RF}^2\) in Table 5. As can be seen from both analyses, \(\text {RF}^1\) has the largest standard error of \(\hat{\gamma }\) (\(e^{\hat{\gamma }}\) is the estimated hazard ratio of the treatment) and the largest ASAM values among all the approaches. In analysis 1, the covariate balance under \(\text {RF}^1\) is even worse than without weighting (0.177 vs. 0.123). Shrinking the large weights to the 95th percentile (\(\text {RF}^2\)) greatly improves the performance, but it remains worse than DAMS1, which is the weighted average of LR and random forests. Compared to the multivariate model and the random forests models, the IPW adjusted models by LR and GBM have smaller variances. DAMS2 achieves the best overall performance in terms of variance and covariate balance. In conclusion, the use of radiation therapy is found to significantly improve survival by either the multivariate model or the propensity score adjusted models, consistent with what has been reported in Shinohara et al. (2008).
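The \(\text {RF}^2\)-style weight shrinkage can be sketched in a couple of lines (a hedged sketch of the ad hoc fix described above):

```python
import numpy as np

def shrink_weights(w, q=95):
    """Shrink the largest inverse probability weights (including infinite
    ones from zero estimated propensity scores) to the q-th percentile of
    the finite weights."""
    cap = np.percentile(w[np.isfinite(w)], q)
    return np.minimum(w, cap)

# An infinite weight (from e-hat = 0) and a huge finite weight both get capped:
w = np.array([1.1, 1.3, 2.0, 2.5, 3.0, 4.0, 8.0, 50.0, np.inf])
print(shrink_weights(w))
```

Unlike excluding the affected subjects (\(\text {RF}^1\)), capping keeps every observation in the analysis while bounding its influence.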
Fig. 2

The boxplots of inverse probability weights calculated from each model

Table 5
Survival analysis results

                                         \(\hat{\gamma }\) (SE)  HR (95 % CI)      ASAM
Radiation with surgery versus surgery only
 Multivariate model                      −0.3236 (0.0865)        0.72 (0.61–0.86)  0.123
 IPW adjusted model by LR                −0.2165 (0.0853)        0.81 (0.68–0.95)  0.015
 IPW adjusted model by \(\text {RF}^1\)  −0.1246 (0.1255)        0.88 (0.69–1.13)  0.177
 IPW adjusted model by \(\text {RF}^2\)  −0.2401 (0.0897)        0.79 (0.66–0.94)  0.044
 IPW adjusted model by GBM               −0.2045 (0.0818)        0.82 (0.69–0.96)  0.021
 IPW adjusted model by DAMS1             −0.2353 (0.0882)        0.78 (0.66–0.94)  0.020
 IPW adjusted model by DAMS2             −0.2098 (0.0814)        0.81 (0.69–0.95)  0.014
Radiation only versus no treatment
 Multivariate model                      −0.5189 (0.0747)        0.60 (0.51–0.69)  0.152
 IPW adjusted model by LR                −0.4646 (0.0666)        0.63 (0.55–0.72)  0.035
 IPW adjusted model by \(\text {RF}^1\)  −0.5745 (0.1079)        0.56 (0.46–0.70)  0.116
 IPW adjusted model by \(\text {RF}^2\)  −0.5536 (0.0744)        0.57 (0.50–0.67)  0.073
 IPW adjusted model by GBM               −0.4620 (0.0655)        0.63 (0.55–0.72)  0.027
 IPW adjusted model by DAMS1             −0.4862 (0.0682)        0.61 (0.54–0.70)  0.027
 IPW adjusted model by DAMS2             −0.4677 (0.0651)        0.63 (0.55–0.71)  0.024

For the above results to be valid, one needs to assume that all confounders in the study are measured. In practice, sensitivity analyses are often conducted to investigate how unmeasured covariates could affect the inferred causal effect. A sensitivity analysis using the approaches of Lin et al. (1998) and Mitra and Heitjan (2007) was performed and the results are given in Online Resource (Table 4); this sensitivity analysis demonstrates that the results obtained from our proposed method are robust against an unknown confounder not included in the model. In Online Resource, we also show the performance of the proposed method through a simulation study for survival data.

7 Discussion

In this article, we have developed a class of weighted estimators with desirable properties for estimating the average causal effect. The approach combines the traditional parametric model with more recently developed nonparametric machine learning models in estimating propensity scores. We proposed a weighted average of LR and a machine learning algorithm (random forests or GBM) with a properly chosen weight. The proposed methodology is similar in spirit to super learner (van der Laan et al. 2007) and targeted learning (van der Laan and Rose 2011). The first simulation shows that the newly proposed method reduces the variance of the estimates, as well as the bias in most cases. The second simulation study demonstrates that when the LR model is misspecified, our proposed method can shrink large weights to produce less biased and less variable estimates. When the machine learning algorithm fails because of infinite weights, both the second simulation study and the data analysis example show that our proposed method can still work properly.
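The core idea, combining two propensity score models and plugging the mixture into an IPW estimator, can be sketched as below. This is an illustrative outline, not the paper's exact estimator: it uses the normalized (Hájek-type) IPW form, takes the two fitted propensity scores as given, and treats the mixing weight `lam` as already chosen (the paper selects it data-adaptively).

```python
def combined_ipw_ate(y, t, ps_lr, ps_ml, lam):
    """Normalized IPW estimate of the average treatment effect using a
    convex combination of two propensity score models:
        e(x) = lam * e_LR(x) + (1 - lam) * e_ML(x).
    y: outcomes, t: binary treatment indicators,
    ps_lr / ps_ml: fitted propensity scores from the two models."""
    num1 = den1 = num0 = den0 = 0.0
    for yi, ti, p1, p2 in zip(y, t, ps_lr, ps_ml):
        e = lam * p1 + (1 - lam) * p2  # combined propensity score
        if ti == 1:
            num1 += yi / e
            den1 += 1.0 / e
        else:
            num0 += yi / (1.0 - e)
            den0 += 1.0 / (1.0 - e)
    return num1 / den1 - num0 / den0
```

A moderate `lam` can pull an extreme machine-learning propensity score (near 0 or 1) back toward the parametric fit, which is what keeps the combined weights from exploding.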

In causal inference problems, propensity scores are nuisance parameters and what we are really interested in is estimating the causal treatment effect. One way to evaluate the performance of the newly proposed approach for modeling propensity scores is to see, via simulations, how close the estimates are to the true propensity scores. However, Lunceford and Davidian (2004) showed that conditioning on the estimated propensity score rather than the true propensity score can yield a smaller variance of the estimated ACE or ACET. Therefore, we should focus on the quality of the causal estimates. An alternative way to evaluate the propensity score model is to focus on covariate balance. The underlying idea is that by achieving balance, the bias in the treatment effect estimate due to measured covariates can be reduced (Harder et al. 2010). Recently, some literature has focused on estimating propensity scores by directly targeting balance in the covariates. Examples are McCaffrey et al. (2004), Hainmueller (2012) and Imai and Ratkovic (2014). Following the same idea, future work could develop a formula for \(\lambda \) in (7) that optimizes the balance in the covariates.
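The balance metric reported in Table 5, ASAM (average standardized absolute mean difference), can be computed roughly as follows. This is a hedged sketch: conventions vary, and here we standardize by the unweighted treated-group standard deviation, one common choice, which may differ from the paper's exact definition.

```python
import math

def asam(X, t, w):
    """Average standardized absolute mean difference over covariates.
    X: list of covariate vectors (one list per covariate),
    t: binary treatment indicators, w: inverse probability weights."""
    treated = [i for i, ti in enumerate(t) if ti == 1]
    control = [i for i, ti in enumerate(t) if ti == 0]

    def wmean(x, idx):
        sw = sum(w[i] for i in idx)
        return sum(w[i] * x[i] for i in idx) / sw

    total = 0.0
    for x in X:
        # standardize by the unweighted treated-group SD (one common choice)
        mt = sum(x[i] for i in treated) / len(treated)
        sd = math.sqrt(sum((x[i] - mt) ** 2 for i in treated)
                       / (len(treated) - 1))
        total += abs(wmean(x, treated) - wmean(x, control)) / sd
    return total / len(X)
```

Smaller values indicate better weighted balance; in Table 5, for instance, DAMS2 attains ASAM 0.014 versus 0.123 for the unweighted comparison.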

It should be noted that the proposed procedure is tailored to estimators of causal effects that use IPW. In particular, as Kang and Schafer (2007) showed, IPW procedures can suffer from poor performance due to model misspecification, which manifests itself in observations with extreme weights. In this regard, the proposed methodology can be viewed as developing "robust" weights that incorporate all observations while simultaneously keeping weights from becoming too extreme. While practitioners of causal inference know that observations with extreme weights need to be downweighted when estimating causal effects, many of the solutions proposed have been ad hoc, whereas our procedure is more principled.

Although the IHC study in our data analysis example has four treatments, we divided them into two groups and focused on causal inference for binary treatments. It would also be desirable to explore how to improve causal inference in the regime of multi-level treatments and extend the work by Imai and van Dyk (2004), Tchernis et al. (2005) and McCaffrey et al. (2013).

Acknowledgments

The authors thank Brian Lee for making his code available. The work of Zhu and Ghosh was supported by the National Institute on Drug Abuse Grant P50 DA010075-16 and NCI Grant CA 129102. The work of Mukherjee was supported by NSF Grant DMS-1007494 and NIH/NCI Grant CA156608. The content of this manuscript is solely the responsibility of the author(s) and does not necessarily represent the official views of the National Institute on Drug Abuse or the National Institutes of Health. Mitra would like to acknowledge Eric Shinohara, MD for making the cholangiocarcinoma data available to us.

Supplementary material

Supplementary material 1 (10742_2014_124_MOESM1_ESM.tex, 10 kb)

References

  1. Biau, G., Devroye, L., Lugosi, G.: Consistency of random forests and other averaging classifiers. J. Mach. Learn. Res. 9, 2015–2033 (2008)
  2. Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
  3. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
  4. Breiman, L., Friedman, J., Stone, C., Olshen, R.: Classification and Regression Trees. Chapman & Hall/CRC, Boca Raton (1984)
  5. Brookhart, M.A., van der Laan, M.J.: A semiparametric model selection criterion with applications to the marginal structural model. Comput. Stat. Data Anal. 50(2), 475–498 (2006)
  6. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
  7. Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997)
  8. Hainmueller, J.: Entropy balancing for causal effects: a multivariate reweighting method to produce balanced samples in observational studies. Political Anal. 20(1), 25–46 (2012)
  9. Harder, V.S., Stuart, E.A., Anthony, J.C.: Propensity score techniques and the assessment of measured covariate balance to test causal associations in psychological research. Psychol. Methods 15(3), 234–249 (2010)
  10. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2009)
  11. Hoeting, J., Madigan, D., Raftery, A., Volinsky, C.: Bayesian model averaging: a tutorial. Stat. Sci. 14(4), 382–401 (1999)
  12. Imai, K., Ratkovic, M.: Covariate balancing propensity score. J. R. Stat. Soc.: Ser. B (Statistical Methodology) 76(1), 243–263 (2014)
  13. Imai, K., van Dyk, D.: Causal inference with general treatment regimes. J. Am. Stat. Assoc. 99(467), 854–866 (2004)
  14. Kang, J.D.Y., Schafer, J.L.: Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat. Sci. 22(4), 523–539 (2007)
  15. Kouassi, D.A., Singh, J.: A semiparametric approach to hazard estimation with randomly censored observations. J. Am. Stat. Assoc. 92(440), 1351–1355 (1997)
  16. Lee, B.K., Lessler, J., Stuart, E.A.: Improving propensity score weighting using machine learning. Stat. Med. 29(3), 337–346 (2010)
  17. Lin, D., Psaty, B., Kronmal, R.: Assessing the sensitivity of regression results to unmeasured confounders in observational studies. Biometrics 54(3), 948–963 (1998)
  18. Lunceford, J.K., Davidian, M.: Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat. Med. 23(19), 2937–2960 (2004)
  19. Mays, J.E., Birch, J.B., Starnes, B.A.: Model robust regression: combining parametric, nonparametric, and semiparametric methods. J. Nonparametric Stat. 13(2), 245–277 (2001)
  20. McCaffrey, D.F., Griffin, B.A., Almirall, D., Slaughter, M.E., Ramchand, R., Burgette, L.F.: A tutorial on propensity score estimation for multiple treatments using generalized boosted models. Stat. Med. 32(19), 3388–3414 (2013)
  21. McCaffrey, D.F., Ridgeway, G., Morral, A.R.: Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychol. Methods 9(4), 403–425 (2004)
  22. Mitra, N., Heitjan, D.F.: Sensitivity of the hazard ratio to nonignorable treatment assignment in an observational study. Stat. Med. 26(6), 1398–1414 (2007)
  23. Nottingham, Q.J., Birch, J.B.: A semiparametric approach to analysing dose-response data. Stat. Med. 19(3), 389–404 (2000)
  24. Olkin, I., Spiegelman, C.H.: A semiparametric approach to density estimation. J. Am. Stat. Assoc. 82(399), 858–865 (1987)
  25. Pregibon, D.: Resistant fits for some commonly used logistic models with medical applications. Biometrics 38, 485–498 (1982)
  26. Ridgeway, G.: The state of boosting. Comput. Sci. Stat. 31, 172–181 (1999)
  27. Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41–55 (1983)
  28. Rubin, D.B.: Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 66(5), 688–701 (1974)
  29. Setoguchi, S., Schneeweiss, S., Brookhart, M.A., Glynn, R.J., Cook, E.F.: Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiol. Drug Saf. 17(6), 546–555 (2008)
  30. Shinohara, E.T., Mitra, N., Guo, M., Metz, J.M.: Radiation therapy is associated with improved survival in the adjuvant and definitive treatment of intrahepatic cholangiocarcinoma. Int. J. Radiat. Oncol. Biol. Phys. 72(5), 1495–1501 (2008)
  31. Stefanski, L.A., Boos, D.D.: The calculus of M-estimation. Am. Stat. 56(1), 29–38 (2002)
  32. Tchernis, R., Horvitz-Lennon, M., Normand, S.L.T.: On the use of discrete choice models for causal inference. Stat. Med. 24(14), 2197–2212 (2005)
  33. van der Laan, M.J., Polley, E.C., Hubbard, A.E.: Super learner. Stat. Appl. Genet. Mol. Biol. 6(1), 1–21 (2007)
  34. van der Laan, M.J., Rose, S.: Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, New York (2011)
  35. White, H.: Maximum likelihood estimation of misspecified models. Econometrica 50(1), 1–25 (1982)
  36. Yang, Y.: Adaptive regression by mixing. J. Am. Stat. Assoc. 96(454), 574–588 (2001)
  37. Yuan, Z., Ghosh, D.: Combining multiple biomarker models in logistic regression. Biometrics 64(2), 431–439 (2008)
  38. Yuan, Z., Yang, Y.: Combining linear regression models. J. Am. Stat. Assoc. 100(472), 1202–1214 (2005)
  39. Zhang, T., Yu, B.: Boosting with early stopping: convergence and consistency. Ann. Stat. 33(4), 1538–1579 (2005)

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Yeying Zhu (1)
  • Debashis Ghosh (2)
  • Nandita Mitra (3)
  • Bhramar Mukherjee (4)

  1. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Canada
  2. Department of Statistics, Pennsylvania State University, University Park, USA
  3. Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, USA
  4. Department of Biostatistics, University of Michigan, Ann Arbor, USA