# Propensity Score Modeling and Evaluation

Chapter
Part of the ICSA Book Series in Statistics book series (ICSABSS)

## Abstract

In causal inference for binary treatments, the propensity score is defined as the probability of receiving the treatment given covariates. Under the ignorability assumption, causal treatment effects can be estimated by conditioning on/adjusting for the propensity scores. However, in observational studies, propensity scores are unknown and need to be estimated from the observed data. Estimation of propensity scores is essential in making reliable causal inference. In this chapter, we first briefly discuss the modeling of propensity scores for a binary treatment; then we will focus on the estimation of the generalized propensity scores for categorical treatment variables with more than two levels and continuous treatment variables. We will review both parametric and nonparametric approaches for estimating the generalized propensity scores. In the end, we discuss how to evaluate the performance of different propensity score models and how to choose an optimal one among several candidate models.

## 1 Propensity Score Modeling for a Binary Treatment

The potential outcomes framework [23] has been a popular framework for estimating causal treatment effects. An important quantity to facilitate causal inference has been the propensity score [22], defined as the probability of receiving the treatment given a set of measured covariates. In observational studies, propensity scores are unknown and need to be estimated from the observed data. Consistent estimation of propensity scores is essential in making reliable causal inference. In this section, we briefly review the modeling of propensity scores for a binary treatment variable.

We first define some notation. Let Y denote the response of interest, T be the treatment variable, and X be a p-dimensional vector of baseline covariates. The data can be represented as (Yi, Ti, Xi), i = 1, …, n, a random sample from (Y, T, X). In addition to the observed quantities, we further define Yi(t) as the potential outcome if subject i were assigned to treatment level t. Here, T is a random variable and t is a specific level of T. In the case of a binary treatment, let T = 1 if treated and T = 0 if untreated. The propensity score is then defined as r(X) ≡ P(T = 1 | X). The quantities we are interested in estimating are usually the average treatment effect (ATE):
$$\displaystyle{\mathrm{ATE} = E[Y (1) - Y (0)],}$$
and the average treatment effect among the treated (ATT):
$$\displaystyle{\mathrm{ATT} = E[Y (1) - Y (0)\vert T = 1].}$$

### 1.1 Parametric Approaches

In the causal inference literature, the propensity score for a binary treatment variable is usually estimated by logistic regression, which can be easily implemented in R. However, logistic regression is not without drawbacks. First, a parametric form of r(X) needs to be specified, and consistent estimation of the ATE and ATT relies on the logistic regression model being correct. In most cases, including only main effects in the model is not adequate, but it is also hard to determine which interaction terms should be included, especially when the vector of covariates is high-dimensional. In addition, logistic regression is not resistant to outliers [11, 18]. In particular, Kang and Schafer [11] show that when the logistic regression model is mildly misspecified, propensity score-based approaches lead to large bias and variance of the estimated treatment effects.
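As a minimal sketch of the logistic regression approach, the following Python code fits a main-effects logistic model by plain gradient ascent and returns the estimated propensity scores. The simulated data, the function names (`sigmoid`, `fit_logistic`), the step size, and the iteration count are all illustrative assumptions; in practice one would rely on a standard routine such as `glm` in R.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, T, lr=0.5, n_iter=500):
    """Maximize the logistic log-likelihood by plain gradient ascent.

    X is a list of covariate rows that already include a leading 1.
    The step size and iteration count are illustrative choices.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        for row, t in zip(X, T):
            resid = t - sigmoid(sum(b * xj for b, xj in zip(beta, row)))
            for j in range(p):
                grad[j] += resid * row[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Hypothetical simulated data: one covariate, true slope equal to 1.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(300)]
T = [1 if rng.random() < sigmoid(xi) else 0 for xi in x]
X = [[1.0, xi] for xi in x]

beta_hat = fit_logistic(X, T)
# Estimated propensity scores r(X_i) = P(T = 1 | X_i).
r_hat = [sigmoid(sum(b * xj for b, xj in zip(beta_hat, row))) for row in X]
```

Because the model contains an intercept, the fitted scores average to the observed treatment fraction at convergence, which gives a quick sanity check on the fit.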

Other parametric approaches for estimating propensity scores include probit regression and linear discriminant analysis, both of which assume normality. Through a simulation study, however, Zhu et al. [31] found that these parametric models give very similar treatment effect estimates.

### 1.2 Machine Learning Techniques

Due to the above-mentioned drawbacks of parametric approaches for modeling propensity scores, more recent literature advocates using machine learning algorithms to model propensity scores [13, 24]. In causal inference, propensity scores are auxiliary in the sense that one usually is not interested in interpreting or making inference for the propensity score model, so nonparametric black-box algorithms can be directly used to estimate them. Examples are classification and regression trees (CART, [2]) and its various extensions, such as pruned CART, bagged CART, random forests (RF, [1]), and boosting [16]. Other classification methods that can indirectly yield class probability estimates include support vector machines (SVM) and K-nearest neighbors (KNN). R packages are readily available, such as rpart for CART, randomForest for RF, twang or gbm for boosting models, and e1071 for SVM. A detailed review of each approach for estimating propensity scores can be found in [31]. In a simulation study, Zhu et al. [31] found that there is a trade-off between bias and variance among parametric and nonparametric approaches. More specifically, parametric methods tend to yield lower bias but higher variance than nonparametric methods for estimating ATE and ATT.

### 1.3 Propensity Score Modeling via Balancing Covariates

Recently, a new propensity score modeling approach termed the covariate balancing propensity score (CBPS) was proposed by Imai and Ratkovic [8]. It also assumes a logistic regression model, i.e.,
$$\displaystyle{ r(X) \equiv r_{\beta }(X) = \frac{1} {1 + \mbox{ exp}\{ -\beta ^{'}X\}}. }$$
(6.1)
Then, β is solved by satisfying the following condition:
$$\displaystyle{ \mbox{ E}\left \{ \frac{T\widetilde{X}} {r_{\beta }(X)} -\frac{(1 - T)\widetilde{X}} {1 - r_{\beta }(X)} \right \} = 0, }$$
(6.2)
where $$\widetilde{X}$$ is a function of X specified by the researcher. Setting $$\widetilde{X} = \frac{dr_{\beta }(X)} {d\beta }$$ yields the maximum likelihood estimator (MLE) of β, because Eq. (6.2) is then the score equation of the logistic likelihood. Setting $$\widetilde{X} = X$$ instead aims to achieve optimal balance in the first moment of the covariates, because this balancing condition implies that the weighted mean value of each covariate is the same between the treatment and the control group. If both $$\widetilde{X} = \frac{dr_{\beta }(X)} {d\beta }$$ and $$\widetilde{X} = X$$ are imposed at the same time, there are more equations than unknown parameters, and the generalized method of moments [5] is employed for estimation. The above balancing condition is for the estimation of ATE. For estimating ATT, the balancing condition becomes
$$\displaystyle{ \mbox{ E}\left \{T\widetilde{X} -\frac{r_{\beta }(X)(1 - T)\widetilde{X}} {1 - r_{\beta }(X)} \right \} = 0. }$$
(6.3)
The advantage of this approach is that, by achieving better balance in the covariates, it is less susceptible to model misspecification of the propensity scores, compared to logistic regression.
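To make the balancing conditions concrete, the sketch below evaluates the sample analogs of (6.2) and (6.3) with $$\widetilde{X} = X$$ for a given β. It only computes the moment vector; solving for β (or running the generalized method of moments) is not shown. The function name `balance_moments` and the toy data are assumptions made for illustration.

```python
import math

def balance_moments(beta, X, T, estimand="ATE"):
    """Sample analog of the balancing condition with X-tilde = X.

    Returns one moment per column of X; all entries are zero when the
    weighted covariate means agree between treated and control subjects.
    Rows of X include a leading 1 for the intercept.
    """
    n, p = len(X), len(X[0])
    moments = [0.0] * p
    for row, t in zip(X, T):
        r = 1.0 / (1.0 + math.exp(-sum(b * xj for b, xj in zip(beta, row))))
        if estimand == "ATE":
            weight = t / r - (1 - t) / (1.0 - r)      # Eq. (6.2)
        else:
            weight = t - r * (1 - t) / (1.0 - r)      # Eq. (6.3), ATT
        for j in range(p):
            moments[j] += weight * row[j] / n
    return moments

# Toy data (hypothetical): with beta = 0 every propensity score equals 0.5.
X_balanced = [[1.0, 1.0], [1.0, 3.0], [1.0, 1.0], [1.0, 3.0]]
T_balanced = [1, 1, 0, 0]
balanced = balance_moments([0.0, 0.0], X_balanced, T_balanced)

X_unbalanced = [[1.0, 2.0], [1.0, 0.0]]
T_unbalanced = [1, 0]
unbalanced = balance_moments([0.0, 0.0], X_unbalanced, T_unbalanced)
```

In the first toy data set the covariate has the same distribution in both groups, so the moments vanish; in the second the treated subject has a larger covariate value and the second moment is nonzero.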

A related issue is whether we should achieve balance in all the measured covariates in a study or only in a subset of them. This is a variable selection issue. Zhu et al. [32] have shown through a simulation study that one should aim to achieve balance in the real confounders, i.e., covariates related to both the treatment variable and the outcome variable, as well as in the covariates related only to the outcome variable. Adding balancing conditions on covariates that are related only to the treatment variable may increase the bias and variance of the estimated treatment effects.

## 2 Propensity Score Modeling for a Multi-level Treatment

In most of the causal inference literature based on the potential outcomes framework, researchers have focused on binary treatments. Imbens [10] extended this framework to the more general case by defining the generalized propensity score, which is the conditional probability of being assigned to a particular treatment group given the observed covariates. In the past decade, a few studies (e.g., [9, 12, 28]) have extended the propensity score-based approaches to multi-level treatments. Compared with binary treatments, there are two important issues specific to causal inference with multi-level treatments.

The first issue is to define the parameters of interest and to determine whether the parameters are identifiable. As discussed by Imbens [10] and Tchernis et al. [28], for a multi-level treatment, the following parameters may be of interest: (1) the average causal effect of treatment t relative to k, i.e., E[Y (t) − Y (k)]; (2) the average causal effect of treatment t relative to k among those who receive treatment t, i.e., E[Y (t) − Y (k) | T = t]; or (3) the average causal effect of treatment t relative to all other treatments among those who receive treatment t, i.e., $$E[Y (t) - Y (\bar{t})\vert T = t]$$, where $$\bar{t}$$ refers to all treatment groups other than group t. In any of the three definitions, the multi-level treatment variable is dichotomized; in this sense, causal inference with multiple treatments is essentially an extension of the binary case. Therefore, matching, stratification, or inverse probability weighting methods can be employed to estimate the targeted causal effects in a similar way as in binary treatments.

The second issue is that in many studies, the treatments are correlated: the odds ratio of receiving one treatment against another is affected by whether a third treatment is taken into consideration or not. Tchernis et al. [28] pointed out in a simulation study that if the treatments are correlated, ignoring the correlations while estimating propensity scores will lead to biased estimation of the causal effect. The commonly used multinomial logistic regression model does not account for correlation. Therefore, the nested logit model or multinomial probit model has been suggested for modeling propensity scores to allow specification of a correlation matrix among treatments. Due to developments in machine learning methods, nonparametric algorithms such as random forests or boosting algorithms can also be easily implemented to estimate propensity scores for multiple treatments.

We define some additional notation here. Let Ti be the treatment status for the ith subject, so Ti = t if subject i was observed under treatment t ∈ {1, …, M}, where there are M total treatment groups. We further define an indicator variable, indicating membership of a particular treatment group t, as Ai(t) = I(Ti = t), t ∈ {1, …, M}. According to Imai and Van Dyk [9], the generalized propensity score is defined as r(t | X) ≡ Pr(T = t | X), for t = 1, …, M.

### 2.1 Parametric Approaches

In this section, we describe multinomial logistic regression (MLR), which is an extension of logistic regression to cases where the treatment variable has more than two levels. We now assume an underlying multinomial distribution with a probability of inclusion into each treatment group and use maximum likelihood to find the estimates of the regression parameters. The exact steps are as follows:
1. We assume the following model for the generalized propensity scores:
$$\displaystyle{r(t\vert X)_{\mathrm{MLR}} = \frac{1} {1 +\sum _{ s=2}^{M}e^{\beta '_{s}X}}\quad \mathrm{for}\quad t = 1}$$
and
$$\displaystyle{r(t\vert X)_{\mathrm{MLR}} = \frac{e^{\beta '_{t}X}} {1 +\sum _{ s=2}^{M}e^{\beta '_{s}X}}\quad \mathrm{for}\quad t = 2,\ldots,M}$$
2. We maximize the multinomial likelihood function with respect to all the β’s:
$$\displaystyle{L(\beta ) =\prod _{ i=1}^{n}\prod _{ t=1}^{M}r_{ i}(t\vert X)^{A_{i}(t)}}$$
where ri(t | X) follows the model as defined in Step 1. Equivalently, we maximize the log likelihood function:
$$\displaystyle{l(\beta ) =\sum _{ i=1}^{n}\sum _{ t=1}^{M}A_{ i}(t)\log (r_{i}(t\vert X)).}$$
3. The solution $$\hat{\beta }_{s}$$ for s = 2, …, M is substituted into the model to obtain the estimates for the generalized propensity score.

While MLR is a seemingly simple way to estimate the generalized propensity score, there remains the question of variable selection and of which interactions to include. In addition, Tchernis et al. [28] pointed out that MLR does not take into account the correlation among treatments in the sense that for two treatment levels t ≠ s, we have
$$\displaystyle{ \frac{r(t\vert X)_{\mathrm{MLR}}} {r(s\vert X)_{\mathrm{MLR}}} = e^{(\beta _{t}-\beta _{s})'X},}$$
which does not depend on the information of other treatment levels. This assumption could be violated in real applications, in which case an MLR model is not suitable for estimating the generalized propensity scores.
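The ratio property can be checked numerically. The following sketch computes the MLR scores from Step 1 for hypothetical coefficient values and shows that the ratio of the scores for levels 2 and 3 is unchanged when a fourth treatment level is added. The function name `mlr_scores` and all coefficient values are assumptions for illustration only.

```python
import math

def mlr_scores(betas, x):
    """Generalized propensity scores under the MLR model of Step 1.

    betas maps each level t = 2, ..., M to its coefficient list;
    level 1 is the reference level (linear predictor fixed at 0).
    """
    linear = {1: 0.0}
    for t, beta_t in betas.items():
        linear[t] = sum(b * xj for b, xj in zip(beta_t, x))
    denom = sum(math.exp(v) for v in linear.values())
    return {t: math.exp(v) / denom for t, v in linear.items()}

x = [1.0, 0.5]  # intercept term and one covariate (hypothetical values)
three_levels = mlr_scores({2: [0.2, 1.0], 3: [-0.4, 0.3]}, x)
four_levels = mlr_scores({2: [0.2, 1.0], 3: [-0.4, 0.3], 4: [1.5, -2.0]}, x)

# Both ratios equal exp((beta_2 - beta_3)'x), whatever other levels exist.
ratio_three = three_levels[2] / three_levels[3]
ratio_four = four_levels[2] / four_levels[3]
```

The individual scores do change when the fourth level is introduced; only their ratio is fixed, which is exactly the restriction that a nested logit or multinomial probit model relaxes.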

In R, to fit an MLR model, we can use the package nnet [29].

### 2.2 Machine Learning Techniques

In this section, we are going to introduce two machine learning approaches for the modeling of generalized propensity scores: generalized boosted model (GBM) and random forests (RF).

GBM uses an iterative procedure that adds together many simple regression trees to approximate the propensity score function. A regression tree algorithm divides the dataset into two non-overlapping regions based on one of the covariates. Then, it recursively divides each of those regions into two smaller regions, where each split is based on one of the covariates [2]. Note that the splits may occur on a different covariate or the same covariate each time. The splits are chosen so that the prediction error is minimized. After the allowed number of splits have occurred, for each region of the dataset, the estimated response value equals the average response values of the data points within the region.

We first describe the GBM method for binary treatments and then extend the procedure to multi-level treatments. McCaffrey et al. [16] provide a detailed algorithm for estimating propensity scores using GBM. In the binary case, let g(X) = log[r(X)∕(1 − r(X))]; the log-likelihood function can then be rewritten as
$$\displaystyle{ l(g) =\sum _{ i=1}^{n}\left (T_{ i}g(X_{i}) -\mbox{ log}\{1 + \mbox{ exp}[g(X_{i})]\}\right ). }$$
(6.4)
To maximize l(g) in (6.4), g(X) is updated at each iteration with g(X) + h(X), where h(X) is the fitted value from a regression tree that models the residuals γi = Ti − 1∕{1 + exp[−g(Xi)]}, which give the direction of the largest increase in (6.4). To avoid overfitting, a shrinkage parameter α is introduced so that the update is g(X) + α h(X), where α is usually a small value, such as 0.0001. This iterative estimation procedure can be tuned to yield propensity scores that achieve optimal balance in covariate distribution between the treatment and control groups. The key is to stop the algorithm at the optimal number of trees, i.e., when a certain balance statistic (e.g., the average standardized absolute mean difference in the covariates) is minimized. Interactions are automatically included when multi-level splits are allowed in regression trees and, since splits are automatically determined by the algorithm based on a criterion, variable selection is automatically done [16].
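The update rule can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the algorithm of [16] or the twang implementation: it uses a single covariate, two-region trees (stumps), a fixed number of trees instead of a balance-based stopping rule, and a larger shrinkage value than the text suggests so that the example runs quickly. All function names and the simulated data are assumptions.

```python
import math
import random

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_lik(T, g):
    """Log-likelihood (6.4) evaluated at the log-odds values g."""
    return sum(t * gi - math.log(1.0 + math.exp(gi)) for t, gi in zip(T, g))

def fit_stump(x, resid):
    """Two-region regression tree on one covariate (assumes x is not constant)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    n, total = len(x), sum(resid)
    best, left_sum = None, 0.0
    for k in range(1, n):
        left_sum += resid[order[k - 1]]
        if x[order[k]] == x[order[k - 1]]:
            continue  # cannot split between tied values
        right_sum = total - left_sum
        # A larger score corresponds to a smaller within-region squared error.
        score = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if best is None or score > best[0]:
            cut = 0.5 * (x[order[k - 1]] + x[order[k]])
            best = (score, cut, left_sum / k, right_sum / (n - k))
    return best[1:]

def gbm_log_odds(x, T, n_trees=200, alpha=0.1):
    """Boosted estimate of g(X); n_trees and alpha are illustrative values."""
    p_bar = sum(T) / len(T)
    g = [math.log(p_bar / (1.0 - p_bar))] * len(T)
    for _ in range(n_trees):
        gamma = [t - expit(gi) for t, gi in zip(T, g)]  # residuals gamma_i
        cut, left, right = fit_stump(x, gamma)
        # Shrunken update g(X) + alpha * h(X).
        g = [gi + alpha * (left if xi <= cut else right) for gi, xi in zip(g, x)]
    return g

# Hypothetical simulated data with one confounder.
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
T = [1 if rng.random() < expit(xi) else 0 for xi in x]

g0 = [math.log(sum(T) / (len(T) - sum(T)))] * len(T)  # intercept-only start
g_hat = gbm_log_odds(x, T)
r_hat = [expit(gi) for gi in g_hat]
```

A balance-based stopping rule would replace the fixed `n_trees` by the iteration at which a chosen balance statistic is smallest.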

McCaffrey et al. [17] extended this algorithm to the multi-level treatment case. We first note that while estimating the generalized propensity score for a particular treatment level t, we are interested in the probability that each subject is assigned to a particular treatment t as opposed to any other treatment. So essentially we have two groups: those assigned to treatment t (equivalent to the treatment group in the binary case), and those that were not assigned to treatment t (equivalent to the control group in the binary case). Then we can fit a GBM that balances the covariates between the treatment t group and the entire sample [17]. We do this for each of the M treatments to obtain the generalized propensity scores $$\hat{r}(t\vert X)$$. The estimation of the generalized propensity scores for multi-level treatment can be realized in the R package twang [19].

The downside to this method is that by fitting separate GBMs for all M treatment groups, it is not guaranteed that the generalized propensity scores for each treatment group will add up to 1. McCaffrey et al. [17] justified that estimating the ATE only requires the propensity scores for the particular treatment groups involved, so as long as the estimated generalized propensity scores are not biased, they do not need to add up to 1.

Next, we introduce the RF model for estimating the generalized propensity scores. An RF model [1] is built on a collection of classification trees, fitted on bootstrap samples of the original dataset. Classification trees are different from regression trees in that classification trees predict the class label for each input vector of covariates and use nonparametric information criteria, such as entropy, misclassification rate, or the Gini index, for splitting at each node. The random forest classification tree finds the best split from only a random subsample of the covariates at each node. Then the estimated generalized propensity score for treatment t is the fraction of votes for t from the collection of the random forest classification trees. The specific random forest algorithm for estimating the generalized propensity score is as follows:
1. Draw a random sample with replacement of size n (size of dataset), called a bootstrap sample, from the dataset.
2. Fit a random forest classification tree to the bootstrap sample.
3. Repeat steps 1 and 2 a large number, B, times and obtain a collection of B classification trees (usually, B = 500).
4. For a given vector of covariates X, predict the class label from each fitted tree. The estimated generalized propensity score is then
$$\displaystyle{\hat{r}(t\vert X)_{\mathrm{RF}} = \frac{\mbox{ number of trees that voted for class}\ \mathit{t}}{B} }$$
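Step 4 of the algorithm above can be sketched as follows. The code assumes that the B class labels predicted by the fitted trees for one covariate vector are already available (fitting the trees themselves is not shown), and the function name `rf_vote_fraction` and the example votes are hypothetical.

```python
from collections import Counter

def rf_vote_fraction(votes, levels):
    """Fraction of trees voting for each treatment level.

    votes  : class labels predicted by the B trees for one covariate vector
    levels : all treatment levels 1, ..., M
    """
    counts = Counter(votes)
    n_trees = len(votes)
    return {t: counts[t] / n_trees for t in levels}

# Hypothetical votes from B = 4 trees for a treatment with M = 3 levels.
scores = rf_vote_fraction([1, 1, 2, 1], levels=[1, 2, 3])
```

In this toy example no tree votes for level 3, so its estimated score is exactly 0.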

An issue with this method is that it is possible for none of the trees to vote for a particular treatment, resulting in an estimated generalized propensity score of 0 for that treatment. Another possibility is that all the trees vote for one treatment, resulting in an estimated generalized propensity score of 1 for that treatment. In both cases, the positivity assumption, i.e., 0 < r(t | X) < 1 for all X and t, is violated. In addition, since inverse probability weighting and double-robust estimation involve the reciprocal of the estimated propensity score or of one minus the estimated propensity score, an estimated score close to 1 or 0 may result in extreme weights. This issue has been frequently discussed in the literature (e.g., [11, 14, 31]). One way to deal with this issue is to trim extreme weights to a percentile. For example, the inverse probability weights higher than the 95th percentile are set to the 95th percentile. Lee et al. [14] showed that trimming extreme weights yields little benefit in terms of bias, standard error, and 95 % confidence interval coverage, and that trimming beyond the optimal level increases bias. Another way to deal with extreme weights is to use a weighted average of a parametric model (such as an MLR model) and RF as the generalized propensity score estimator [31]. This so-called data-adaptive matching score is
$$\displaystyle{ \hat{r}(t\vert X)_{\mathrm{DAMS}} =\lambda \hat{ r}(t\vert X)_{\mathrm{MLR}} + (1-\lambda )\hat{r}(t\vert X)_{\mathrm{RF}} }$$
(6.5)
where
$$\displaystyle{ \lambda = \frac{\hat{r}(t\vert X)_{\mathrm{MLR}}^{A(t)}[1 -\hat{ r}(t\vert X)_{\mathrm{MLR}}]^{1-A(t)}} {\hat{r}(t\vert X)_{\mathrm{MLR}}^{A(t)}[1 -\hat{ r}(t\vert X)_{\mathrm{MLR}}]^{1-A(t)} + \hat{r}(t\vert X)_{\mathrm{RF}}^{A(t)}[1 -\hat{ r}(t\vert X)_{\mathrm{RF}}]^{1-A(t)}} }$$
(6.6)

As explained by Zhu et al. [31], the intuition behind this approach comes from the fact that there is a trade-off in bias and variance between parametric and nonparametric approaches. By combining the two, both bias and variance of the estimated causal effects will be reduced. The choice of λ in (6.6) gives more weight to the estimate that is closer to the observed value of A(t), so it trims extreme weights to more reasonable values without ad hoc adjustment. In addition, because of the MLR component, the combined score cannot attain 0 or 1.
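A direct transcription of (6.5) and (6.6) for a single subject and treatment level is given below as a sketch. The function name `dams_score` and the numerical inputs are illustrative assumptions; the MLR score is assumed to lie strictly between 0 and 1, so the denominator of λ is never zero.

```python
def dams_score(r_mlr, r_rf, a):
    """Data-adaptive matching score (6.5) with lambda from (6.6).

    r_mlr : MLR estimate of r(t | X), assumed strictly inside (0, 1)
    r_rf  : random forest estimate of r(t | X), possibly 0 or 1
    a     : observed indicator A(t), either 0 or 1
    """
    lik_mlr = r_mlr if a == 1 else 1.0 - r_mlr
    lik_rf = r_rf if a == 1 else 1.0 - r_rf
    lam = lik_mlr / (lik_mlr + lik_rf)
    return lam * r_mlr + (1.0 - lam) * r_rf

# The forest gives a score of 0 although the subject did receive treatment t:
# all weight moves to the MLR estimate.
treated_case = dams_score(0.3, 0.0, a=1)

# Same scores for a subject who did not receive treatment t: the forest is
# closer to the observed indicator, but the result still stays above 0.
untreated_case = dams_score(0.3, 0.0, a=0)
```

Here `treated_case` equals the MLR score 0.3, while `untreated_case` is pulled toward the forest estimate without reaching 0.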

## 3 Propensity Score Estimation for a Continuous Treatment

Finally, we are going to focus on the case when the treatment variable is continuous. In this case, we are interested in estimating the so-called dose–response function: μ(t) = E[Yi(t)]. We assume Yi(t) is well defined for t ∈ τ, where τ = [t0, t1].

To draw causal inference, we assume the ignorability assumption:
$$\displaystyle{f(t\vert Y (t),X) = f(t\vert X),\quad \mbox{ for}\quad t \in \tau,}$$
where f(t | ⋅ ) refers to the conditional density. In other words, we assume the vector of covariates X includes all the real confounders that may jointly affect the treatment and the potential outcomes.
In the continuous treatment case, the generalized propensity score is defined as $$r(t\vert X) \equiv f_{T\vert X}(t\vert X)$$, which is the conditional density of the treatment evaluated at level t given the covariates [10]. The ignorability assumption also implies
$$\displaystyle{f(t\vert Y (t),r(t\vert X)) = f(t\vert r(t\vert X)),\quad \mbox{ for}\quad t \in \tau.}$$
That is, to adjust for confounding, it is sufficient to condition on the generalized propensity scores instead of conditioning on the vector of covariates. In the literature, Robins et al. [20, 21] propose inverse probability weighting based on the marginal structural model to estimate the dose–response function. To obtain consistent estimation, the inverse probability weight for subject i is
$$\displaystyle{ w_{i} = \frac{r(T_{i})} {r(T_{i}\vert X_{i})}\quad \mathrm{for}\quad i = 1,\ldots,n. }$$
(6.7)

The numerator r(Ti) is the marginal density of the treatment evaluated at Ti. The estimation of the conditional density function (generalized propensity score) in the denominator, however, is a non-trivial problem because, when X is high-dimensional, traditional nonparametric approaches for estimating a conditional density (e.g., [4]) suffer from the curse of dimensionality.

### 3.1 Parametric Approaches

Robins et al. [21] proposed a two-step approach to estimate r(Ti | Xi). The treatment variable T is assumed to follow a parametric model:
$$\displaystyle{ T = X^{'}\beta +\epsilon,\quad \epsilon \sim N(0,\sigma ^{2}). }$$
(6.8)
The generalized propensity score can be estimated by first regressing Ti on Xi, i = 1, …, n, to obtain $$\hat{T }_{i}$$ and $$\hat{\sigma }$$. Then, the residuals $$\hat{\epsilon _{i}} = T_{i} -\hat{T } _{i}$$, i = 1, …, n, are calculated and r(Ti | Xi) can be approximated by
$$\displaystyle{ \hat{r}(T_{i}\vert X_{i}) \approx f(\hat{\epsilon _{i}}) \approx \frac{1} {\sqrt{2\pi }\hat{\sigma }}\exp \left \{-\frac{\hat{\epsilon _{i}}^{2}} {2\hat{\sigma }^{2}}\right \},\quad i = 1,\ldots,n. }$$
(6.9)
Note that if the normality assumption in (6.8) does not appear to hold (which can be checked from the data), we can instead employ a nonparametric density estimation approach, such as kernel density estimation, to estimate r(Ti | Xi) from the residuals $$\hat{\epsilon _{i}},i = 1,\ldots,n$$.
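The two-step estimate (6.8)-(6.9) and the weights (6.7) can be sketched for the simplest case of one covariate plus an intercept. The function names, the residual degrees of freedom (n − 2), the use of a normal density for the marginal numerator, and the toy data are all assumptions made for this illustration.

```python
import math

def normal_pdf(z, mean, sd):
    return math.exp(-((z - mean) ** 2) / (2.0 * sd ** 2)) / (math.sqrt(2.0 * math.pi) * sd)

def gps_linear_normal(x, T):
    """Two-step estimate (6.8)-(6.9) with a single covariate and an intercept."""
    n = len(x)
    x_bar, t_bar = sum(x) / n, sum(T) / n
    slope = (sum((xi - x_bar) * (ti - t_bar) for xi, ti in zip(x, T))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = t_bar - slope * x_bar
    resid = [ti - (intercept + slope * xi) for xi, ti in zip(x, T)]
    # Residual standard deviation with n - 2 degrees of freedom (two coefficients).
    sigma = math.sqrt(sum(e * e for e in resid) / (n - 2))
    return [normal_pdf(e, 0.0, sigma) for e in resid]

def stabilized_weights(x, T):
    """Weights (6.7); the marginal density r(T_i) is taken to be normal here."""
    n = len(T)
    t_bar = sum(T) / n
    sd_t = math.sqrt(sum((ti - t_bar) ** 2 for ti in T) / (n - 1))
    gps = gps_linear_normal(x, T)
    return [normal_pdf(ti, t_bar, sd_t) / g for ti, g in zip(T, gps)]

# Toy data (hypothetical): residuals are -0.1, 0.3, -0.3, 0.1.
x = [0.0, 1.0, 2.0, 3.0]
T = [0.5, 1.5, 1.5, 2.5]
gps = gps_linear_normal(x, T)
weights = stabilized_weights(x, T)
```

Subjects with small residuals receive a large estimated density and therefore a small weight; replacing `normal_pdf` in the first function by a kernel density estimate of the residuals gives the nonparametric variant mentioned above.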

### 3.2 Machine Learning Techniques

In practice, to ensure there are no unmeasured confounders, researchers usually collect a large number of covariates. When X is high-dimensional, the parametric model (6.8) may not hold. A more general approach is to assume
$$\displaystyle{ T = m(X)+\epsilon,\quad \epsilon \sim N(0,\sigma ^{2}). }$$
(6.10)
where m(X) = E(T | X) and we employ a nonparametric approach to estimate the mean function.
In [30], we advocate a machine learning algorithm, boosting, to estimate m(X). The boosting model for a continuous response variable can be represented as
$$\displaystyle{ m(X) =\sum _{ m=1}^{M}\sum _{ j=1}^{K_{m} }c_{mj}I\{X \in R_{mj}\}, }$$
(6.11)
where M is the total number of trees, Km is the number of terminal nodes for the mth tree, Rmj is a rectangular region in the feature space spanned by X, and cmj is the predicted constant in region Rmj. Km and Rmj are determined by optimizing some nonparametric information criterion, such as entropy, misclassification rate, or the Gini index. cmj is simply the average value of Ti in the training data that falls in the region Rmj. Details about how to construct a classification/regression tree can be found in [2].
In boosting, M is a tuning parameter. If M is too large, the model tends to overfit, resulting in a large variance; if M is too small, bias will occur. In [30], we propose an innovative criterion to determine the value of M. Notice that in the inverse probability weighting approach, if subject i receives a weight wi as in (6.7), the subject will be replicated wi − 1 times in the weighted pseudo sample. In the weighted sample, if the propensity scores are correctly estimated, the treatment assignment and the covariates are supposed to be unconfounded under the ignorability assumption [21]. Therefore, a reasonable criterion is to stop the algorithm at the number of trees such that the treatment assignment and the covariates are independent (unconfounded) in the weighted sample. Based on this idea, we propose the following procedure to determine the optimal number of trees in [30]:
1. Calculate $$\hat{r}(T_{i}\vert X_{i})$$ using boosting with M trees. Then, calculate
$$\displaystyle{w_{i} = \frac{\hat{r}(T_{i})} {\hat{r}(T_{i}\vert X_{i})}\quad \mathrm{for}\quad i = 1,\ldots,n,}$$
where $$\hat{r}(T_{i})$$ is estimated by a normal density.
2. For the jth covariate, denoted as Xj, calculate the weighted correlation coefficient between T and Xj using the weights wi, i = 1, …, n, obtained in the first step and denote it as $$\bar{d}_{j}$$.
3. Average the absolute value of $$\bar{d}_{j}$$ over all the covariates and get the average absolute correlation coefficient (AACC).

For each value of M = 1, 2, …, 20,000, calculate the AACC and find the optimal number of trees, i.e., the one that leads to the smallest AACC value. In step 2, we employ a bootstrapping approach to obtain the weighted correlation coefficient. Also, we advocate the distance correlation coefficient [26, 27] over other correlation metrics. The reason is that the distance correlation takes values between zero and one, and it equals zero if and only if T and Xj are independent, regardless of the type of Xj. The R code for calculating AACC is displayed in the Appendix of [30]. After the value of M is determined, the generalized propensity score is estimated by (6.11). More details of this approach can be found in [30].
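Steps 2 and 3 can be illustrated with a simplified sketch. It replaces the bootstrapped distance correlation advocated in [30] by a directly computed weighted Pearson correlation, which is a swapped-in and weaker measure (it only detects linear dependence). The function names and toy data are assumptions.

```python
import math

def weighted_corr(t, x, w):
    """Weighted Pearson correlation between t and x with weights w."""
    sw = sum(w)
    mean_t = sum(wi * ti for wi, ti in zip(w, t)) / sw
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
    cov = sum(wi * (ti - mean_t) * (xi - mean_x) for wi, ti, xi in zip(w, t, x)) / sw
    var_t = sum(wi * (ti - mean_t) ** 2 for wi, ti in zip(w, t)) / sw
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / sw
    return cov / math.sqrt(var_t * var_x)

def aacc(t, covariates, w):
    """Average absolute (weighted) correlation between t and each covariate."""
    return sum(abs(weighted_corr(t, xj, w)) for xj in covariates) / len(covariates)

# Toy data (hypothetical): the first covariate is perfectly correlated with T,
# the second is uncorrelated with T.
T = [1.0, 2.0, 3.0, 4.0]
covariates = [[2.0, 4.0, 6.0, 8.0], [1.0, -1.0, -1.0, 1.0]]
unit_weights = [1.0, 1.0, 1.0, 1.0]
value = aacc(T, covariates, unit_weights)
```

To select the number of trees, one would recompute the weights for each candidate M and keep the M with the smallest `aacc` value.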

## 4 Propensity Score Evaluation

Given the buffet of methods available to researchers, it is important to select the best one among all the candidate propensity score models. On the other hand, it is commonly accepted that there is no uniformly best procedure for all the datasets. In this section, we briefly talk about how to evaluate a propensity score model and how to choose an optimal one among several candidate models. We are going to focus on the binary treatment case. One way to evaluate the performance of different propensity score models is to see how close the estimates are to the true propensity scores using simulations. However, Hirano et al. [7] and Lunceford and Davidian [15] showed that conditioning on the estimated propensity score rather than the true propensity score can yield smaller variance of the estimated causal effects. That is, even when the propensity score is estimated more accurately, it does not necessarily yield better causal inference estimates.

### 4.1 Evaluation by Checking Balance

One commonly accepted practice is to check balance after the propensity scores are estimated. The underlying idea is that if the propensity score is correctly estimated, the covariates should be distributed almost the same among different treatment groups. There are many ways to evaluate balance in the covariates and it also depends on the particular approach employed to estimate the causal treatment effect. For example, in inverse probability weighting, we may look at the absolute standardized mean difference (ASMD) in the covariates. For a single covariate X, the standardized mean difference is defined as
$$\displaystyle{ d = \frac{\bar{X}_{\mathrm{treated}}^{w} -\bar{ X}_{\mathrm{control}}^{w}} {\sqrt{(s_{\mathrm{treated } }^{2 } + s_{\mathrm{control } }^{2 })/2}}, }$$
(6.12)
where $$s_{\mathrm{treated}}$$ is the standard deviation of X in the treatment group and $$s_{\mathrm{control}}$$ is the standard deviation of X in the control (untreated) group; $$\bar{X}_{\mathrm{treated}}^{w}$$ is the weighted average of X in the treatment group and $$\bar{X}_{\mathrm{control}}^{w}$$ is the weighted average of X in the control group. When estimating ATE,
$$\displaystyle{\bar{X}_{\mathrm{treated}}^{w} = \frac{\sum _{i=1}^{n}X_{ i}T_{i}/\hat{r}_{i}} {\sum _{i=1}^{n}T_{i}/\hat{r}_{i}},}$$
where $$\hat{r}_{i} =\hat{ r}(X_{i})$$ is the estimated propensity score for subject i, i = 1, …, n, and
$$\displaystyle{\bar{X}_{\mathrm{control}}^{w} = \frac{\sum _{i=1}^{n}X_{ i}(1 - T_{i})/(1 -\hat{ r}_{i})} {\sum _{i=1}^{n}(1 - T_{i})/(1 -\hat{ r}_{i})}.}$$
When estimating ATT,
$$\displaystyle{\bar{X}_{\mathrm{treated}}^{w} = \frac{\sum _{i=1}^{n}X_{ i}T_{i}} {\sum _{i=1}^{n}T_{i}},}$$
and
$$\displaystyle{\bar{X}_{\mathrm{control}}^{w} = \frac{\sum _{i=1}^{n}X_{ i}(1 - T_{i})\hat{r}_{i}/(1 -\hat{ r}_{i})} {\sum _{i=1}^{n}(1 - T_{i})\hat{r}_{i}/(1 -\hat{ r}_{i})}.}$$
In some literature, the denominator in (6.12) is replaced by $$s_{\mathrm{treated}}$$. We then look at the mean/median/maximum value of the ASMD among the covariates, and the propensity score model that leads to the smallest value is usually claimed as the best model.
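For ATE weighting, the ASMD in (6.12) can be computed as in the following sketch. It uses the unweighted sample standard deviations of each group in the denominator, as in the text; the function names and the toy data are illustrative assumptions.

```python
import math
import statistics

def asmd_ate(x, T, r_hat):
    """Absolute standardized mean difference (6.12) under ATE weights."""
    w_treated = [t / r for t, r in zip(T, r_hat)]
    w_control = [(1 - t) / (1.0 - r) for t, r in zip(T, r_hat)]
    mean_treated = sum(w * xi for w, xi in zip(w_treated, x)) / sum(w_treated)
    mean_control = sum(w * xi for w, xi in zip(w_control, x)) / sum(w_control)
    var_treated = statistics.variance([xi for xi, t in zip(x, T) if t == 1])
    var_control = statistics.variance([xi for xi, t in zip(x, T) if t == 0])
    return abs(mean_treated - mean_control) / math.sqrt((var_treated + var_control) / 2.0)

def summarize_balance(covariates, T, r_hat):
    """Mean, median, and maximum ASMD across a list of covariates."""
    values = [asmd_ate(xj, T, r_hat) for xj in covariates]
    return statistics.mean(values), statistics.median(values), max(values)

# Toy data (hypothetical): constant propensity scores leave the covariate
# means at 1.5 (treated) and 3.5 (control).
x = [1.0, 2.0, 3.0, 4.0]
T = [1, 1, 0, 0]
r_hat = [0.5, 0.5, 0.5, 0.5]
d = asmd_ate(x, T, r_hat)
```

Candidate propensity score models can then be compared by passing each model's estimated scores to `summarize_balance`.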

Other criteria to evaluate the balance in the covariates include Kolmogorov–Smirnov statistic [17], t-test statistic [6], and c statistic. Recently, an innovative prognostic score-based balance measurement has been proposed by Stuart et al. [25], which accounts for the information in the outcome variable while checking balance. The approach works as follows: first, a model of the outcome on the covariates is fitted and the predicted outcome if untreated is calculated for each subject in the study, which is termed the prognostic score. Then, the weighted ASMD in the prognostic score is calculated as a measure of balance. The authors show in a comprehensive simulation study that this measurement outperforms the other balance measurements, such as mean/median/maximum ASMD and KS statistic, in the sense that it is highly correlated with the bias in the estimated causal treatment effect.

### 4.2 Evaluation Based on a Two-Stage Procedure

In the propensity score-based approaches, we may treat the estimation of propensity scores as the first stage and the estimation of the causal treatment effect using matching, stratification, or inverse probability weighting as the second stage. The estimated propensity score can be treated as the input into the second stage. While evaluating a propensity score model, we should focus on the quality of the estimates in the second stage rather than the first stage. The two-stage causal inference procedure also fits the model structure discussed by Brookhart and van der Laan [3]. We denote the causal effect as ψ, which is the parameter of interest, and the propensity score as η, which is the nuisance parameter. Assuming we have K different candidate models for estimating η, we aim to choose the optimal one in terms of estimating ψ. Denote the resulting estimates of ψ from the K candidate models as $$\hat{\psi }_{1}(X)$$, …, $$\hat{\psi }_{K}(X)$$, and assume there exists an approximately unbiased but highly variable estimate of ψ, denoted as $$\hat{\psi }_{0}(X)$$. The model used to estimate η in $$\hat{\psi }_{0}(X)$$ is regarded as the reference model. To account for the fact that there is a trade-off between bias and variance while estimating ψ, Brookhart and van der Laan [3] proposed a cross-validation criterion for selecting the optimal estimator of the nuisance parameter among the K candidate models. Letting $$X_{v}^{0}$$ be the training sample and $$X_{v}^{1}$$ be the testing sample in the vth iteration of the Monte Carlo cross-validation, the criterion function is defined as follows:
$$\displaystyle{C_{v}(k) = \frac{1} {V }\sum _{v=1}^{V }(\hat{\psi }_{ k}(X_{v}^{0}) -\hat{\psi }_{ 0}(X_{v}^{1}))^{2}.}$$
The optimal model for estimating the propensity scores is then chosen to be the one that yields the smallest $$C_{v}(k)$$ among the K models. Brookhart and van der Laan [3] proved that the model selected by the Monte Carlo cross-validation criterion leads to the smallest mean squared error of the estimate of the parameter of interest. This approach has been adopted to compare different propensity score models in [33], in which an over-fitted logistic regression model using all the available covariates is treated as the reference propensity score model to obtain $$\hat{\psi}_{0}(X)$$.
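The selection rule can be sketched in a few lines. This is a toy illustration under stated assumptions, not the procedure as implemented in [3] or [33]: the second stage is a normalized inverse probability weighting estimate of the ATE, the candidate propensity score models are smoothed treated fractions within cells of a single discrete covariate, the reference model is the finest of these, and the function names `cell_ps`, `ipw_ate`, and `mccv_criterion` and the simulated data are hypothetical.

```python
import random


def cell_ps(key):
    """Propensity model: smoothed treated fraction within cells key(x)."""
    def fit(sample):
        counts = {}
        for _, t, x in sample:
            n1, n = counts.get(key(x), (0, 0))
            counts[key(x)] = (n1 + t, n + 1)

        def ps(x):
            n1, n = counts.get(key(x), (0, 0))
            return (n1 + 0.5) / (n + 1.0)  # smoothing keeps ps strictly in (0, 1)
        return ps
    return fit


def ipw_ate(sample, ps_model):
    """Second stage: normalized IPW estimate of the ATE from (y, t, x) tuples."""
    ps = ps_model(sample)  # first stage, fitted on the same sample
    num1 = den1 = num0 = den0 = 0.0
    for y, t, x in sample:
        p = ps(x)
        if t == 1:
            num1 += y / p
            den1 += 1 / p
        else:
            num0 += y / (1 - p)
            den0 += 1 / (1 - p)
    return num1 / den1 - num0 / den0


def mccv_criterion(data, candidates, reference, V=50, train_frac=0.5, seed=0):
    """Average over V random splits of (psi_k(train) - psi_0(test))^2."""
    rng = random.Random(seed)
    n_train = int(train_frac * len(data))
    totals = [0.0] * len(candidates)
    for _ in range(V):
        shuffled = data[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        psi0 = ipw_ate(test, reference)  # reference estimate on the testing sample
        for k, model in enumerate(candidates):
            totals[k] += (ipw_ate(train, model) - psi0) ** 2
    return [s / V for s in totals]


# Simulated data (hypothetical): x in {0, 1, 2, 3} confounds t and y; true ATE is 1.
rng = random.Random(1)
data = []
for _ in range(600):
    x = rng.randrange(4)
    t = 1 if rng.random() < 0.2 + 0.2 * x else 0
    data.append((1.0 * t + 2.0 * x + rng.gauss(0, 1), t, x))

candidates = [
    cell_ps(lambda x: 0),       # ignores x entirely
    cell_ps(lambda x: x >= 2),  # coarsened x
    cell_ps(lambda x: x),       # one cell per level of x
]
reference = cell_ps(lambda x: x)
crit = mccv_criterion(data, candidates, reference)
best = min(range(len(crit)), key=crit.__getitem__)
print("criterion values:", [round(c, 3) for c in crit])
print("selected candidate index:", best)
```

Passing the candidates as fitting functions keeps `mccv_criterion` independent of how the propensity scores are modeled, so the same loop could wrap logistic regression or boosting models in the first stage.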

### References

1. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
2. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Chapman & Hall/CRC, Boca Raton, FL (1984)
3. Brookhart, M.A., van der Laan, M.J.: A semiparametric model selection criterion with applications to the marginal structural model. Comput. Stat. Data Anal. 50(2), 475–498 (2006)
4. Hall, P., Wolff, R.C.L., Yao, Q.: Methods for estimating a conditional distribution function. J. Am. Stat. Assoc. 94(445), 154–163 (1999)
5. Hansen, L.P.: Large sample properties of generalized method of moments estimators. Econometrica 50(4), 1029–1054 (1982)
6. Hirano, K., Imbens, G.W.: Estimation of causal effects using propensity score weighting: an application to data on right heart catheterization. Health Serv. Outcome Res. Methodol. 2(3), 259–278 (2001)
7. Hirano, K., Imbens, G.W., Ridder, G.: Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71(4), 1161–1189 (2003)
8. Imai, K., Ratkovic, M.: Covariate balancing propensity score. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 76(1), 243–263 (2014)
9. Imai, K., Van Dyk, D.A.: Causal inference with general treatment regimes. J. Am. Stat. Assoc. 99(467), 854–866 (2004)
10. Imbens, G.W.: The role of the propensity score in estimating dose-response functions. Biometrika 87(3), 706–710 (2000)
11. Kang, J.D.Y., Schafer, J.L.: Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat. Sci. 22(4), 523–539 (2007)
12. Lechner, M.: Program heterogeneity and propensity score matching: an application to the evaluation of active labor market policies. Rev. Econ. Stat. 84(2), 205–220 (2002)
13. Lee, B.K., Lessler, J., Stuart, E.A.: Improving propensity score weighting using machine learning. Stat. Med. 29(3), 337–346 (2010)
14. Lee, B.K., Lessler, J., Stuart, E.A.: Weight trimming and propensity score weighting. PLoS ONE 6(3), e18174 (2011)
15. Lunceford, J.K., Davidian, M.: Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat. Med. 23(19), 2937–2960 (2004)
16. McCaffrey, D.F., Ridgeway, G., Morral, A.R.: Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychol. Methods 9(4), 403–425 (2004)
17. McCaffrey, D.F., Griffin, B.A., Almirall, D., Slaughter, M.E., Ramchand, R., Burgette, L.F.: A tutorial on propensity score estimation for multiple treatments using generalized boosted models. Stat. Med. 32(19), 3388–3414 (2013)
18. Pregibon, D.: Resistant fits for some commonly used logistic models with medical applications. Biometrics 38(2), 485–498 (1982)
19. Ridgeway, G., McCaffrey, D., Morral, A., Burgette, L., Griffin, B.A.: Toolkit for weighting and analysis of nonequivalent groups: a tutorial for the twang package. R vignette. RAND (2015)
20. Robins, J.M.: Association, causation, and marginal structural models. Synthese 121(1), 151–179 (1999)
21. Robins, J.M., Hernán, M.Á., Brumback, B.: Marginal structural models and causal inference in epidemiology. Epidemiology 11(5), 550–560 (2000)
22. Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41–55 (1983)
23. Rubin, D.B.: Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 66(5), 688–701 (1974)
24. Setoguchi, S., Schneeweiss, S., Brookhart, M.A., Glynn, R.J., Cook, E.F.: Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiol. Drug Saf. 17(6), 546–555 (2008)
25. Stuart, E.A., Lee, B.K., Leacy, F.P.: Prognostic score–based balance measures can be a useful diagnostic for propensity score methods in comparative effectiveness research. J. Clin. Epidemiol. 66(8), S84–S90 (2013)
26. Székely, G.J., Rizzo, M.L.: Brownian distance covariance. Ann. Appl. Stat. 32(8), 1236–1265 (2009)
27. Székely, G.J., Rizzo, M.L., Bakirov, N.K.: Measuring and testing dependence by correlation of distances. Ann. Stat. 35(6), 2769–2794 (2007)
28. Tchernis, R., Horvitz-Lennon, M., Normand, S.L.T.: On the use of discrete choice models for causal inference. Stat. Med. 24(14), 2197–2212 (2005)
29. Venables, W.N., Ripley, B.D.: Modern Applied Statistics with S, 4th edn. Springer, New York (2002). ISBN 0-387-95457-0
30. Zhu, Y., Coffman, D.L., Ghosh, D.: A boosting algorithm for estimating generalized propensity scores with continuous treatments. J. Causal Inference 3(1), 25–40 (2015)
31. Zhu, Y., Ghosh, D., Mitra, N., Mukherjee, B.: A data-adaptive strategy for inverse weighted estimation of causal effect. Health Serv. Outcome Res. Methodol. 14(3), 69–91 (2014)
32. Zhu, Y., Schonbach, M., Coffman, D.L., Williams, J.S.: Variable selection for propensity score estimation via balancing covariates. Epidemiology 26(2), e14–e15 (2015)
33. Zhu, Y., Ghosh, D., Coffman, D.L., Savage, J.S.: Estimating controlled direct effects of restrictive feeding practices in the ‘early dieting in girls’ study. J. R. Stat. Soc. Ser. C (Appl. Stat.) 65(1), 115–130 (2016)