Randomized controlled trials are considered by many to provide one of the strongest forms of evidence.1 The goal of randomization is to balance treatment groups on any confounding factors (whether observed or unobserved), eliminating treatment selection bias and ensuring that the groups are comparable. But when randomization is impractical, unethical, or impossible, nonrandomized observational studies may be useful. However, these studies are subject to treatment selection bias, as patient covariates often confound treatment selection. For example, when determining a course of treatment, an oncologist may choose to pursue more aggressive therapies only in patients with advanced tumors; a nonrandomized study comparing the effectiveness of cancer therapies is thus going to be confounded by the fact that patients receiving aggressive therapies are likely to have worse prognoses and are therefore not comparable to those receiving less aggressive therapies. Propensity score matching is a statistical procedure for reducing this bias by assembling a sample in which confounding factors are balanced between treatment groups. The paper by Nappi et al.2 published in this issue provides an example of this approach.

In a simple randomized trial, subjects in different treatment groups are comparable because all subjects have the same probability of being assigned to a particular treatment condition. However, in a nonrandomized study, a subject’s probability of receiving a treatment is not known and will depend on both his or her observed and unobserved covariates. The propensity score was first proposed by Rosenbaum and Rubin as an estimate of a subject’s probability of receiving a treatment given that subject’s observed baseline covariates.3,4 The key assumption underlying propensity score analysis is that, because the propensity score is estimated using observed baseline covariates, subjects whose propensity scores are equal will have similar baseline covariate values and thus be comparable. Another important assumption necessary for propensity score analysis is that there are no unmeasured confounders. That is, we assume that all factors that might affect treatment assignment and/or the outcome of interest have been observed. The presence of an unmeasured confounder can lead to biased results (see below).

Logistic regression is the most commonly used method for estimating the propensity score,5 although more sophisticated data analysis methods are gaining popularity (see Westreich et al.6 for a discussion of alternatives). In the propensity score model, the dependent variable is the logit (log odds) of the probability of receiving a particular treatment; baseline covariates, particularly any that may be confounders for both treatment selection and the outcome of interest, are included as independent variables.7 A propensity score for each subject in the study is then found by using the fitted model to estimate the probability of receiving the treatment given that subject’s baseline covariates. Once a propensity score for each subject has been estimated, subjects are matched using the propensity scores in order to create a balanced sample.
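To make the estimation step concrete, the following is a minimal sketch in Python using only the standard library. It fits a logistic regression by plain gradient ascent and then evaluates the fitted model at each subject's covariates. The covariate names (`age_z`, `stage_z`), the data values, the learning rate, and the number of iterations are all hypothetical choices made for illustration; in practice an established statistical package would normally be used for the fit.

```python
import math

def predict(beta, x):
    """Estimated probability of treatment for covariate vector x."""
    z = beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, t, lr=0.1, n_iter=5000):
    """Fit logistic regression by gradient ascent on the log-likelihood.

    Returns [intercept, coef_1, ..., coef_p]. The step size and iteration
    count are illustrative assumptions, not tuned values.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, ti in zip(X, t):
            resid = ti - predict(beta, xi)
            grad[0] += resid
            for j, xj in enumerate(xi):
                grad[j + 1] += resid * xj
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Hypothetical standardized baseline covariates: (age_z, stage_z).
X = [(-0.5, 0.4), (0.2, 1.2), (1.0, 0.4), (0.8, -0.4), (-0.2, 1.2),
     (1.5, -0.4), (-0.8, -1.2),                     # received new treatment
     (-1.0, -1.2), (-0.3, -0.4), (0.5, -1.2), (-1.2, -0.4), (0.1, 0.4),
     (0.9, -1.2), (0.6, 1.2)]                       # received control
t = [1] * 7 + [0] * 7

beta = fit_logistic(X, t)
scores = [predict(beta, xi) for xi in X]  # one propensity score per subject

for xi, ti, s in zip(X, t, scores):
    print(f"covariates={xi} treated={ti} propensity={s:.3f}")
```

Each entry of `scores` is the estimated probability that the corresponding subject receives the new treatment given the two covariates, which is the quantity used in the matching step.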

As a simple example, suppose that an observational study has been conducted comparing survival times for subjects receiving either a new treatment or control (i.e., standard of care). To estimate a propensity score for each subject, we would first fit a logistic regression model to estimate the effect of selected baseline covariates on the probability of receiving the new treatment. Next, propensity scores for each subject would be calculated by plugging that subject’s covariate values into the estimated regression equation to find the subject’s estimated probability of receiving the new treatment. The propensity score-matched sample would then be constructed. For each subject receiving the new treatment, one (for a 1-to-1 match) or multiple (for a many-to-1 match) control subject(s) whose propensity score(s) were equal or close to the propensity score of the treated subject would be chosen as matches for that subject.
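The matching step in this example can be sketched as follows. This is one possible variant, a greedy 1-to-1 nearest-neighbor match without replacement, chosen here only for illustration; the text above does not prescribe a particular algorithm. The subject identifiers and propensity scores are made up.

```python
def match_one_to_one(treated_scores, control_scores):
    """Greedy 1-to-1 nearest-neighbor matching on the propensity score.

    Each treated subject, in the order given, is paired with the still
    unmatched control whose propensity score is closest. Controls are used
    at most once (matching without replacement, an assumption of this sketch).
    """
    available = dict(control_scores)
    pairs = {}
    for t_id, t_score in treated_scores.items():
        if not available:
            break  # no controls left to match
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        pairs[t_id] = c_id
        del available[c_id]
    return pairs

# Hypothetical estimated propensity scores.
treated_scores = {"T1": 0.62, "T2": 0.35, "T3": 0.80}
control_scores = {"C1": 0.30, "C2": 0.65, "C3": 0.78, "C4": 0.10, "C5": 0.50}

pairs = match_one_to_one(treated_scores, control_scores)
print(pairs)
```

With a greedy procedure the result can depend on the order in which treated subjects are processed; optimal matching methods avoid this at the cost of more computation.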

Other approaches to propensity score matching are available; for example, instead of matching on the propensity score itself, the propensity score may simply be used to narrow down the pool of potential matches. That is, only control subjects whose propensity scores are within a pre-specified range (or “caliper”) of the propensity score of the treated subject are considered as possible matches. From this subset, the control subject whose covariate values are “closest” to those of the treated subject (according to some measure of distance) is matched with that subject.8 Nappi et al.2 followed this approach, using a caliper of 0.25 times the standard deviation of the propensity score and the Mahalanobis distance9 as their measure of closeness for the propensity scores. See D’Agostino for a thorough review of propensity score matching methods.10
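A minimal sketch of caliper matching with a Mahalanobis distance is given below. It is not a reconstruction of the procedure used by Nappi et al.; it assumes two hypothetical covariates (`age`, `bmi`), made-up data, a caliper of 0.25 times the standard deviation of the propensity scores computed over all subjects, a covariance matrix estimated from all subjects, and greedy matching without replacement.

```python
import math
import statistics

def inverse_covariance_2x2(points):
    """Inverse of the sample covariance matrix of two covariates."""
    n = len(points)
    mx = statistics.mean([p[0] for p in points])
    my = statistics.mean([p[1] for p in points])
    vx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    vy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = vx * vy - cxy ** 2
    return ((vy / det, -cxy / det), (-cxy / det, vx / det))

def mahalanobis(u, v, inv):
    """Mahalanobis distance between two 2-dimensional covariate vectors."""
    d0, d1 = u[0] - v[0], u[1] - v[1]
    return math.sqrt(d0 * (inv[0][0] * d0 + inv[0][1] * d1)
                     + d1 * (inv[1][0] * d0 + inv[1][1] * d1))

def caliper_match(treated, controls, caliper_factor=0.25):
    """Match each treated subject to the closest control within the caliper."""
    everyone = list(treated.values()) + list(controls.values())
    caliper = caliper_factor * statistics.stdev([ps for ps, _ in everyone])
    inv = inverse_covariance_2x2([cov for _, cov in everyone])
    available = dict(controls)
    matches = {}
    for t_id, (t_ps, t_cov) in treated.items():
        # Step 1: the propensity score only narrows the pool of candidates.
        pool = [c for c, (c_ps, _) in available.items()
                if abs(c_ps - t_ps) <= caliper]
        if not pool:
            matches[t_id] = None  # no control within the caliper
            continue
        # Step 2: choose the candidate with the closest covariate values.
        best = min(pool, key=lambda c: mahalanobis(t_cov, available[c][1], inv))
        matches[t_id] = best
        del available[best]
    return matches

# Hypothetical subjects: id -> (propensity score, (age, bmi)).
treated = {"T1": (0.60, (62, 27)), "T2": (0.30, (50, 24))}
controls = {"C1": (0.58, (70, 22)), "C2": (0.63, (63, 27)),
            "C3": (0.31, (49, 25)), "C4": (0.10, (40, 30)),
            "C5": (0.90, (75, 26))}

matches = caliper_match(treated, controls)
print(matches)
```

In this toy data set, control C1 has the propensity score closest to that of T1, but C2 also lies within the caliper and has much more similar covariate values, so the two-step rule selects C2.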

As discussed above, the goal of propensity score matching is to create a sample in which treatment groups are balanced on baseline covariates. Thus, an assessment of covariate balance in the matched sample is a crucial step of the analysis. Rather than conducting statistical tests comparing the covariate values in the two groups, it is recommended that the absolute standardized differences between groups for each covariate be examined.5 The absolute standardized difference (ASD) is defined as

$$ {\text{ASD}} = \frac{{\left| {{{\bar{x}}_{T}} - {{\bar{x}}_{C}} } \right|}}{{\sqrt {\frac{{{S_{T}^{2} }}}{2} + \frac{{{S_{C}^{2}} }}{2}} }}, $$

where \( \bar{x}_{T} \) and \( S_{T} \) are the sample mean and standard deviation, respectively, for subjects in the treated group and \( \bar{x}_{C} \) and \( S_{C} \) are the sample mean and standard deviation, respectively, for the control group. Larger values of ASD indicate greater imbalance in covariate values. Covariate balance may be assessed by comparing the ASD to a pre-specified threshold (<10% is a common choice).
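As an illustration of the formula, the short Python sketch below computes the ASD for a single continuous covariate. The age values are invented, and the sample variance with an n − 1 denominator is assumed for \( S_{T}^{2} \) and \( S_{C}^{2} \).

```python
import math
import statistics

def absolute_standardized_difference(treated, control):
    """ASD = |mean_T - mean_C| / sqrt((S_T^2 + S_C^2) / 2)."""
    pooled_sd = math.sqrt(
        (statistics.variance(treated) + statistics.variance(control)) / 2
    )
    return abs(statistics.mean(treated) - statistics.mean(control)) / pooled_sd

# Hypothetical ages in the treated group and in two candidate control groups.
age_treated = [60, 65, 70, 75]
age_control_unmatched = [50, 55, 60, 65]
age_control_matched = [61, 64, 71, 76]

asd_before = absolute_standardized_difference(age_treated, age_control_unmatched)
asd_after = absolute_standardized_difference(age_treated, age_control_matched)

THRESHOLD = 0.10  # the common 10% threshold mentioned in the text
print(f"before matching: {asd_before:.3f}, balanced: {asd_before < THRESHOLD}")
print(f"after matching:  {asd_after:.3f}, balanced: {asd_after < THRESHOLD}")
```

In a real analysis the same calculation would be repeated for every baseline covariate, before and after matching.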

An example of absolute standardized differences before and after propensity score matching is shown in Figure 1. We see that the ASDs for all covariates are smaller after propensity score matching and are all below the threshold of 10%, suggesting that the propensity score matching has balanced the treatment and control groups on these covariates.

Figure 1. Absolute standardized differences before and after propensity score matching

If covariate imbalance remains after the propensity score matching, the propensity score model should be revised, for example by adding interaction terms and/or other covariates. Once covariates are sufficiently balanced, statistical analysis is conducted using the matched sample.

While on average randomization will balance both observed and unobserved confounding factors, it is important to remember that propensity scores can only balance observed covariates. As a result, statistical inferences may still be subject to bias from unmeasured confounding variables.11 Sensitivity analyses should be conducted to assess how robust study conclusions are to the presence of an unmeasured confounder; see Liu et al. for an introduction to some of the available sensitivity analysis methods.12

Although we have limited our discussion here to propensity score matching, propensity scores may be used in other ways to adjust for covariate imbalance. Instead of using the propensity scores to create a balanced sample, analyses may be conducted on the full sample but with either weighting or stratifying by the propensity score. Another approach is to treat the propensity score as a covariate in regression analyses. For discussion and comparison of other propensity score analyses, see references 5 and 10.
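As a brief illustration of the weighting idea, the sketch below uses inverse probability of treatment weights, one common weighting scheme that is assumed here and not specified in the text: treated subjects are weighted by 1/e and control subjects by 1/(1 − e), where e is the propensity score. The scores and outcomes are made up.

```python
def iptw_weights(scores, treated_flags):
    """Inverse probability of treatment weights (an assumed weighting scheme)."""
    return [1.0 / e if t == 1 else 1.0 / (1.0 - e)
            for e, t in zip(scores, treated_flags)]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical propensity scores, treatment indicators, and outcomes.
scores = [0.80, 0.50, 0.25, 0.40]
treated_flags = [1, 1, 0, 0]
outcomes = [10.0, 14.0, 9.0, 11.0]

weights = iptw_weights(scores, treated_flags)

def group(values, flag):
    """Select the entries belonging to one treatment group."""
    return [v for v, t in zip(values, treated_flags) if t == flag]

# Weighted difference in mean outcome, computed on the full (unmatched) sample.
weighted_difference = (
    weighted_mean(group(outcomes, 1), group(weights, 1))
    - weighted_mean(group(outcomes, 0), group(weights, 0))
)
print(f"weights: {[round(w, 3) for w in weights]}")
print(f"weighted difference in mean outcome: {weighted_difference:.3f}")
```

Unlike matching, this approach keeps every subject in the analysis, although subjects with propensity scores near 0 or 1 can receive very large weights.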

In their paper, Nappi et al.2 use a propensity score-matched sample to compare left ventricular shape in diabetic and nondiabetic subjects. This approach allowed the authors to adjust for any observed baseline differences between diabetic and nondiabetic patients that may have confounded analyses of left ventricular shape. Although it cannot replace a true randomized trial, propensity score matching is a powerful tool for adjusting for confounding variables and reducing treatment selection bias.