1 Introduction

Propensity score matching was developed by Rosenbaum and Rubin (1983a, 1983b) and closely follows Rubin’s (1974) framework of potential outcomes. Matching estimators are intended to adjust a given sample of treated and untreated individuals to mimic a random assignment of individuals to a treatment and control groupFootnote 1. Therefore, when applying matching techniques, researchers are conducting a “hypothetical randomized experiment” (Rubin 1986). Nevertheless, statistics does not provide guidelines as to whether the sociological subject matter lends itself to such an interpretation; the researcher and reader must make this decision, and different research traditions may come to different conclusions (see also the discussion in Holland 1986). Matching estimators have been successfully applied to a variety of research questions, including those in sociology, economics, and other social science fields. Propensity score matching is one of the most commonly applied matching estimators in the field.

In contrast to regression methods, propensity score matching is considered a nonparametric method because it does not require the choice of a functional form. Indeed, as in randomized experiments, only a comparison of means is needed after matching on the propensity score. However, in most applications, the true propensity score is unknown and must be estimated from the data. Because the true propensity score is typically estimated using parametric estimators of treatment participation, such as probit and logit models, some scholars refer to propensity score matching as a semi-parametric method (Huber et al. 2012). As demonstrated by recent methodological studies, incorrect specification of the propensity score equation can lead to serious bias. Zhao (2008) investigated the effects of over- and under-specifying the propensity score equation on bias and found that, in either case, the causal effect “is insensitive to specification of the propensity score” (Zhao 2008, p. 313). However, Zhao’s own findings do not fully support this claim; rather, they indicate that matching without replacement on an under-specified propensity score induces bias in two out of three cases. In a Monte Carlo simulation, Millimet and Tchernis (2009) find that over-specifying does not induce bias, but under-specifying the propensity score equation does. In an additional empirical application using real data, Millimet and Tchernis found that the causal effect was sensitive to the specification of the propensity score.

Because the true propensity score is unknown in most applications, there is no way for applied researchers to know whether a model is misspecified. For example, applied researchers seldom acknowledge that estimating the propensity score by logistic regression follows a different set of rules than employing logistic regression to test hypotheses. Whereas applied researchers appear to believe that the estimating the propensity score coincides with explaining the choice for or against participation in the treatment group, methodologists Dehejia and Wahba (2002: 161) note that “the role of the propensity score is only to reduce the dimensions of the conditioning; as such it has no behavioral assumptions attached to it”. Thus, the usual goodness-of-fit tests do not provide meaningful information on how well the matching eliminates the influence of covariatesFootnote 2 (Rubin 2004). In fact, variable selection based on goodness-of-fit tests or model-building algorithms (e. g., forward step-wise regression) often lead to inefficient estimates (Brookhart et al. 2006). Similarly, certain caveats regarding, e. g., multicollinearity do not apply, and tests for significance are only marginally informative on whether to include a covariate in the estimation (c.f., Harding 2003; Rubin 2004). However, because specific guidelines regarding model specification in the context of propensity score matching are scarce, the topic is rarely addressed in applied research. The point we wish to make refers to an even more subtle aspect of model specification than which covariates to include. Here, we draw attention to the issue of correctly specifying the propensity score equation, rather than to the consequences of including or excluding specific covariates.

With this paper, we hope to contribute in several ways to the methodological literature on misspecification of the propensity score. First, whereas most methodological research focuses on the effect of misspecification on propensity score weighting (e. g. Setoguchi et al. 2008), our focus is on propensity score matching. Second, we specifically focus on investigating the performance of an algorithm proposed by Dehejia and Wahba (2002) that is intended to help applied researchers avoid misspecification of the propensity score equation. After its first implementation in the context of propensity score stratification, the algorithm has been severely criticized. In this paper, we propose a modified version in the context of propensity score matching. We argue that these modifications should eliminate the weaknesses of the algorithm while retaining its strengths; that is, it provides researchers with an easy-to-implement way of reducing the danger of bias due to misspecification. Third, we conduct a Monte Carlo simulation to test the performance of the modified Dehejia and Wahba (DW) algorithm. To our knowledge, the algorithm has not been tested in this way. In the Monte Carlo simulation, we compare two alternative proposals, a propensity score stratification approach proposed by Hong (2010) and an alternative augmentation algorithm proposed by Imbens and Rubin (2015).

The paper is structured as follows. Sect. 2 describes propensity score matching and underscores the importance of balancing tests. Sect. 3 introduces the Dehejia and Wahba (2002) algorithm and discusses its weaknesses. Sect. 4 introduces a modified version of the Dehejia–Wahba (2002) algorithm and two alternative estimators that also aim to avoid misspecification bias. Sect. 5 describes a Monte Carlo simulation performed to determine whether the modified DW algorithm avoids bias from misspecified propensity score equations. Sect. 5 presents the results, and Sect. 6 concludes the paper.

2 The propensity score tautology and the importance of balancing tests

In a randomized experiment, individuals are distributed into treatment and control groups, irrespective of their characteristics. Although randomized experiments are becoming increasingly common in many fields of research (e. g., Jackson and Cox 2013), they are often not feasible for reasons ranging from ethical problems to practical issues. In observational data, however, we must eliminate all other causally relevant factors of the outcome ex post to arrive at an unbiased comparison of outcomes from treated and untreated individuals. In other words, we must eliminate the bias introduced by confounding variables to ensure that the observed difference between the treatment and control groups reflects a causal effect. Regression estimators ideally address confounders by including them as conditioning variables and imposing functional form assumptions (e. g., linearity). Matching estimators are an alternative method of addressing confounders without imposing functional form assumptionsFootnote 3.

The simplest and most intuitive matching estimators rely on exact matching. For each treatment individual, an untreated individual must be identified (and vice versa). In exact matching, both individuals are characterized by identical values of the covariate vector, i. e., the vector of confounding variables, which is denoted \(\mathbf{x}\) hereinafter. Given that the subsequent analysis disregards all individuals for whom no matches can be found, exact matching clearly provides researchers with subsamples in which covariates are distributed equally in the treated and untreated groups. Thereafter, the treatment and control groups are considered “balanced”.

Exact matches are more difficult to obtain as more variables are controlled for and become impossible if at least one variable is continuous. Matching solely on the propensity score, rather than the confounders themselves, was developed by Rosenbaum and Rubin (1983a, 1983b) to solve this “curse of dimensionality” (Heckman et al. 1998). The propensity score collapses the information from several variables into a single scalar metric. Here, we focus on the average treatment effect on the treated (att), for which the matching estimator can be expressed as a weighted difference in means (Smith and Todd 2005a):

$$\widehat{att}=\hat{\delta }=\frac{1}{n_{1}}\sum _{i\in I_{1}\cap CS}t_{i}^{1}-\frac{1}{n_{1}}\sum _{i\in I_{1}\cap CS}\sum _{j\in I_{0}\cap CS}w (i,j)t_{j}^{0},$$

where t is a binary treatment indicator; t = 1 denotes the treatment status, and t = 0 denotes the control status. \(I_{1}\) and \(I_{0}\) are individuals in the treatment and control groups, respectively; CS denotes the region of common support in the propensity score distributions of both groups; \(n_{1}\) is the number of individuals in the treatment group within the region of common support; and \(w(i,j)\) is the weight given to observation \(j\) when it is matched to observation \(i\). Different versions of the matching estimators can be built, depending on the choice of \(w(i,j)\). Single nearest-neighbor matching (SNNM) is the most intuitive matching algorithm. In SNNM without replacement, observation \(j\) is chosen as a match for observation i when it is closest to \(i\) in terms of the absolute distance between their propensity scores \(|p (\boldsymbol{x}_{\boldsymbol{i}} )-p (\boldsymbol{x}_{\boldsymbol{j}} )|\). SNNM then weighs the outcome of observation \(j\), whose propensity score is closest to that of observation \(i\), with \(w(i,j)=1\) and assigns a weight of \(w (i,j)=0\) to all other control observations, and the causal effect is then calculated. In multiple nearest-neighbor matching (MNNM), a weighted average of two or more observations \(j\) is built for each treatment individual. Often, a maximum acceptable distance (caliper) is set to avoid matches in which \(p(\boldsymbol{x}_{\boldsymbol{j}})\) is extremely far from \(p(\boldsymbol{x}_{\boldsymbol{i}})\), even though it is the nearest neighbor. Observations not in the CS, i. e., treatment group individuals for whom no matching partner is found and control group individuals that are not used as matching partners, are excluded from the analysis.

The propensity score is defined as the true probability of being in the treatment group, conditional on all confounding variables. However, the true propensity score is generally unknown and is often estimated using logistic regression (or a similar binary outcome model):

$$\hat{p}_{i}\left (\boldsymbol{x}_{i}\right )=\hat{p}_{i}(t_{i}=1|\boldsymbol{x}_{i})=\left (1+\exp \left (-\left (\boldsymbol{x}^\prime_{i}\hat{\boldsymbol{\beta }}\right )\right )\right )^{-1}$$

For the matching estimator to be unbiased, conditional independence must hold; i. e., the vector x must contain all variables that simultaneously influence treatment probability and the outcome of interest (“selection on observables”). Additionally, we need to know which variables to include, and if we rely on an estimated propensity score, the propensity score equation must be specified correctly with regard to the functional form (Ho et al. 2007: 218). Because the true propensity score equation is often unknown, it is difficult to correctly specify the functional form. However, Ho et al. note the so-called “propensity score tautology”. They argue that a correctly specified score will eliminate differences in the distributions of covariates, even if the estimated propensity score cannot be proven to be correctly specified a priori. If we conduct a balancing test and find the sample to be sufficiently balanced after matching on the propensity score, we have found the correct specification. Thus, balancing tests, rather than goodness-of-fit tests, are of the utmost importance when using propensity score matching.

There are several ways to determine whether a sample obtained from propensity score matching is sufficiently balanced. For example, one can test the equality of means between the treatment and control groups with a t-test, as originally proposed by Rosenbaum and Rubin (1985). However, this approach has a major disadvantage in that t‑tests depend on sample size. Matching often reduces sample size, and different procedures lead to different sample sizes. Thus, non-significant test results after matching may occur because the sample has been balanced; moreover, they may occur if matching reduced the sample size but the sample remains imbalanced. Balancing tests based on significance are also criticized because balancing is a property of the specific sample under consideration, rather than the population. If a difference in means in the sample is not significant, it can still be large and therefore lead to bias (Ho et al. 2007)Footnote 4. Another common balancing test involves carrying out the matching procedure and then re-estimating the propensity score equation for the matched sample. If the matching procedure has successfully balanced the covariates, the pseudo-R2 should be near zero and non-significant. Similar to t‑tests, this measure depends on sample size. Large differences between the treatment and control groups can be statistically non-significant in small samples, whereas there is a high probability that even small differences will become statistically significant in large samples.

Computing the standardized difference after matching (Rosenbaum and Rubin 1985) is also a common balancing test. This measure, which is not affected by sample size, is expressed as follows:

$$sdiff=\frac{\bar{x}_{t=0}-\bar{x}_{t=1}}{\sqrt{\left (s_{t=0}^{2}+s_{t=1}^{2}\right )*0.5}}$$

where \(\bar{x}_{t=0}-\bar{x}_{t=1}\) is the difference in the mean for covariate x between the treatment and control groups and \(s^{2}\) indicates the sampling variance for this covariate in each group. The standardized difference is the difference in means in the treatment and control groups, expressed as a percentage of the “average” standard deviation over both groups for each covariate. In their original work, Rosenbaum and Rubin (1985) consider 20% to represent a large bias. Currently, however, biases below 3–5% are considered to indicate sufficient balance in matching analyses (Caliendo and Kopeinig 2008: 48).

3 The Dehejia and Wahba (2002) algorithm for reducing misspecification bias

To improve balance, Rosenbaum and Rubin (1984) were the first to propose refining the propensity score by including squares and interactions. In the appendix to their own paper, Dehejia and Wahba (2002) expanded on this idea and proposed a simple algorithm that can be followed if balance is not reached for a given analysis. Dehejia and Wahba (2002) used results from a randomized experiment (the benchmark causal effect) to compare the performance of different variants of the matching estimator. They were mainly concerned with “whether or not to match with replacement, how many comparison units to match to each treated unit, and (…) which matching mode to choose” (Dehejia and Wahba 2002: 153). However, Dehejia and Wahba (2002) also noted that the specification of the propensity score equation may influence the results and proposed a method of identifying the most appropriate specification.

As described by Diaz and Handa (2006: 325), the DW algorithm “essentially entails adding interaction and higher-order terms to [the] base model until tests for mean differences in covariates between control and comparison units become statistically insignificant” (see also Stuart 2010, p. 7). In detail, the DW algorithm begins by (1) stratifying the sample based on quantiles (e. g., 0–0.2, …, 0.8–1) of a propensity score estimated with a parsimonious logistic regression (that is, a specification containing only linear terms). The balance is checked within each stratum by applying a t-test for the equality of means. If the covariates are not balanced for some strata (i. e., the t‑test is statistically significant), (2) the sample should be divided into finer strata, and a new balancing test should be conducted. If these finer strata are not balanced, they recommend that (3) the logistic regression should be modified by “adding interaction terms and/or higher-order terms of the covariates” (Dehejia and Wahba 2002, p. 161), starting with the least balanced variable, until balance is achieved. Comparing the treatment effects obtained by applying their algorithm to a benchmark estimate from a randomized experiment, Dehejia and Wahba (2002) concluded that their algorithm is successful. Both Dehejia and Wahba (2002) and Rosenbaum and Rubin (1984) discussed their augmentation algorithms in the context of a propensity score stratification procedure, not in the context of matching on the propensity score.

The Dehejia and Wahba algorithm has several disadvantages and has thus been criticized. Smith and Todd (2005a) criticized the algorithm’s lack of objective criteria for choosing and refining the initial strata. This lack of objective criteria is problematic because smaller strata include fewer cases; thus, the power of the test is lower. As a consequence, t‑tests become insignificant, even if the bias is still substantial. Additionally, Smith and Todd (2005b) criticize the use of balancing tests per se because they lack formal criteria for determining when the balance is sufficient. In line with this argument, Lee (2013) demonstrated that balancing tests display size problems. For the DW algorithm, he found that the t‑test for balance led to rejection in 23.8% of tested cases, instead of the conventional 5%. To alleviate these high rejection rates, Lee (2013) developed a permutation version of the traditional t‑test. This updated test leads to test sizes of 3.5% for the DW algorithm; thus, it is rather conservative. Lee (2013) also considered the standardized difference, applied a 20% threshold for indicating imbalance, and found a rejection rate of nearly 100%, instead of 5%. Unfortunately, the permutation test developed by Lee (2013) is not applicable to standardized differences. However, the applied threshold is two to three times as large as the 3–5% threshold proposed by Caliendo and Kopeinig (2008: 48) and is thus too high; this fact partially explains the poor performance of the standardized difference in Lee (2013). Finally, Iacus et al. (2012: 21) used an empirical example to show that augmenting the propensity score equation can increase bias, as well as decrease it. To our knowledge, except for an unpublished manuscript by Smith and Zhang (2007), no Monte Carlo simulations have been conducted to test the performance of the DW algorithm. Furthermore, a large part of the criticism refers only to the application of the DW algorithm in the context of propensity score stratification and/or propensity score weighting, thus it is not clear if matching estimators are also subject to the same disadvantages.

4 A modified version of the Dehejia and Wahba (2002) algorithm and two recent alternatives

Instead of abandoning the idea of augmenting the propensity score equation altogether, we propose to modify the DW algorithm in a way that eliminates the above problems but keeps the basic idea of augmentation intact. We are not the first to extend the algorithm to propensity score matching. Indeed, some applied researchers already have employed some of the modifications we propose (see, for example, Diaz and Handa 2006: 325). However, we claim that we are the first to present a systematic argument for such modifications and to systematically test the performance of the modified algorithm using Monte Carlo simulations.

The modification that we propose and test begins (1) with a main-effect logit or probit specification. The specification should include all covariates that are necessary to fulfill the conditional independence assumption. (2) Propensity matching, which is performed instead of stratification, is conducted using a standard matching algorithm (e. g., nearest-neighbor matching with a caliper). By restricting the augmentation algorithm to propensity score matching, instead of propensity stratification, the subjectivity in determining the number of strata that was criticized by Smith and Todd (2005a) is eliminated. In addition, Austin (2009) found that propensity score matching outperforms propensity score stratification in terms of producing a balance between treatment and control groups. (3) After matching, we determine the standardized difference for all covariates (rather than performing t‑tests), the least balanced variable is identified, and the corresponding standardized difference is recorded. By using standardized differences instead of significance tests, the sensitivity of the balancing checks to sample size is alleviated. (4) Steps 1–3 are repeated several times, and the propensity score equation is augmented each time with another interaction term and/or a higher-order termFootnote 5. This step contrasts with that of the original algorithm, which only augments the equation if a t-test indicates imbalance. Because it is unclear what constitutes sufficient balance, we augment the equation repeatedly and select the specification that produces the best balance. This procedure should also avoid the problem that some augmented specifications reduce balance, rather than improving it (Iacus et al. 2012: 21). (5) Among all of the tested specifications, the specification that has the lowest value for the standardized bias in step three is identified. By defining best balance in terms of the maximum imbalance among all variables, the bias due to the worst balanced variable is minimized. (6) The causal effect is estimated using the specification identified in step 5.

A different route was taken by Imbens and Rubin (2015), who proposed an alternative augmentation algorithm. In contrast to the DW algorithm, they do not rely on the propensity score tautology to select the propensity score equation. Rather, their algorithm is based on step-wise logistic regression. However, the algorithm tests whether each individual covariate should be included in the propensity score equation at all, in addition to augmenting the existing set of covariates to avoid misspecification bias. Therefore, the algorithm conducts a broader specification search than the DW algorithm by including or excluding entire variables, based on the strength of their association with the treatment. As described in Imbens (2015), the algorithm starts with selecting a subset of covariates that should be included in the propensity score equation, irrespective of the strength of their association with the treatment. Additional covariates, as well as additional second-order terms, are then included in the propensity score equation, if they pass a specified threshold value. More specifically, the decision to include an additional term is based on the likelihood ratio test statistic of the null hypothesis that the coefficient from a logistic regression predicting treatment assignment is equal to zero. The threshold values are recommended based on simulation analysis, i. e., 1 for additional covariates and 2.71 for additional second-order terms Imbens (2015).

In contrast to both Dehejia and Wahba (2002) and Imbens and Rubin (2015), the estimator proposed by Hong (2010) does not use augmentation. It is based on the potentially misspecified main effects propensity score equation. Hong (2010) focused primarily on extending propensity score stratification to the case of multi-valued treatments. However, this author also argued that the “[marginal mean weighting through stratification (MMWS)] method usually provides a better approximation of nonlinear or nonadditive relationships between treatment assignment and pretreatment covariates” (Hong 2010, p. 523) and therefore is robust to incorrect specification of the functional form of the propensity score equation.

Hong’s (2010) MMWS estimator combines weighting and propensity score stratification. First, the sample of observations is divided into a number of strata, based on the estimated and potentially misspecified propensity score. The strata are chosen such that the number of observations in each stratum is approximately equal. As a consequence of this stratification, within each stratum, the distribution of the covariates should be similar in both the treatment and control groups. Marginal mean weights (mmw) are computed based on the following equation:

$$mmw=\frac{n_{s}P\left (T=t\right )}{n_{t,s}}$$

where \(P(T=t)\) is the probability of receiving treatment version t. In the case of binary treatments, only two versions exist, treatment and no treatment; however, the method can be extended to multiple ordered and unordered treatments. \(n_{s}\) is the total number of observations in stratum s, whereas \(n_{t,s}\) is the number of observations subjected to treatment version t within the same stratum s. To estimate the causal effect, these stratum-specific weights are applied to all observations within the common support.

Hong (2010) supported this argument using a Monte Carlo simulation. More recently, Linden (2017a) compared Hong’s (2010) version to standard propensity score stratification and found the former to be slightly more robust to misspecification than the latter; both methods outperform inverse probability weighting. Linden (2017a; see also Linden 2017b for a similar analysis) appears to both support Hong’s (2010) methodological claim and to indicate that the need for augmentation is not as strong as claimed by the methodologists Dehejia and Wahba (2002) and Imbens and Rubin (2015). However, in a simulation with a more complex design, Linden et al. (2016) found that the bias in MMWS is often similar to the bias in inverse probability weighting, which can be substantial. No comparison to propensity score matching with or without augmentation has been conducted so far.

5 Monte Carlo experiment

We conduct a Monte Carlo simulation to investigate the performance of the modified DW algorithm and other estimators in the presence of misspecification of the propensity score. We use two different simulation structures that have been developed for similar purposes.

Simulation 1 follows a setup developed by Setoguchi et al. (2008) that has been modified only slightly for our purposes. The simulation involves a continuous outcome variable y and a binary treatment variable t (1 if treated, 0 if control) where p(t = 1) = \(\sim\)0.5. Four covariates (\(x_{1}\), \(x_{2},x_{3}\) and \(x_{4})\) are correlated with both the treatment and the outcome, three covariates are correlated only with the treatment (\(x_{5},x_{6}\) and \(x_{7}\)), and another three covariates are correlated only with the outcome (\(x_{8},x_{9}\) and \(x_{10}\)). The covariates are generated such that two groups of some covariates (\(x_{2}\) and \(x_{6}\); \(x_{4}\) and \(x_{9}\)) are highly correlated (0.9) with each other, but not with any of the other covariates. Other covariates are only moderately correlated (0.2) with each other (\(x_{1}\) and \(x_{5}\); \(x_{3}\) and \(x_{8}\)) and also not correlated with the other covariates. The remaining correlations are set to 0. All of the covariates are generated as standard normal random variables, but they are dichotomized after introducing the correlations (\(x_{1}\), \(x_{3},x_{5}\), \(x_{6},x_{8}\) and \(x_{9})\). We implement a continuous outcome variable, following Lee et al. (2010), in which \(y_{i}=0.4t_{i}+\boldsymbol{x}\mathbf{`}_{i}\boldsymbol{\alpha }+\varepsilon _{i}\); however, the coefficient vector \(\boldsymbol{\alpha }\) is the same as in Setoguchi et al. (2008). The random error term \(\varepsilon\) is not correlated with any of the covariates, and \(\varepsilon _{i}\sim N(0,1)\), leading to an \(R^{2}\) of approximately 0.3.

The treatment indicator t (1 if treated, 0 if control) is generated from a binomial distribution with probability \(p_{i} (t_{i}=1)=(1+\exp(-(f(\boldsymbol{x}\mathbf{`}_{i})\boldsymbol{\beta })))^{-1}\), with \((i=1,\ldots ,N)\), where the function \(f(\boldsymbol{x}\mathbf{`}_{i} )\) is specified such that the propensity score equation is characterized by increasing degrees of nonadditivity and nonlinearity from Scenarios A to G. For example, in Scenario A, the true propensity score equation contains only main effects, such that \(p_{i}=(1+\exp (-\, (\upbeta _{0}+\upbeta _{1}\mathrm{x}_{\mathrm{i}1}+\upbeta _{2}\mathrm{x}_{\mathrm{i}2}+\upbeta _{3}\mathrm{x}_{\mathrm{i}3}+\upbeta _{4}\mathrm{x}_{\mathrm{i}4}+\upbeta _{5}\mathrm{x}_{\mathrm{i}5}+\upbeta _{6}\mathrm{x}_{\mathrm{i}6}+\upbeta _{7}\mathrm{x}_{\mathrm{i}7})))^{-1}.\) In Scenario G, the propensity score equation features 10 two-way interaction terms and 3 quadratic terms, a situation Setoguchi et al. (2008) call moderate nonadditivity and nonlinearity (see the appendix for a detailed listing).

The simulation is conducted by generating a population of size N, where N takes the values of 200, 1000, and 2500. A logistic regression analysis is conducted to estimate the propensity score \(\widehat{p_{i}}\) for each sample. Each Monte Carlo simulation consists of R = 1000 replicates of the process, from generating the population N to estimating the causal effect from the sample of N observations. We chose to estimate the average treatment effect on the treated, since this is most common in empirical applications.

The estimators we compare differ in several regards. They differ in terms of the specification of the propensity score equation (i. e., main effects specifications vs. augmentation algorithms), whether or not a treatment predictor is omitted and whether propensity score matching or (in one case) propensity score weighting is used. We first start with a main effects specification that does not omit a treatment predictor but is misspecified with regard to the functional form of the propensity score equation in all but Scenario A. This estimator is our benchmark because it shows us to what degree bias arises when only the functional form of the propensity score equation is misspecified. The second estimator uses a main effects specification for the propensity score equation but omits variable \(x_{2}\), i. e., a variable associated both with the treatment and the outcome. The third estimator differs from the second one only in that it omits variable \(x_{7}\), a variable that is associated only with the treatment and not with the outcome. By comparing the results obtained using estimators two and three with those obtained using estimator one, we can assess the size of the bias caused by misspecification. Causal effects are estimated by way of propensity score matching in all three cases, and we choose the SNNM without replacement with a caliper of 0.01. To estimate the causal effect, we use the Stata add-on “psmatch2” (Leuven and Sianesi 2014).

The fourth estimator, mean marginal weighting through stratification (MMWS), relies on weighting, rather than propensity score matching. To implement MMWS, we first use the Stata add-on “pstrata” (version 1.1.1; Linden 2016) to find the “optimal” number of strata, which corresponds to finding no significant differences in the mean propensity scores (p = 0.05) between the treatment and control groups within each stratum. The algorithm starts with 2 strata and tests for the equality of means. If significant differences remain, the sample is divided in 3 strata and the test is repeated, etc. The algorithm stops either when no significant differences exist or when one stratum contains no treatment or no control observations. The solution is then passed to the Stata add-on “mmws” (version 1.21; Linden 2014) to estimate the causal effect using the respective weights.

Only the fifth and sixth estimators use augmentation to find the preferred specification. Starting with the fifth estimator, the modified Dehejia and Wahba algorithm is implemented as described in Sect. 4. Dehejia and Wahba recommend including interactions and higher-order terms for unbalanced variables, although there is no guarantee that the resulting propensity score will balance the sample. Rather, analysists shouldtest for the equality of means and modify the equation if significant differences remain. Only in that case, the score is re-estimated, and a new balancing test is performed. This process is repeated until analysists find a specification that is sufficiently balanced. Here, we modify the algorithm to always augment the equation step by step, including first the squares of each variable and then all possible first-order interactions one by one. This procedure yields several different estimates. The number of these estimates depends on the number of covariates. From the various specifications we choose the version for which the highest standardized difference among the variables correlated with the treatment is the lowest of the consecutive estimates (i. e., we minimize the maximum bias from the consecutive propensity score estimates).

As implemented here, the modified DW algorithm systematically estimates several logistic regressions, where the interactions and squares are applied to all covariates, not only those that are deemed unbalanced. Because we restrict ourselves to first-order interactions and squares, the modified DW algorithm does not cover all possible specifications of the propensity score. Furthermore, once a square or an interaction term is included, it is no longer excluded from the estimation of the propensity score. Thus, the simulation results are conservative because some specifications that might reduce the balance even further are not considered. Additionally, we stress that none of the specifications coincide with the true one in either Simulation 1 or in Simulation 2 below. This procedure reflects a situation in which researchers remain agnostic about the true specification and follow what subject matter researchers would consider a “mindless” strategy; that is, the interactions and squares are included, regardless of any theoretical justification.

To implement the Imbens – Rubin (2015) algorithm, we use the Stata add-on “psestimate” (Version 1.5.3, Carril 2016). We use the default settings for the threshold values, as they are the ones recommended by Imbens and Rubin (2015). However, the Stata add-on “psestimate” unfortunately does not permit the selection of some covariates to be part of the propensity score equation, regardless of the strength of their correlation with the treatment.

Simulation 1 stops at Scenario G, a situation characterized by Setoguchi et al. (2008) as moderate nonadditivity and nonlinearity in the propensity score equation. In a second simulation (Simulation 2), we model a more extreme form of nonadditivity and nonlinearity. Furthermore, we also investigate the consequences of nonadditivity and nonlinearity in the outcome equation for the performance of the modified DW algorithm. To do this, the second simulation structure follows a setup developed by Kang and Schafer (2007) that has been slightly modified for our purposes. In contrast to Simulation 1, this setup includes fewer covariatesFootnote 6 but allows for nonlinearities and nonadditivity in both the outcome and propensity score equations. In addition, the nonlinearity and nonadditivity in both equations is more complex than those in Simulation 1, explaining why we refer to it as strong nonadditivity and nonlinearity. Four covariates (\(z_{1}\), \(z_{2},z_{3}\) and \(z_{4})\) are generated following a standard normal distribution and subsequently transformed, such that

$$\mathrm{x}_{\mathrm{i}1}=\exp \left (\frac{\mathrm{z}_{\mathrm{i}1}}{2}\right )$$
$$\mathrm{x}_{\mathrm{i}2}=\frac{\mathrm{z}_{\mathrm{i}2}}{1+\exp \left (\mathrm{z}_{\mathrm{i}1}\right )}+10$$
$$\mathrm{x}_{\mathrm{i}3}=\left (\frac{\mathrm{z}_{\mathrm{i}1}\mathrm{z}_{\mathrm{i}3}}{25}+0.6\right )^{3}$$
$$x_{i4}=\left (z_{2}+z_{4}+20\right )^{2}$$

This setup is used to compare four estimators. First, we analyze the performance of a propensity score matching estimator in which the analyst guesses the correct specification of the propensity score equation. Second, we analyze the performance of propensity score matching based on a misspecified main effects propensity score equation. Third, we examine whether marginal mean weighting based on the misspecified main effects propensity score equation eliminates bias associated with potential misspecification. Fourth and fifth, we investigate the performance of both the Dehejia and Wahba and the Imbens and Rubin augmentation algorithms.

We distinguish two scenarios. In Scenario 1, the outcome equation is based on \(z_{1}\) to \(z_{4}\) such that nonlinearity and nonadditivity is only present in the propensity score equation; \(\mathrm{y}_{\mathrm{i}}=2.1+0.4t_{i}+2.74\mathrm{z}_{\mathrm{i}1}+1.37\mathrm{z}_{\mathrm{i}2}+1.37\mathrm{z}_{\mathrm{i}3}+1.37\mathrm{z}_{\mathrm{i}4}+\upvarepsilon _{\mathrm{i}}\). In Scenario 2, the variables \(x_{1}\) to \(x_{4}\) are substituted for the z‑variables in the outcome equation, thus keeping the coefficients but causing the outcome equation to also be characterized by strong nonlinearity and nonadditivity.Footnote 7

Simulation 2 is also conducted by generating a population of size N, where N takes the values 200, 1000, and 2500, a logistic regression analysis is conducted to estimate the propensity score \(\hat{p}_{i}(t_{i}=1)\)for each sample. Moreover, the average treatment effect on the treated is estimated and averaged over the 1000 replications of the simulation. The implementations of PSM, MMMS and the two augmentation algorithms follow those in Simulation 1.

6 Results

In this section, we present the results from the two simulations, both of which are conducted with 200, 1000 and 2500 observations. In Simulation 1, Scenarios A to G differ with regard to the extent of nonlinearity and nonadditivity in the true propensity equation, but the outcome equation is always linear and additive. We compare the effects of the different estimation strategies on the bias of the causal effect estimate, specifically the average treatment effect on the treated. We first discuss the results presented in Table 1, which are based on simulations with 1000 observations, and then compare these to the results for smaller and larger samples that are presented in Table 2 and 3, respectively.

Table 1 Results for Simulation 1, N = 1000
Table 2 Results for Simulation 1, N = 200
Table 3 Results for Simulation 1, N = 2500

We start with results for propensity score matching based on a main effects logistic regression that includes all seven covariates. In Scenario A, this specification coincides with the true specification, and we find only a small absolute bias. In Scenarios B–G, the true specification is characterized by increasing nonlinearity and nonadditivity, and therefore the main effects logistic regression becomes increasingly misspecified. However, contrary to expectations, the bias does not increase with the degree of misspecification. This result contrasts with that obtained for propensity score weighting performed using the same simulation setup (Setoguchi et al. 2008). Therefore, it appears that misspecification bias is less of a problem for propensity score matching than other propensity score methods, especially weighting.

Compared to misspecification bias, omitting an important covariate leads to a substantial bias of around 10%. In line with statistical theory and previous research, we find this omitted variable bias arises only when the variable is associated with both the treatment and the outcome, but not when the omitted variable is associated only with the treatment.

Even if misspecification bias is generally small, there are some differences between estimators, both between those that do and do not omit treatment predictors and between those that are based on main effects propensity score equations and those that rely on augmentation algorithms. In the special case were the main-effect logit is correctly specified, we find that matching on the propensity score leads to a small absolute bias of 1.6%. The bias is similar for marginal mean weighting (1.9%). However, the modified DW augmentation algorithm has a bias of only 1%. Thus, re-estimating the propensity score after including additional nonlinear terms further reduces the already small bias compared to the correctly specified equation. This result occurs because several attempts at propensity score matching are made per replication, and the algorithm chooses the one with the lowest bias. A similar reduction in bias is achieved by the IR augmentation algorithm. In Scenarios B to G, the misspecification bias remains small, although the degree of nonlinearity and nonadditivity is greater. For most of the tested specifications, MMWS does not reduce the bias and sometimes increases it; the absolute bias reaches 2.5% in Scenario E. In contrast, the two augmentation algorithms mostly continue to reduce bias.

Over all of the tested scenarios, the standard errors are similar for all estimators. In contrast, the mean standardized differences diverge between estimation strategies. First, even when propensity score matching is applied with misspecified main effects, the mean standardized difference is between 3 and 4% and thus well within the range considered to indicate sufficient balance in a matching analysis (Caliendo and Kopeinig 2008: 48). Because these low values are accompanied by only small misspecification bias, there is no contradiction. Also, bias reduction using the DW algorithm is accompanied by a simultaneous reduction in the mean standardized differences, which are reduced to below 2%. However, although marginal mean weighting also reduces the mean standardized bias in the covariates compared to the misspecified propensity score matching, this change is not always accompanied by reductions in bias. Notably, in the case where a treatment predictor is omitted, the standardized difference remains low with a maximum of 5%, even in the case of omitted variable bias. On the other hand, the standardized difference is quite high, reaching almost 12%, even if omitting a variable does not induce bias.

Comparing the results for 1000 observations to those for the very small sample of 200 observations (Table 2) shows that the standard errors, as well as the mean standardized bias in the covariates, are higher, as expected. The omitted variable bias does not seem to depend strongly on sample size. Although the misspecification bias is larger in the smaller sample, it is still rather small. Also in the small samples, the bias does not change substantially, even when the degree of misspecification increases. Again, the DW and IR algorithms slightly improve upon both the correct main effects specification and the misspecified main effects logit regressions. However, when the degree of misspecification increases, the performance of both augmentation algorithms decreases. In contrast to the medium-sized sample, in smaller samples, the marginal mean weighting does not reduce the bias but instead increases it; the absolute bias often reaches 5 to 6%. Both the standard errors and mean standardized differences are generally higher in the smaller sample than in the medium-sized sample, as is to be expected. If the sample size increases to 2500 observations (Table 3), we find that the misspecification bias mostly disappears, regardless of the degree of misspecification and whether augmentation is used or not. The standard errors and mean standardized differences are very small.

In Simulation 2, where both the true propensity score and the outcome equation are allowed to contain high levels of nonlinearity and nonadditivity, the results are quite different from those of Simulation 1. Again, we start by discussing the results for the case with 1000 observations shown in Table 4. In Scenario I, the outcome equation is strictly linear and additive, whereas the true propensity score equation is not. In such cases, propensity score matching that uses the correct specification to estimate the propensity score is only slightly biased (by 1%), and this degree of bias is similar to that obtained with the correct main effects specification in Simulation 1. In contrast to Simulation 1, however, we find that the misspecified main effects estimation of the propensity score equation leads to a substantially larger absolute bias of 13%. This bias is virtually unchanged by marginal mean weighting. This result contradicts the claims of both Hong (2010) and Linden (2017a, 2017b) that stratification alleviates or even eliminates misspecification bias, but it is in line with results obtained in Linden et al. (2016). In contrast, the DW algorithm reduces the absolute bias by about half (to 7%), whereas the IR algorithm reduces the bias to 9%. It seems that, in the case of higher levels of nonlinearity and nonadditivity, misspecification of the propensity score equation actually does lead to considerable bias. Both of the augmentation algorithms, but not MMWS, are able to reduce the bias substantially, but they are not able to eliminate it altogether.

Table 4 Results for Simulation 2, N = 1000
Table 5 Results for Simulation 2, N = 200
Table 6 Results for Simulation 2, N = 2500

Turning to Scenario II, where the true outcome equation is strongly nonlinear and nonadditive, propensity score matching is biased by approximately 11%, even when it is based on a correctly specified model. This result is in line with that of Kang and Schafer (2007) for weighting and double robust estimators. If propensity score estimation is additionally based on a misspecified main effects logit, the estimate is severely biased, and the absolute bias becomes 77%. Surprisingly, MMWS is as biased in Scenario II as in Scenario I; i. e., the absolute bias is 16%. Whereas both augmentation algorithms perform similarly in Scenario I, the IR algorithm does not reduce the bias; instead, it increases it slightly (to 82%) in Scenario II. In contrast, the DW algorithm leads to a strong decrease in the bias and leads to an absolute bias of merely 2.6%. The suboptimal performance of the IR algorithm is, however, most likely attributable to the specific implementation in stata. The user written program does not allow for retaining specific variables in the propensity score equation, irrespective of their correlation with the treatment. This restriction might have lead the algorithm to eliminate variables from the propensity score equation that have only small correlation to the treatment, but if the same variables are highly correlated to the outcome, their exclusion might still be problematic.

Similar to Simulation 1, the relationship between the degree of bias and the mean standardized difference is not unambiguous. For both the unbiased and the (moderately and severely) biased estimators, the values of the mean standardized bias are similar and well within the accepted range of 2–5%. In contrast, the mean standardized bias is in fact lowest for the DW algorithm, which is the least biased estimator. We interpret this result as evidence for the point made in Ho et al. (2007), who argued that even small standardized differences should be avoided for unbiased estimation.

The observed pattern does not change when the results from Simulation 2 obtained with populations of 1000 observations are compared with those obtained using much smaller or larger samples (Tables 5 and 6). Regardless of sample size, the DW algorithm performs best among the selected estimators, even if the outcome equation is nonlinear and nonadditive.

7 Conclusions

In this paper, we tested the performance of augmentation as proposed by Dehejia and Wahba (2002) to reduce bias due to misspecification. We proposed to slightly modify the original algorithm to alleviate the problems pointed out by its critics. The original algorithm involves systematically introducing interactions and higher-order terms while checking whether balance is improved. There is, however, no guarantee that balance will improve, even if the introduced interactions are theoretically sound. Therefore, the modified Dehejia and Wahba–algorithm proposes to test several specifications and select the one that produces the best balanced sample. The step-wise cumulative augmentation of the logistic regression by introducing interactions and higher-order terms is a simple and convenient strategy to reduce bias.

Based on the two preceding Monte Carlo simulations, we can draw several conclusions. From the first simulation, we learned that compared to omitting a relevant variable, misspecifying the functional form in the model that estimates the propensity score induces only small bias. Even in cases where the true equation is characterized by moderate nonlinearity and nonadditivity, misspecification is often negligible, as long as the outcome equation is linear and additive. However, the modified Dehejia and Wahba (2002) algorithm still helps to further reduce the misspecification bias. An alternative augmentation algorithm suggested by Imbens and Rubin (2015) performed similarly, whereas a variant of propensity score stratification proposed by Hong (2010) performed slightly worse, especially in small samples.

From the second simulation we learned that in case of high nonlinearity and nonadditivity, misspecification bias can be quite severe. If the functional form of the propensity score equation is misspecified, but the outcome equation is linear and additive, misspecification bias is quite substantial. If in addition, the outcome equation is also highly nonlinear and nonadditive, bias becomes severe. In both cases, however, we found the modified Dehejia and Wahba–algorithm to reduce bias markedly, whereas the Imbens and Rubin (2015) algorithm only reduced bias if the outcome equation was linear and additive.

In all, misspecification will not always induce bias, especially when the true equation is only moderately nonlinear and nonadditive. Because the true specification of the propensity score equation is unknown, it seems prudent, however, to always take measures to avoid misspecification bias. We found the modified Dehejia and Wahba–algorithm to do a good job in reducing bias, especially compared to propensity score stratification and often also compared to the recent proposal by Imbens and Rubin (2015).