Introduction

Over the past several decades, significant advances in statistical, epidemiologic, and econometric methods for estimating causal effects have been made. These new methods provide useful tools for improving the efficiency of data analysis. However, the introduction of these methods to applied research has not been uniform across scientific disciplines, and even within a field of study, causal inference methods have often been limited to particular application areas. This is a problem because the application of methods which have been developed based on theory to the messy world of actual data can be challenging, and the complexities of data differ between fields of specialization. An economist and a public health researcher cannot necessarily expect to encounter the same types of analytic problems nor to analyze their data using the exact same means and underlying assumptions when the goals of their research are often different. As a result, tutorial material specifically tailored to public health and medical questions are needed in order to address the types of data and analytic issues that arise in these fields and connect econometric methods with other methods more well-known in epidemiological research. Here, we provide a step-by-step walk-through of using the difference-in-differences method for epidemiologists and medical researchers. We apply this method to estimate the impact of a department-level intervention on surgical safety outcomes at the department-level within a single hospital.

The difference-in-differences (DiD) method, although originally developed by an epidemiologist [1], is a widely used tool for estimating causal effects in econometrics. The DiD method compares changes in an outcome over time between an intervention and control group and is thus ideal for estimating the effects of policies or interventions that are applied at the group level rather than at the individual level. An important caveat to this method is that rather than estimating the average effect of the intervention on the outcome in the entire population (which is a common estimand in epidemiologic research), DiD estimates the average effect that the intervention has had on the outcome in the group exposed to the intervention.

There have recently been calls for this method to be more widely used in epidemiology [2•, 3] as well as attempts to frame DiD in the language of epidemiologic methods [4•]. However, there remain relatively few applications of DiD to epidemiologic problems [5] and a dearth of training materials targeted specifically towards epidemiologists and medical researchers.

Recently, our group, composed of epidemiologists and clinical researchers, collaborated on a study using DiD to estimate the impact of a pre-operative device briefing tool on surgical team performance in departments where surgeons had the opportunity to use this tool [6]. Surgical quality was measured with an ordinal outcome which summarized the score on an observer-recorded survey. Although this initially seemed like a straightforward use case for DiD, we rapidly discovered a challenge: despite the widespread use of DiD in the econometrics literature, we were unable to find guidance or examples on the use of DiD for bounded, categorical, or count outcomes.

In retrospect, this is unsurprising. Epidemiologic data most often deals with binary outcomes, such as case versus control, or alive versus dead, but categorical outcomes are also frequently of interest. These may include outcomes such as our survey score variable, or similar survey outcomes such as pain level, as well as count outcomes such as numbered events, such as number of migraines experienced in a month. In contrast, these types of outcomes are less common in econometrics, where the primary outcomes are often continuous variables such as costs. Furthermore, even when the outcome is binary, econometricians commonly model the outcome with a linear probability model. As a result, there has been little need for econometrics to address the question of how to handle categorical outcomes in DiD models.

In this tutorial paper, we describe the basics of the DiD method, walk through the assumptions required to apply it to a research question, and explore the different options for implementing DiD in data with categorical outcomes. We provide synthetic data and code in SAS, Stata, and R along with an explanation of the steps in the DiD modeling process, including sensitivity checks and assessment of modeling assumptions.

Getting to Know the Data

Our tutorial is based on synthetic data created to mirror the real data collected in a study by an academic medical center in collaboration with Johnson & Johnson [6]. In that study, the investigators aimed to test the impact of a Device Briefing Tool (DBT), a short communication instrument designed to promote discussion of safe device use among members of surgical teams to improve surgical safety. The DBT was implemented in four general surgery departments (colorectal surgery, upper gastrointestinal surgery, hepato-pancreato-biliary surgery, and acute care surgery) in a large academic referral center. Four additional surgical departments (cardiothoracic surgery, head and neck surgery, gynecology, and urology) were enrolled as a comparator group. The impact of the DBT on non-technical skills was assessed using the NOTECHS behavioral marker system. The NOTECHS system evaluates four behaviors across three surgical sub-scores, with total possible scores ranging from 4 to 48 points. Trained observers rated teams in live operations before and after implementation of the DBT. Observers also recorded features of each operation such as case complexity (rated 1–3, least challenging to most challenging) and type of surgical device in use (tissue sealing device, surgical stapler, or other). However, baseline observations were interrupted by the COVID-19 pandemic. Therefore, this study had three distinct time periods: pre-COVID baseline, post-COVID baseline, and post-intervention.

We created the synthetic dataset in SAS using the empirical distribution of covariates and NOTECHS scores observed in the real data. The use of a synthetic dataset allows us to share the data publicly, without concerns about violating patient confidentiality. However, it is important to remember that since the data in this tutorial are simulated, any conclusions from the analysis of this tutorial dataset should not be taken as evidence for or against real world causal effects.

Although the original study only included 210 surgeries distributed across 8 simulated departments, we created a larger simulated dataset with 20,000 simulated surgeries reflecting the distributions observed in the original study [6]. Simulated operations were then assigned to a time period and intervention group, based on the three periods of observation, and using a pseudo-linear time scale in months with t = 0 denoting the post-COVID baseline, t = 1 denoting the post-intervention period, and t = −12 denoting the pre-COVID baseline period. The covariates case complexity (case_challenge) and type of surgical device (device) were simulated by drawing a random uniform number and assigning the complexity or device category based on the empirical distribution of these variables in the original dataset. Finally, dummy variables for department and case complexity and their interactions were created. We modeled the expected average NOTECHS score in the actual data as a function of these dummy variables using a log-linear regression model and then generated synthetic NOTECHS scores for our simulated data by exponentiating the linear predictor for each surgery from the estimated regression coefficients. Scores were then rounded to the nearest whole number using the round (<>, 0) function in SAS, and truncated at the minimum and maximum possible values of 4 and 48. Code for simulating the data is provided in the Online Supplement (Code Section A1, SAS code only), as well as data files in SAS (did_sim.sas7bdat), Stata (did_sim.dta), and comma-separated (did_sim.csv) formats.

Defining the Causal Question

DiD answers a causal question which may be unfamiliar to some epidemiologists and medical researchers — rather than the more commonly used average treatment effect (or ATE), DiD provides an estimate of the causal effect of the change in treatment status for the group which actually experienced that change (Fig. 1). That is, it estimates the average treatment effect among the treated (often called the ATT). In the context of our example, the ATT can be described as the impact on surgical performance of the DBT for those operations which occurred in departments where the DBT was available for use.

Fig. 1
figure 1

Potential contrasts which could be estimated. The semi-circles represent study groups; shading represents (real or hypothetical) assigned conditions: gray = intervention; white = control. Line type represents observed (solid) or counterfactual (dashed) estimates. When observed outcomes in each group are compared, based on their actual assigned conditions, this provides an estimate of the observed association. The average treatment effect (ATE) estimates what would have happened if everyone had been assigned to each of the intervention and control conditions. The average treatment effect in the treated (ATT) estimates the difference between what happened to the intervention group under their assigned condition, and what would have happened to the intervention group if they had been under the control condition

This causal question may not seem much different to the general reader, but it has important implications for our required causal assumptions — namely, we must have a control group whose trend over time provides a valid counterfactual for what the experience of the intervention group would have been if they had not received the intervention. But we do not require the reverse (i.e., we do not require that the intervention group’s experience reflect what would have happened to the control group had they been given the intervention).

Causal Assumptions

Table 1 outlines the assumptions required for validity of our causal effect estimate from the DiD analysis. These assumptions have been described in more detail for epidemiologic audiences elsewhere [4•], but we briefly review them here.

Table 1 Difference-in-differences assumptions

First, we require the assumption of causal consistency, which can be expressed as the requirement that we are asking a well-defined causal question. Mathematically, this assumption requires that the intervention we have in the data is the one that we intended to assess with our counterfactual comparison, such that the observed outcome under intervention is equal to the counterfactual outcome under that same intervention. In the context of our tutorial example, consistency is expected to hold at the level of being trained on the DBT, since the same tool was given to all surgeons in the intervention departments and all surgeons in those departments were provided with tool training. However, surgeons had discretion about choosing to use the DBT or not, as well as there being potential variation in how the DBT was used. In order for the consistency assumption to hold, we must further assume that the reasons a surgeon chose to use or not use the DBT, and the various ways in which the DBT was used in practice, did not alter the causal effect of the DBT’s use on surgical quality as measured by NOTECHS score.

Second, we require positivity, sometimes called “overlap,” such that all departments in the study had a non-zero probability of being selected for either the intervention or control group. Since this was an intervention study, positivity is expected by design — even though the intervention departments were selected as units, based on shared nursing load, it was possible, pre-assignment, that either set of departments could have been assigned to the control or intervention group.

Finally, we need a version of the causal exchangeability assumption — the parallel trends assumption. This assumption specifies that the trend in the outcome over time in the control group is parallel to the trend in the outcome over time that would have been observed in the intervention group, had the intervention group not received the intervention. Changes in trends of covariates over time might indicate a potential for the outcome trend to change over time but are not in themselves violations of the exchangeability assumption. If this assumption holds, we can use the control group’s trend to estimate the counterfactual NOTECHS scores for the intervention group. Figure 2 displays the assumed pattern of actual and counterfactual trends that the difference-in-differences method relies upon.

Fig. 2
figure 2

Difference-in-differences example graph. The two lines represent the trends over time in outcome measures for the intervention (Z = 1, orange) group and the control (Z = 0, blue) group. The x-axis represents time in months, from t = − 12 months (pre-COVID baseline) to t = 1 (end of follow-up). The vertical line indicates the switch from pre-intervention (T = 0) to post-intervention (T = 1). The parallel trends assumption relies on the idea that if the intervention group had not received the intervention, the average outcome in that group would have continued to follow the trend observed in the control group (and in the intervention group, pre-intervention). This counterfactual trend is indicated with the gray line. The difference-in-differences estimates the outcome change between the counterfactual intervention line (gray) and the actual intervention line (orange) between baseline (t = 0) and the end of follow-up (t = 1)

The parallel trends assumption can never be empirically verified, but we can assess how reasonable this assumption is by looking at the pre-intervention time-trends in the control and intervention group. Although a DiD can be estimated with only two time points — one each before and after the intervention — having a third time point allows quantitative assessment of the parallel trend assumption. In studies with only one pre-intervention time-point, subject matter knowledge must be used to support the validity of this assumption. In the tutorial code, the first section (Code Section 1.1) estimates the change in NOTECHS scores between the pre- and post-COVID baseline periods for the control and intervention departments and calculates the difference-in-differences for this contrast. The results, displayed in Table 2, show that there is very little difference in the pre-intervention trends between the two groups (trend pre-/post-COVID differs by 0.4 points between groups), providing support for (but not proof of) the parallel trends assumption.

Table 2 Parallel trends evaluation: observed average outcome score in the synthetic dataset comparing the two baseline periods (t≤0; N = 9951) and assignment group

Difference-in-Differences Estimator

Now that we have discussed the data, questions, and causal assumptions required, we briefly review the estimator. As described above, the DiD estimator, 𝛿DiD, is a comparison between two average counterfactuals: the expected counterfactual outcome for the intervention group after the intervention time point had the intervention been delivered, and the expected counterfactual outcome for the intervention group after the intervention time point had the intervention not been delivered. Figure 3 summarizes the relationships between group assignment, intervention application, covariates, and our outcome variable over time in a directed acyclic graph (DAG).

Fig. 3
figure 3

Directed acyclic graph (DAG) representing the assumed data structure for the tutorial example. Yt is the average NOTECHS score at time t, with t = −12 at the pre-COVID baseline, t = 0 at the post-COVID (main) baseline, and t = 1 at follow-up. Z is intervention group assignment; Z = 1 for departments assigned to the DBT, and Z = 0 for control departments. NOTECHS score is expected to vary between departments due to several factors, including the typical complexity of surgeries within a department, and the typical devices used within the department, as well as due to shared nursing pools between departments within a given group assignment (hence the department --> Z arrow). The specific device type and case complexity of a given surgery will also be an independent determinant of NOTECHS score for a given surgery. While NOTECHS score is measured at three time points, each component observation comes from a different surgery for which the device types used and case complexity are time-invariant characteristics of that surgery, unaffected by the characteristics of prior surgeries

We can represent the DiD estimator in expectation notation with Equation (1) (Table 3):

$${\delta}_{\mathrm{DiD}}=\left(E\left(\left.{Y}_{t=1}^{z=1}\right|Z=1\right)-E\left(\left.{Y}_{t=0}^{z=1}\right|\mathrm{Z}=1\right)\right)-\left(E\left(\left.{Y}_{t=1}^{z=0}\right|\mathrm{Z}=1\right)-E\left(\left.{Y}_{t=0}^{z=0}\right|Z=1\right)\right)=E\left(\left.\Delta {Y}^{\mathrm{z}=1}\right|Z=1\right)-E\left(\left.\Delta {Y}^{\mathrm{z}=0}\right|Z=1\right)$$
(1)

where Z indicates group assignment (Z = 1 for intervention, Z = 0 for control), T indicates time period (T = 1 for post-intervention, T = 0 for baseline). Y represents the outcome such that Y indicates observed NOTECHS score, Yz indicates NOTECHS score counterfactual on intervention status, Yt indicates observed NOTECHS score at a particular time point, \({Y}_t^{\mathrm{z}}\) indicates the counterfactual NOTECHS score at a particular time point. ∆Y & ∆Yz are the change in outcome and counterfactual between the baseline and follow-up time points.

Table 3 Notation

To understand how the choice of modeling assumptions impact the estimated difference-in-differences value, it is useful to walk through the estimation of 𝛿DiD. Under the assumptions described in Table 1, we can estimate the two pieces of 𝛿DiD from our synthetic data as follows.

If consistency holds, the average potential outcome scores for the intervention group after intervention, E(\({Y}_{\mathrm{t}=1}^{\mathrm{z}=1}\)|Z = 1), and before the intervention, E(\({Y}_{\mathrm{t}=0}^{\mathrm{z}=1}\)|Z = 1), can be estimated directly from the observed average scores in the intervention group, E(Y|Z = 1, T = 1) and E(Y|Z = 1, T = 1), or the observed change, E(∆Y|Z = 1).

Under the assumption of consistency, the observed outcomes under no intervention in the control group are direct estimates of the potential outcomes under no intervention for the control group. In contrast, the potential outcome at follow-up under no intervention is unobservable for the intervention group. However, the parallel trends assumption, if true, allows us to estimate the counterfactual outcomes under no intervention for the intervention group. If we assume that the (counterfactual) trends over time, under no intervention, are parallel between the two groups, then the counterfactual average change in outcome that the intervention group would have had, between baseline and follow-up time, if they had not received the intervention, can be estimated from the observed change over time in the control group, E(∆Y|Z = 0). The parallel trends assumption is equivalent to assuming that E(∆Yz=0|Z = 1) = E(∆Yz=0|Z = 0), while consistency tells us that E(∆Yz=0|Z = 0) = E(∆Y|Z = 0).

Given these assumptions, we can estimate \(\hat{\delta}\)DiD as E(∆Y|Z = 1) − E(∆Y|Z = 0). This can be done via hand calculation, but since we are also interested in adjusting for covariates, we present code to calculate the unadjusted estimator using the regression equation:

$$E\left(\left.Y\right|Z,T,T\ge 0\right)={\beta}_0+{\beta}_1I\left(T=1\right)+{\beta}_2Z+{\beta}_3I\left(T=1\right)Z$$
(2)

In this equation, I(T = 1) represents an indicator variable for pre- versus post-intervention. When the data are restricted to T ≥ 0, the regression model estimates four groups: control group pre-intervention; control group post-intervention; intervention group pre-intervention; and intervention group post-intervention. Using this regression model, we can substitute the four values of T and Z of interest into the linear predictor to derive the difference-in-differences estimator. We see that in this case, the estimator is simply the regression coefficient for the interaction between intervention status (time) and group.

$${\widehat\delta}_\mathrm{DiD}=\left[E\left(\left.Y\right|Z=1,T=1\right)-E\left(\left.Y\right|Z=1,T=0\right)\right]-\left[E\left(\left.Y\right|Z=0,T=1\right)-E\left(\left.Y\right|Z=0,T=0\right)\right]=\left[\left(\beta_0+\beta_1(1)+\beta_2(1)+\beta_3(1)(1)\right)-\left(\beta_0+\beta_1(0)+\beta_2(1)+\beta_3(0)(1)\right)\right]-\left[\left(\beta_0+\beta_1(1)+\beta_2(0)+\beta_3(1)(0)\right)-\left(\beta_0+\beta_1(0)+\beta_2(0)+\beta_3(0)(0)\right)\right]=\left[\left(\beta_0+\beta_1+\beta_2+\beta_3\right)-\left(\beta_0+\beta_2\right)\right]-\left[\left(\beta_0+\beta_1\right)-\left(\beta_0\right)\right]=\beta_3$$
(3)

Code Section 2.1 demonstrates the estimation of the unadjusted difference-in-differences using Equation (2). The results are displayed in Table 4. We can see that this model estimates a roughly 1.7-point decrease in NOTECHS score in the intervention group, after receiving the DBT. We note here that since the intervention assignment was made at the department level, we should use robust (sandwich) estimators to obtain the variance accounting for any clustering by department. Alternatively, the variance or confidence interval can be obtained via bootstrapping.

Table 4 Difference-in-differences from simple crude linear model, discarding pre-COVID baseline time-point (t≥0, N = 15,086) (Equation (2))

Modeling Assumptions: Time Trends

The model in Equation (2) makes some simplifying statistical assumptions, in addition to the causal inference assumptions. First, this model discards information from the pre-COVID baseline period (t=−12). However, we can potentially improve the estimation of the counterfactual trend line by including the data from this time point, but doing so requires making assumptions about the appropriate shape of the trend-line. Since we are making the parallel trends assumption, and have only three time points, here it seems reasonable to assume that the trend is linear, and update the regression equation to include time and data from the pre-COVID baseline period (t=−12):

$$E\left(\left.Y\right|Z,T\right)={\beta}_0+{\beta}_1I\left(T=1\right)+{\beta}_2Z+{\beta}_3I\left(T=1\right)Z+{\beta}_4\mathrm{t}$$
(4a)
$$={\beta}_{0,\mathrm{t}}+{\beta}_1I\left(T=1\right)+{\beta}_2Z+{\beta}_3I\left(T=1\right)Z$$
(4b)

Code Section 2.2 demonstrates the estimation of the unadjusted difference-in-differences using Equation (4a). Note that time is not expected to interact with our effect, and so we can also consider the coefficient for time as a component of the slope, as in Equation (4b). With this modification, it is clear that our estimator is still obtained via the coefficient on the I(T = 1)Z interaction term as in Equation (3) (i.e., \(\hat{\delta}\)DiD = \(\hat{\beta}\)3).

The results are displayed in row 1 of Table 5. This model estimates a similar, but slightly smaller, 1.5-point decrease in score in the intervention group, after receiving the DBT. As before, we note that since the intervention assignment was made at the department level, we should use robust (sandwich) estimators or bootstrapping. The confidence intervals in Table 5 were obtained using 500 non-parametric bootstrap samples.

Table 5 Summary of marginal effect estimates and predicted mean outcome scores using the full dataset, including pre-COVID baseline. 95% confidence intervals from empirical distributions of results across 500 bootstrap samples

Modeling Assumption: Outcome Distribution

The second modeling assumption we have made in Equations (2) and (4) is a linear link and normal residual distribution between our independent and dependent variables. However, as we described above, the NOTECHS score is an integer bounded between 4 and 48, and it may be more appropriate to assume log-link and a Poisson error distribution. This would result in the following regression model:

$$\mathrm{logE}\left(\left.Y\right|Z,T\right)={\alpha}_{0,\mathrm{t}}+{\alpha}_1 AI\left(T=1\right)+{\alpha}_2Z+{\alpha}_3I\left(T=1\right)Z$$
(5)

In Equation (5), we use the Greek letter alpha to clarify that the coefficients on the variables will have different values from those estimated with Equations (2) and (4). Now, the regression model estimates the natural logarithm of the expected NOTECHS score, and we can no longer use the coefficient on the interaction term as the estimator of our causal effect. However, we can use the same process as in Equation (3) to derive our estimator, recognizing that the estimated expected value of the NOTECHS score is obtained by exponentiating the linear predictor:

$$E\left(\hat{Y\mid Z},T\right)={e}^{\alpha_{0t}+{\alpha}_1\mathrm{I}\left(\mathrm{T}=1\right)+{\alpha}_2Z+{\alpha}_3\mathrm{I}\left(\mathrm{T}=1\right)Z}$$
(6)

Now, plugging the appropriate values of A and Z into our estimator from Equation (3) gives us:

$${\widehat\delta}_\mathrm{DiD}=\left[E\left(\left.Y\right|Z=1,T=1\right)-E\left(\left.Y\right|Z=1,T=0\right)\right]-\left[E\left(\left.Y\right|Z=0,T=1\right)-E\left(\left.Y\right|Z=0,T=0\right)\right]=\left[\left(e^{\alpha_{0t}+\alpha_1+\alpha_2+\alpha_3}\right)-\left(e^{\alpha_{0t}+\alpha_2}\right)\right]-\left[\left(e^{\alpha_{0t}+\alpha_1}\right)-\left(e^{\alpha_{0t}}\right)\right]$$
(7)

The rules of exponents mean that Equation (7) does not readily simplify. Instead, we must compute each expected value and then calculate the estimate, as in Code Section 2.3. Note that, in SAS, this necessitates the use of bootstraps to obtain valid confidence intervals (Supplementary Code Section A2, SAS code only), while in Stata, careful application of the margins command can provide robust variance estimates.

The results of the analysis using Equations (6) and (7) are presented in row 2 of Table 5. Comparing the estimates in Table 5, we see that the choice of modeling assumptions does not affect our effect estimate. Modeling the time trend with additional data did slightly attenuate our effect estimate, but changing the regression model link and error distribution had essentially no effect on our point estimate.

We can reconcile this by understanding that even if our categorical or count outcome may be better represented by a Poisson error distribution, the outcome of interest for the difference-in-differences estimator is actually the change in the outcome (∆Y). The distribution of the difference of two Poisson distributed variables is not itself expected to be Poisson-distributed, since the value is now bounded by a minimum change of −44 points and a maximum of +44 points. Instead, the change outcome is expected to follow a Skellam distribution, which describes the difference between two Poisson-distributed variables [7]. When the average value of the two Poisson-distributed variables is similar, the resulting Skellam distribution will be approximately normal (although with a sharp peak). In addition, the distribution of interest for the regression model is not the distribution of the outcome per se but rather the distribution of the residuals or error terms. Since our true outcome may be expected to be approximately (sharply) normally distributed, it is reasonable to assume that the residuals will also be approximately normal, and we can use a linear regression model.

As we can see in our synthetic data, and as was also the case in sensitivity analyses on the original study data, linear regression can be an appropriate choice for modeling the DiD estimator even when the outcome data are categorical, bounded, or count data since the target for estimation is the change in outcome. This result has also been described by the economist Dr Jeff Wooldridge [8,9,10] but does not appear to be widely known.

Selecting Confounders

The final step in modeling our DiD estimator is to consider whether any confounders are needed to meet the parallel trends assumption. The parallel trends assumption is a weaker assumption than the exchangeability assumption more commonly used in epidemiologic research, and it reduces the types of variables which are relevant to consider for confounding control. Specifically, the parallel trends assumption is not violated by any group-specific confounders which remain at a constant frequency over time. Instead, the confounders of interest for a difference-in-differences analysis are those variables which differ between groups, impact the outcome of interest, and for which the frequency changes over time in different ways between the groups. As a result, such variables are potentially alternative causes of the change in the outcome.

In our tutorial example, there are two potential covariates to consider: device type and case complexity. Device type and case complexity both vary between departments. In addition, because surgeons in intervention departments were provided the opportunity to choose which surgeries to use the DBT in and only these surgeries had recorded post-intervention DBT group NOTECHS scores, both device type and case complexity may vary before and after the intervention within the intervention departments. As a result, we should consider both device type and case complexity as potential confounders. Note that the average value of these variables across surgeries within a department are time-varying, but our data are collected at the level of surgeries, and for each individual surgery, the device type and case complexity are time-invariant.

The dataset also contains a third covariate: department. Although NOTECHS score may vary by department, department is not a confounding variable. This is because the group assignment was made at the department level and thus all operations within a department share group assignment. It is important to note, however, that if we had assigned the intervention to individual operations, we might need to consider confounding by department because the number of surgeries in each department does change over time.

Estimating Conditional Adjusted Difference-in-Differences

Adjustment for confounders requires using a multivariable regression model, but as we shall see, such a model is only the first step in obtaining an adjusted DiD estimate. Since our initial model-building process indicated that a linear regression model was reasonable, we focus the discussion of adjustment on linear models, Code Section 3. Equations for adjusted Poisson regression and DiD estimation are presented in the Supplementary Materials and in Supplementary Code Section A3. Note that as with the unadjusted Poisson estimator, the conditional adjusted Poisson estimator cannot be read directly off the model output.

For the adjusted linear regression, we model the expected NOTECHS value conditional on the covariates (Equation (8)). This model strongly resembles that of the crude linear analysis (despite the use of a new Greek letter, gamma, for the coefficients), except for the addition of a new term: γ4’L. This term is bolded and has an apostrophe after the coefficient to indicate that rather than representing a single variable and its single coefficient the term γ4’L represents a sequence (or matrix) of covariate-coefficient product terms added together for as many confounders (and potentially their interactions with each other) as are to be included in the model. This short hand makes the derivation of the DiD estimator simpler, without impacting its validity. Code for the regression in Equation (8) is found in Code Section 3.1.

$$E\left(\left.Y\right|Z,T,L\right)={\gamma}_{0\mathrm{t}}+{\gamma}_1I\left(T=1\right)+{\gamma}_2Z+{\gamma}_3I\left(T=1\right)Z+{{\boldsymbol{\gamma}}_{\textbf{4}}}^{'}\boldsymbol{L}$$
(8)

As with the unadjusted linear model, we can compute a DiD estimator from the coefficient on the I(T = 1)Z interaction term (Equation (9)). However, because our adjusted regression model estimates the conditional outcome for each T-Z grouping, E(Y|Z,T,L) not the unconditional outcome, E(Y|Z,T), this DiD estimator in Equation (9), \(\hat{\updelta}\)did|L, is not necessarily the same as the one estimated in Equation (3), \(\hat{\updelta}\)did.

$${\widehat\delta}_{\left.\mathrm{did}\right|\mathbf L}=\left[E\left(\left.Y\right|Z=1,T=1,L=\boldsymbol\ell\right)-E(\left.Y\right|Z=1,T=0,L=\boldsymbol\ell)\right]-\left[E\left(\left.Y\right|Z=0,T=1,L=\boldsymbol\ell\right)-E (\left.Y\right|Z=0,T=0,L=\boldsymbol\ell)\right]=\left[\left(\gamma_{0\mathrm t}+\gamma_1+\gamma_2+\gamma_3+{\boldsymbol\gamma}_{\mathbf4}\boldsymbol'\boldsymbol\ell\right)-\left(\gamma_{0\mathrm t}+\gamma_2+{\boldsymbol\gamma}_{\mathbf4}\boldsymbol'\boldsymbol\ell\right)\right]-\left[\left(\gamma_{0\mathrm t}+\gamma_1+{\boldsymbol\gamma}_{\mathbf4}\boldsymbol'\boldsymbol\ell\right)-\left(\gamma_{0\mathrm t}+{\boldsymbol\gamma}_{\mathbf4}\boldsymbol'\boldsymbol\ell\right)\right]={\widehat\gamma}_3$$
(9)

If we are interested in the \(\hat{\updelta}\)did|L estimator, however, we can use the (robust) variance estimate directly from the regression output to obtain our variance and confidence interval around this value. Since this estimator is conditional on the value of the covariates, Table 6 presents the \(\hat{\updelta}\)did|L estimates for each joint case complexity — device stratum. The table also presents the simple average within device strata, and case complexity strata, and overall, for both the linear and the Poisson models (no weighting). We can see that the linear model returns the same estimate (\(\hat{\upgamma}\)3) for both the overall \(\hat{\updelta}\)did|L estimate and the stratum-specific estimates. This is because our data does not have any additive effect measure modification, so despite the use of a product term in the regression to allow for assessment of stratum-specific effects, we find no variation. On the other hand, the data were simulated with log-scale (multiplicative) effect measure modification between case complexity and device type; as a result, we see that the stratum-specific \(\hat{\updelta}\)did|L estimates from the Poisson model vary slightly between strata and differ from the overall estimate, holding all covariates constant. In addition, as we shall see in the next section, because of the multiplicative effect measure modification, the (unweighted) average \(\hat{\updelta}\)did|L estimate holding covariates constant differs slightly from the (weighted) average \(\hat{\delta}\)DiD estimate which accounts for the frequency of each stratum in the sample.

Table 6 Summary of stratum-specific conditional effect estimates using the full dataset, including pre-COVID baseline. The synthetic data has no additive effect measure modification, but does have log-scale (multiplicative) effect measure modification. Conditional estimates are equal between strata when no effect measure modification is present

Estimating the Unconditional (Marginal) Effect After Adjustment

It is likely, however, that despite the need to adjust for confounders, we may actually be interested in the marginal average treatment effect in the treated; that is, in \(\hat{\updelta}\)did not \(\hat{\updelta}\)did|L. When we are confident about a lack of effect modification, the conditional DiD estimate can provide this information as we have seen. However, we can also estimate the unconditional DiD estimator directly using the output of the above regression model without making any assumptions about whether or not effect modification exists. This process, called standardization in epidemiology and marginalization in econometrics, requires some brief explanation.

Epidemiologists are likely familiar with direct and indirect standardization, where an outcome is updated to estimate the outcome that would have been obtained if one or more key covariates had been distributed to match another population. Age-standardized mortality rates are a common example of this type of standardization. Here, we are interested in knowing what the DiD estimate would have been in this population if our confounders had been distributed randomly between the two groups. As such, we need to standardize the estimate to the distribution of the confounders in our study sample. When this type of standardization is used to estimate a causal effect, it is called the g-formula or g-computation [11] — although those terms are more often used to refer more complex implementations with time-varying exposures.

To obtain our standardized or marginalized estimate, we need to use the covariate distributions to estimate E(Y|Z,T) from the values of E(Y|Z,T,L) produced by our regression model. We could do this by hand using Equation (10):

$$E\left[Y\vert T,Z\right]=\sum_lE\left[Y\vert T,Z,L=l\right]\ast P\left(L=l\right)=\sum_lE\left[Y\vert T,Z,Case\;complexity,Device\right]\ast f\left(Case\;complexity,Device\right)$$
(10)

Since we only have two confounders, this would be fairly a reasonable approach, and code to obtain values of P(L = l) for all confounder strata is included in Code Section 3.2. However, hand-calculation can be time-consuming and, whenever any confounder is continuous or the number of confounders is large, may be essentially impossible.

A simpler approach is provided in Code Section 3.3. First, we make copies of our data such that there is one copy for each strata of E[Y|T,Z] that we need to compute; here, and in most DiD analyses, we have 4 T-Z groups, so we need 4 copies of the dataset. Remember that the pre-COVID time point is only for getting a better pre-intervention trend and not for estimating the DiD, so we must exclude surgeries in this time point from our copies. For each copy, we delete their outcome value, and replace their period and group assignment with one of the four T-Z combinations. This provides us with a new version of our dataset in which every individual has been assigned to every group of interest. Now, we can compute, for each copy of each observation, the linear predictor (E[Y|T,Z,L]) based on the covariate values of the original observation and the new T-Z assignment. Finally, we simply average the predicted value within each set of copies, such that, for example, the predictors for copies with T = 0 and Z = 0 are averaged to provide an estimate of E[Y|T = 0, Z = 0]. Since the predicted value of the outcome within T-Z levels will be the same for all individuals who have the same confounder values, this process is simply an automation of the weighted summation in Equation (10).

In Stata, this process has been automated in the margins package. However, one note of caution: since we have included time as a variable in our model, and since time varies in a deterministic way with the value of A, we must be careful in how we apply the margins command. Specifically, we need to exclude the pre-COVID baseline surgeries from the marginalization. Code Section 3.4 in Stata provides the appropriate margins command and selects only the relevant pieces of the command output for our example.

Finally, we can use our standardized predicted outcomes to compute the desired estimator, \(\hat{\updelta}\)did, using Equation (3). Due to the multistep nature of this process, we will need to use bootstraps to obtain valid confidence intervals in SAS (Supplementary Code Section A2, SAS code only), while in Stata, the margins command uses the Delta method to obtain a valid variance that accounts for the regression and standardization components. The R code uses robust variance estimation, but we encourage researchers to consider the use of bootstrapping to obtain valid, non-conservative confidence intervals.

Table 5 row 3 reports the estimated values of \(\hat{\updelta}\)did obtained from the modeling procedure described above, as well as for the adjusted Poisson regression model (Table 5 row 4) described in the Supplementary Materials. Compared to the unadjusted analyses, we can see that there is a small amount of confounding bias away from the null which is reduced in our adjusted model, resulting in a slightly less negative effect estimate in the adjusted analyses (rows 3–4). Second, when there is no effect measure modification on the scale of the model by any of the covariates, we expect that the marginal and conditional estimates will be equal, and this is what we see in the data with the conditional DiD estimates (Table 6). Our synthetic data was created with effect measure modification on the log scale (i.e., multiplicative effect measure modification) but not on the linear scale (i.e., no additive effect measure modification), and therefore the marginal and conditional estimates differ when using adjusted Poisson models (row 4) but not when using adjusted linear models (row 3).

Summary

Overall, using the synthetic dataset, we estimate that the effect of the DBT intervention on the intervention departments was to decrease their average NOTECHS score by about 1.2 points on a scale from 4 to 48 (95% confidence interval: −1.3, −1.16). If this were the real data, we might conclude that the DBT slightly decreased surgical performance immediately following adoption and recommend additional training. However, despite significant result, a change in 1.2 points on a 44-point scale is quite small, so we might also conclude that the DBT did not meaningfully alter surgical quality in the month immediately following introduction of the DBT in the intervention group. It is important to remember that our synthetic data is not the real data — the original study found a null result and concluded the DBT had no short-term impact on surgical quality in the first month of use [6].

This manuscript is intended to provide a worked example of the estimation of the average causal effect of treatment among the treated using the difference-in-differences methods when applied to epidemiologic or medical data. All data and code used in this tutorial are available online at www.github.com/eleanormurray/DiDTutorial and as Online Supplementary Materials accompanying this article. The GitHub repository also includes Student and Teacher handouts, and the version at the time of publication acceptance are archived on Zenodo at https://doi.org/10.5281/zenodo.8178959.