Indirect effects and causal inference: reconsidering regression discontinuity

Abstract

Causal inference models, like regression discontinuity (RD) design, rely upon some variation of the no-interference assumption, where peer effects or spatial spillovers are null. Given the increased application of network, spatial, and peer effects models, this paper reconsiders RD design when this assumption is not satisfied, yielding indirect effects of the treatment in addition to the traditionally measured direct effects. Using a combination of residualization and numeric integration, we develop a method within the Spatial Durbin framework that retains the full adjacency matrix and allows for a full accounting of these cross-sectional interactions. As an application, we revisit a well-known RD design using U.S. House of Representatives election results from 1945–1995, finding that close election wins have substantial indirect effects which were previously unaccounted for.

Introduction

Regression discontinuity (RD) design has, over the previous two decades, become a mainstay of the empirical economics research toolbox (see Lee and Lemieux 2010 for a survey). Due to the difficulty, both in cost and ethical considerations, of randomized control trials (RCTs) economists have turned to this quasi-experimental design method to better understand the [causal] impacts of policy changes. First introduced by Thistlethwaite and Campbell (1960), RD has been used to identify the effect of passing school levies on new residential construction (Brasington 2017), school-board composition on student segregation (Macartney and Singleton 2017), implementation of gifted or high achieving classrooms on minority student achievement (Card and Giuliano 2016), traffic congestion following a cessation of public transit services (Anderson 2014), and class size on student achievement (Angrist and Lavy 1999), among many other topics. Peer effects, that is the impact changing one’s own characteristics has on neighboring outcomes, is quickly becoming a topic of great interest with respect to causal inference (Athey and Imbens 2017; Kolak and Anselin 2020). The importance of these indirect effects is made clear through violations of the no-interference assumption (see Imbens and Rubin 2015 for a discussion) which is critical to the potential outcome framework commonly used in economic literature. This paper reconsiders RD design when this assumption is not satisfied, yielding indirect effects in addition to the traditionally measured direct effects.

RD exploits the common use of pre-specified treatment rules which are linked, through a threshold of eligibility, to some continuous measurement (running variable) in order to estimate the direct effect of treatment (Footnote 1). Since agents cannot precisely control, or manipulate, their relative position around the threshold, they can be thought of as pseudo-randomized. This randomization produces, at least in theory, comparison groups which differ in expectation only with respect to the treatment effect (Rubin 1978). The RD framework has been shown to provide close approximations to RCT results in many cases (Cook and Wong 2008) and thus has become omnipresent in applied economics literature. Examples of continuous measurements used in determining eligibility for treatment include: vote share (Hainmueller and Kern 2008; Pettersson-Lidbom 2008; Eggers et al. 2015), test scores and academic ability (Thistlethwaite and Campbell 1960; Van der Klaauw 2002; Jacob and Lefgren 2004), and poverty rates (Ludwig and Miller 2007; Meng 2013), among others.

While the RD framework has turned into a nearly ubiquitous structure from which to draw causal statements it is not without its weaknesses; particularly the cross-sectional dependence commonly found in spatial and network models. It is well known that, in the face of such cross-sectional dependence, parameter estimates are potentially both biased and inefficient (Anselin 1988; LeSage and Pace 2009; Pace et al. 2011). Additionally, it has been shown that common responses to this dependence, e.g. fixed effects and/or clustered standard errors, can exacerbate model misspecification issues including bias, efficiency, and spuriousness of results (Anselin and Arribas-Bel 2013). The primary issue with these types of models in an RD context is a clear violation of the no-interference assumption which underlies much of the potential outcome framework upon which RD is based. In this paper we extend the local-linear RD to a generalized network RD framework—one which nests both network and non-network base specifications as a special case—using a combination of Bayesian sampling methods, residualization, and numeric integration. Following LeSage and Pace (2017) we use a Monte Carlo study to show the resulting model specification produces estimates with lower bias and mean-square-error (MSE) relative to its peers, even in the nested, network free case. Moreover, the approach contained herein allows for a full examination of both partial and cross-partial derivatives common to both spatial and network models (LeSage and Pace 2009).

These cross-partial derivatives, known as indirect effects, can be sizable depending on the strength of cross-sectional dependence. In a standard RD framework it is assumed that these cross-partial derivatives are equal to zero and that the total and direct effects of the treatment are equal. Put bluntly, this assumption prohibits the treatment from spilling over to other treated or non-treated units (see Kolak and Anselin 2020 for a discussion). While this may be a suitable assumption in some limited RCT settings—especially those with strict compliance protocols—in a quasi-experimental framework it is overly restrictive.

Let us abstract away from the RD framework for a moment in an effort to build some intuition. Consider the following example: we have a class of students who are scheduled to take an exam. We would like to know what impact, if any, a tutor would have on the students’ final scores. One way this can be done is to randomize which students get access to the treatment (the tutor in this example). As the students enter the classroom we might flip a fair coin and assign a tutor to all students who flip heads. For simplicity assume these tutors are homogeneous in their quality and that the treated students fully comply with the tutor and the tutor’s methods. After the exam we compare the scores of treated and non-treated students to estimate the average treatment effect (ATE).

Is the ATE measuring only the effect of tutors on test scores? Given the information outlined above, the answer is no. Cross-sectional dependence is introduced through a number of channels, though the most obvious in retrospect is the students’ ability to cheat off of one another. An untreated student’s test score, in the presence of cheating, depends upon the test scores of the treated and untreated students who sat next to him, a violation of the no-interference assumption. In addition to cheating, students may also form outside study groups comprising both treated and untreated students. The interaction within these groups confers some of the benefits from the tutor on those who were not selected for treatment. The first channel is relatively easy to close: we could produce a different exam for every student and thus prevent them from cheating. Alternatively, we could have students take the exam in isolation (Footnote 2). The second channel is a bit harder to deal with, though it is not clear we would want to prevent it. Students studying is a good thing!

How does our RCT example, flawed as it may be, shed light on the issue of cross-sectional dependence in quasi-experimental frameworks such as RD? The data we use in RDs are not collected in an RCT setting. As a result, we may not know which cross-sectional dependence inducing channels have been closed. For example, instead of assigning treatment through a random mechanism we could assign the services of a tutor based upon a student’s previous test score with some cutoff denoting eligibility. Any value drawn at or below the cutoff is treated while any above the cutoff is untreated. This is a classic RD setup in which we could examine, within a certain bandwidth, students around the cutoff and calculate a local average treatment effect (LATE). The assumption of course being that students one point below the cut-off are the same as those one point above the cut-off in all aspects other than the treatment. Without knowing if the aforementioned channels have been closed (e.g. Did the professor randomize exams?) we have left ourselves open to bias, inefficiency, and potential spurious results by assuming those channels had been closed. Moreover, as is often the case, we may be evaluating the impact of tutors specifically so that we can make a policy recommendation on public subsidy of tutors. If we do not fully account for the cross-sectional dependence we may overstate (or understate—depending on the direction of the indirect effects) the effectiveness of tutors on test scores, recommending a subsidy which is too large (or too small) and thus create inefficiencies in public financing.

One can think of a number of situations in which we would expect cross-sectional dependence to exist in causal studies. First, consider the significant evidence pointing to strong cross-sectional dependence in housing markets (Kim and Goldsmith 2009; Bin et al. 2011; Mihaescu and vom Hofe 2012; Wong et al. 2013; Lazrak et al. 2014; vom Hofe et al. 2019) where prices are dependent not only upon the characteristics of one’s own home, but also the price and characteristics of neighboring homes. It follows then that any treatment which affects the price of homes would have an effect on surrounding homes (treated and untreated). Hidano et al. (2015) use an RD framework to examine how buyers in Tokyo evaluate seismic risk via the price premium on the property. They are able to show that properties in a low-risk zone are at a price premium relative to those that are outside of the low-risk zone. While their work acknowledges the presence of cross-sectional dependence, and utilizes a spatial hedonic model in the form of Kelejian and Prucha (1998, 1999) to support their main conclusions, they do not incorporate the dependence into the RD framework itself and admit “...accounting for such cross-sectional interactions in a quasi-experimental framework is challenging and an interesting topic for future research (p. 121)”. A similar example comes in the form of Moulton et al. (2016), which examines the benefits of targeted property tax relief measures in Virginia, using an RD framework, and finds that increased demand stemming from the measure led to a 5% increase in home values. Again, acknowledging the possibility of these cross-sectional interactions, they employ a spatial fixed effects specification as a robustness check.

Second, and perhaps more well known to most applied economists, is the use of RD to tease out measures of incumbency advantage. Several studies have shown that vote share in the current election is positively impacted by close wins, as measured by margin of victory, in previous elections (see Lee and Lemieux 2010; Chib and Greenberg 2014; Cattaneo et al. 2015 for example). Meanwhile, there is evidence that voting exhibits positive cross-sectional dependence and thus would be exposed to indirect effects (Kim et al. 2003; Cho and Rudolph 2008; Cutts and Webber 2010; Lacombe et al. 2014). We can think of several channels by which this cross-sectional dependence may manifest. First, while elections are district specific, the local media coverage area may overlap with multiple districts either in part or whole. Political advertisements would thus have an impact over multiple constituencies, even if those advertisements are for a particular candidate. Moreover, get-out-the-vote initiatives by local political parties may cross district boundaries as they push for voter turnout in the area more generally rather than in a specific district. This is particularly relevant in situations where several tiers of government may be having simultaneous elections (e.g. House and Senate races). We find it interesting that use of RD in this context has become a bit controversial as researchers question if the basic assumptions of RD are being met in the aforementioned studies (see Caughey and Sekhon 2011; Eggers et al. 2015; De la Cuesta and Imai 2016 for a discussion). Yet, none of these criticisms take into account the potential for cross-sectional dependence which would violate the Stable Unit Treatment Value Assumption (Rosenbaum and Rubin 1983).

By generalizing the RD framework we are able to overcome some of these limitations. We limit our generalization to the local-linear framework for two reasons. First, recent research has shown that RD models using higher-order polynomials (e.g. third or fourth degree) in the running variable produce poor results in practice, with noisy point estimates and confidence intervals with ill-defined coverage (Gelman and Imbens 2018). Second, and perhaps more importantly, we find that interval construction in the local-linear method has a simple, clean form in a Bayesian context, with bandwidth selection, though important for identification purposes, playing a marginal role. Since the distribution of the dependence parameter is of unknown form, we utilize numeric integration to include uncertainty about both initial parameters and the bandwidth in our estimate of the local average treatment effect (LATE). In each iteration we recalculate the bandwidth based on our uncertainty about the residualization parameters. This creates a frontier of marginal observations which move in and out of the estimation sample throughout the algorithm. For computational efficiency we limit ourselves to bandwidth calculations as outlined by Imbens and Kalyanaraman (2012), although in our empirical example we relax this requirement and show bandwidth is not particularly relevant in that context.

Finally, we illustrate this new model specification using the now canonical example of close elections in the U.S. House of Representatives (Lee 2008; Imbens and Kalyanaraman 2012; Calonico et al. 2014). We show that U.S. House districts are in fact spatially correlated and that this correlation impacts the estimates in a material way. Our results show that close wins in a district during time t lead to an increase in vote share at time \(t+1\) of approximately \(15\%\), with roughly \(9\%\) attributable to the direct effects of treatment on the treated and approximately \(6\%\) the result of indirect effects. Original work by Lee (2008) and follow-up work by Caughey and Sekhon (2011) yield estimates of a direct effect between \(7\%\) and \(9\%\) depending on the specification. Possible mechanisms which would produce the indirect effects include a shift in ad-buying resources, increased voter turnout following a nearby win, and voter migration. These dimensions are not captured in the data, so the channels by which the indirect effects manifest themselves are speculative.

The remainder of this paper is organized as follows. Section 2 outlines the proposed model specification including interpretation of the resulting parameter(s). Section 3 includes a Monte Carlo study to examine the properties of the proposed specification. In Sect. 4 we contextualize the generalized RD framework using close elections in the U.S. House of Representatives. Finally, Sect. 5 concludes and offers avenues for additional research.

Generalizing the RD design

This section is presented through three subsections. The first outlines the general RD framework with focus on local-linear structures within a bandwidth around the cutoff. The second provides background on network models. We use these first two subsections to establish the mechanical framework and notation for our proposed model specification, which is presented in the third subsection.

Regression discontinuity design

RD is a method which has received quite a bit of renewed attention in recent years. Its primary purpose is to draw out estimates which are causal in nature by exploiting a deterministic relationship between a continuous running variable (RV) and a dichotomous treatment. For concreteness, suppose there is some treatment which is a monotonic, deterministic, and discontinuous function of some continuous variable, \(z_i\), where \(i = (1,\ldots ,N)\). For our purposes we refer to \(z_i\) as the value of the RV for the \(i^{th}\) individual, and Z to denote the \(N \times 1\) vector of values for each of the observations in a sample. Further, suppose that the treatment is of the form,

$$\begin{aligned} q_i = \left\{ \begin{array}{ll} 1&{}\quad z_i \ge z_0 \\ 0 &{}\quad z_i < z_0 \end{array}\right. \end{aligned}$$
(1)

where \(z_0\) is a sharp cutoff for treatment eligibility and application. Again, we use \(q_i\) to denote treatment of the \(i^{th}\) agent and Q refers to an \(N \times 1\) vector of indicators for the entire sample. Note that we, for ease of exposition, confine ourselves to the sharp RD case.

The interest here is to estimate,

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[(y_i|q_i = 1) - (y_i|q_i = 0)|z_i \in b], \end{aligned}$$
(2)

where \((y_i|q_i = 1)\) is the observed outcome, conditional on the ith agent being treated (\(q_i = 1\)), and \((y_i|q_i = 0)\) is the corresponding counterfactual. Since the counterfactual is unobserved, RD relies on estimating the average treatment effect by utilizing the discontinuity of Y, within the bandwidth, b, around the cutoff, \(z_0\). This can be written as,

$$\begin{aligned} \tau&= {{\,\mathrm{\mathbb {E}}\,}}[(y_i|q_i = 1) - (y_i|q_i = 0)|z_i \in b] \nonumber \\&= \frac{1}{N_{q = 1}}\sum _{z_i \in b, q_i = 1}y_i - \frac{1}{N_{q = 0}}\sum _{z_i \in b, q_i = 0}y_i \end{aligned}$$
(3)

There are a number of ways one could estimate the treatment effect, \(\gamma \), including local-linear methods, kernel regression, high-order polynomials, and others. For our purposes we focus on local-linear methods, that is, a pair of linear regressions on either side of the cutoff, for a few reasons. First, this allows for the slope on either side of the cutoff to vary (Lee and Lemieux 2010), a parameter we see no need to constrain as equal (Footnote 3). Second, it has recently been shown that inferences drawn from high-order polynomial models are untrustworthy, with noisy point estimates and confidence intervals with ill-defined coverage (Gelman and Imbens 2018). Using a local-linear model in a Bayesian framework produces transparent credible intervals using standard probability theory, a point to which we will return later. Finally, while the econometric literature focusing on RD tends towards relaxing the assumptions of linearity, the most common use in applied work tends to include linear assumptions as a primary modeling mechanism; see Thistlethwaite and Campbell (1960), Ludwig and Miller (2007), and Hidano et al. (2015) among others.

Prior to estimating a local-linear model it is necessary to establish a bandwidth, b, which is the subset of observations deemed to be quasi-randomized. This bandwidth is data driven and identifies which observations are likely to be interchangeable around the cut-off. As one would expect, since this bandwidth requires a researcher to discard data (those observations outside of the bandwidth), there is considerable discussion regarding its selection. The methodology here is presented using the data-driven bandwidth constructed by Imbens and Kalyanaraman (2012), though bandwidth construction using methods outlined by Calonico et al. (2014) is more commonly used in current empirical work. The reason for our use of bandwidths calculated using methods from Imbens and Kalyanaraman (2012) will become apparent later. It is important to note that construction of the bandwidth and subsequent discarding of data is done solely for identification purposes. It facilitates the imitation of an RCT and allows for the parameter estimate to be viewed as causal rather than correlative.

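Going forward, and for concreteness, we will use \(\bar{\cdot }\) to denote observations above the cutoff, and \(\underline{\cdot }\) for those below. Local-linear methods estimate the treatment parameter, \(\gamma \), using only those observations in b, through the use of two equations:

$$\begin{aligned} \bar{Y} = \varvec{\bar{Z}}\varvec{\bar{\gamma }} + \bar{\epsilon } \end{aligned}$$
(4)
$$\begin{aligned} \underline{Y} = \varvec{\underline{Z}}\varvec{\underline{\gamma }} + \underline{\epsilon } \end{aligned}$$
(5)

where \(\varvec{\bar{Z}} = [\iota _{\bar{n}} \ (\bar{Z} - z_0)]\), \(\varvec{\bar{\gamma }} = [\bar{\gamma } \ \bar{\zeta }]'\), \(\varvec{\underline{Z}} = [\iota _{\underline{n}} \ (\underline{Z} - z_0)]\), and \(\varvec{\underline{\gamma }} = [\underline{\gamma } \ \underline{\zeta }]'\). By centering the RV through \((\bar{Z}-z_0)\) and \((\underline{Z}-z_0)\), the intercept of each equation provides the value at \(z_0\), or more directly, the difference between them, \(\gamma = \bar{\gamma } - \underline{\gamma }\), gives the size of the discontinuity at the threshold, which is the parameter of interest. Again, it is important to recognize that this is a local average treatment effect and any generalizations to the average treatment effect on the population should be tempered by the lack of external validity in the method.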
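The pair of local regressions on either side of the cutoff can be illustrated with a short numerical sketch. The function name `local_linear_rd`, the uniform (rectangular) kernel, the hand-picked bandwidth `h`, and the simulated data are all assumptions made for illustration; in particular, the bandwidth here is not computed with the Imbens and Kalyanaraman (2012) procedure.

```python
import numpy as np

def local_linear_rd(y, z, z0, h):
    """Sharp RD sketch: difference of the two intercepts at the cutoff z0.

    Uses only observations with |z - z0| <= h and a uniform kernel
    (both are illustrative assumptions).
    """
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    above = (z >= z0) & (z <= z0 + h)   # treated side, q_i = 1
    below = (z < z0) & (z >= z0 - h)    # untreated side, q_i = 0

    def fit(mask):
        # Design matrix [iota, (Z - z0)]: the intercept is the fitted value at z0.
        X = np.column_stack([np.ones(mask.sum()), z[mask] - z0])
        coef, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
        return coef  # [intercept, slope]

    return fit(above)[0] - fit(below)[0]

# Simulated example with a true jump of 2.0 at z0 = 0 (hypothetical data).
rng = np.random.default_rng(0)
z = rng.uniform(-1, 1, 2000)
y = 0.5 + 0.8 * z + 2.0 * (z >= 0) + rng.normal(0, 0.1, 2000)
gamma_hat = local_linear_rd(y, z, z0=0.0, h=0.25)
```

Because the running variable is centered at the cutoff, each intercept is the fitted outcome at \(z_0\), so their difference is the estimated discontinuity.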

Network models

Peer effects, that is, the impact that changing one’s own outcome or characteristics has on neighboring outcomes, are quickly becoming a topic of great interest with respect to causal inference (Athey and Imbens 2017; Kolak and Anselin 2020). The importance of peer effects is made clear through violations of the no-interference assumption (see Imbens and Rubin 2015 for a discussion), which is critical to the potential outcome framework commonly used in economic literature. Our main purpose in this section is to provide the notational framework we will use going forward. Networks are characterized in a variety of ways; however, a few attributes are relatively ubiquitous through the literature. First, each network is typically represented by a graph, \(\varvec{\mathcal {G}}\). This graph is made up of vertices, \(\varvec{\mathcal {V}}\), and edges, \(\varvec{\mathcal {E}}\). The vertices are indexed by \(i = 1,\ldots ,N\), are typically a finite set, and are the unit of observation in the analysis (e.g. agents, firms, regions, etc.). An edge is formed when two agents are connected, that is \(ij \in \varvec{\mathcal {E}}\).

Algebraically, these graphs are represented through an adjacency or weight matrix which we will denote as \(\varvec{G}\). Let \(g_{ij} = 1\) if an edge exists between agents i and j and zero otherwise. In a directed graph it is not necessary that the connection be reciprocated, that is \(g_{ij}\) need not equal \(g_{ji}\), while in an undirected graph \(g_{ij} = g_{ji} \ \forall \ i, j\). We assume that agents cannot be connected to themselves, that is \(g_{ii} = 0 \ \forall \ i\), and as a result the maximum number of connections is \(N^2-N\), though in practice this matrix tends to be quite sparse. Finally, we make no assumption regarding the symmetry of \(\varvec{G}\) and use an asymmetric weight matrix for simulations in Sect. 3, and a symmetric matrix in Sect. 4.

Using the above notation we can write the following model,

$$\begin{aligned} y_i = \alpha + \rho \sum _{j = 1}^{N}g_{ij}y_j + \sum _{k = 1}^{K}\beta ^k x_i^k + \sum _{k = 1}^{K}\sum _{j = 1}^{N}\phi ^kg_{ij}x_j^k + \epsilon _i. \end{aligned}$$
(6)

This can be written in matrix form as,

$$\begin{aligned} Y = \rho \varvec{G}Y + X\beta + \varvec{G}X\phi + \epsilon \end{aligned}$$
(7)

with the data generating process written as,

$$\begin{aligned} Y = (I_N - \rho \varvec{G})^{-1}(X\beta + \varvec{G}X\phi + \epsilon ). \end{aligned}$$
(8)
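A minimal simulation sketch may help fix ideas about the data generating process in Eq. 8. The ring-shaped network, the parameter values, and the variable names below are hypothetical choices made for illustration and are not taken from the Monte Carlo design of Sect. 3.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 200, 2

# Hypothetical ring network: each agent is linked to its two nearest neighbours.
idx = np.arange(N)
G = np.zeros((N, N))
G[idx, (idx + 1) % N] = 1.0
G[idx, (idx - 1) % N] = 1.0
G /= G.sum(axis=1, keepdims=True)   # row-normalise so each row sums to one

# Illustrative parameter values (assumptions, not estimates from the paper).
rho = 0.5
beta = np.array([1.0, -0.5])
phi = np.array([0.3, 0.2])
sigma = 1.0

X = rng.normal(size=(N, K))
eps = rng.normal(scale=sigma, size=N)

# Eq. 8: Y = (I - rho G)^{-1} (X beta + G X phi + eps).
# Solving the linear system avoids forming the inverse explicitly.
A = np.eye(N) - rho * G
Y = np.linalg.solve(A, X @ beta + G @ X @ phi + eps)
```

The simulated outcome satisfies the structural form in Eq. 7 by construction, which is a convenient sanity check when building a simulation of this kind.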

As defined earlier, Y is the \(N\times 1\) vector of outcomes and X is an \(N\times K\) information matrix. We assume that the error term, \(\epsilon \), is identically and independently distributed with mean zero and variance \(\sigma ^2\). Additionally, we assume that \(\varvec{G}\) is exogenous (Footnote 4).

In this structure both \(\varvec{G}Y\) and \(\varvec{G}X\) represent the linear combination of connected agent outcomes and characteristics respectively. Astute readers will recognize this as the Durbin class of model commonly used in spatial econometrics (LeSage 2014). In practice this matrix is often row-normalized, meaning that the term \(\rho \varvec{G}Y\) represents the weighted average of neighboring outcomes. Additionally, the row-normalization acts as a bounding mechanism ensuring that \(\rho \in (\psi _{min}^{-1},1)\), where \(\psi _{min}\) is the minimum real eigenvalue of \(\varvec{G}\).
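As a small illustration of row-normalization and the resulting bounds on \(\rho \), the sketch below builds an adjacency matrix from an edge list and inspects its eigenvalues. The four-agent network and all variable names are hypothetical, and the code assumes every agent has at least one connection so that no row sum is zero.

```python
import numpy as np

# Hypothetical undirected network on four agents: edges 0-1, 1-2, 2-3.
edges = [(0, 1), (1, 2), (2, 3)]
N = 4
G = np.zeros((N, N))
for i, j in edges:
    G[i, j] = 1.0
    G[j, i] = 1.0   # undirected: g_ij = g_ji

# Row-normalise: each row sums to one, so W @ y is the average of neighbours' outcomes.
W = G / G.sum(axis=1, keepdims=True)

# Admissible range for rho is (1 / psi_min, 1 / psi_max); psi_max = 1 after row-normalisation.
eigvals = np.linalg.eigvals(W).real
psi_min, psi_max = eigvals.min(), eigvals.max()
rho_lower, rho_upper = 1.0 / psi_min, 1.0 / psi_max
```

Note that row-normalization generally destroys symmetry even when the underlying graph is undirected, since agents with different numbers of connections receive different weights.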

It is important to note that the partial derivatives from the network model outlined above differ dramatically from those of a standard linear regression. For the kth variable the partial derivative for the former can be expressed as,

$$\begin{aligned} \partial y/\partial x^k = (I_N - \rho \varvec{G})^{-1}(I_N\beta ^k + \varvec{G}\phi ^k), \end{aligned}$$
(9)

while the latter is \(n^{-1}tr(I_N\beta ^k)\), where \(tr(\cdot )\) is the trace operator; this simplifies to \(\beta ^k\). Since Eq. 9 is a dense matrix representing a global spillover structure it is useful to convert the information into scalar summaries (LeSage and Pace 2009). Letting \(S_k(\varvec{G}) = \frac{\partial y}{\partial x^k}\), the total impact to an observation is \(n^{-1}\iota _n'(S_k(\varvec{G})\iota _n)\) while the total impact from an observation is \(n^{-1}(\iota _n'S_k(\varvec{G}))\iota _n\), where \(\iota _n\) is an N-dimensional vector of ones and \(n = N\) (LeSage and Pace 2009). In practice, for an undirected and symmetric weight matrix, these are the same value and are known as the total effects. Total effects can be decomposed into direct and indirect effects:

$$\begin{aligned} \underbrace{n^{-1}(\iota _n'S_k(\varvec{G}))\iota _n}_{\mathrm {Total \ Effects}} = \underbrace{n^{-1}tr(S_k(\varvec{G}))}_{\mathrm {Direct \ Effects}} + \underbrace{\big (n^{-1}(\iota _n'S_k(\varvec{G}))\iota _n - n^{-1}tr(S_k(\varvec{G}))\big )}_{\mathrm {Indirect \ Effects}} \end{aligned}$$
(10)

The direct effects, that is the impact of changes in one’s own characteristics as well as feedback stemming from changes in one’s neighbors, are \(n^{-1}tr(S_k(\varvec{G}))\), where \(tr(\cdot )\) is the trace operator. The indirect effects are simply the total minus the direct effects and represent the average cross-partial derivative. These summary measures reflect how changes propagate through the simultaneous dependence structure and result in a new steady state.
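The scalar summaries in Eq. 10 can be computed directly from the matrix of partial derivatives in Eq. 9. The following is a minimal sketch under assumed inputs: the function name `sdm_effects`, the ring network, and the parameter values are illustrative only. It forms the dense \(N \times N\) matrix explicitly, which is adequate for small N but not how one would proceed for large networks.

```python
import numpy as np

def sdm_effects(G, rho, beta_k, phi_k):
    """Direct, indirect and total effects for the k-th variable (Eq. 10)."""
    N = G.shape[0]
    I = np.eye(N)
    # S_k(G) = (I - rho G)^{-1} (I beta_k + G phi_k), Eq. 9
    S_k = np.linalg.solve(I - rho * G, beta_k * I + phi_k * G)
    total = S_k.sum() / N          # n^{-1} iota' S_k iota
    direct = np.trace(S_k) / N     # n^{-1} tr(S_k)
    indirect = total - direct      # average cross-partial derivative
    return direct, indirect, total

# Hypothetical row-normalised ring network with two neighbours per agent.
N = 50
idx = np.arange(N)
G = np.zeros((N, N))
G[idx, (idx + 1) % N] = 0.5
G[idx, (idx - 1) % N] = 0.5

direct, indirect, total = sdm_effects(G, rho=0.5, beta_k=1.0, phi_k=0.3)

# With no network dependence the decomposition collapses to the usual coefficient.
direct0, indirect0, total0 = sdm_effects(G, rho=0.0, beta_k=1.0, phi_k=0.0)
```

For a row-normalized matrix the total effect has the closed form \((\beta ^k + \phi ^k)/(1-\rho )\), which gives a quick check on the computation. Setting \(\rho = 0\) and \(\phi ^k = 0\) recovers the standard case in which the direct effect equals \(\beta ^k\) and the indirect effect is zero.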

Network regression discontinuity design

The previous two sections have shown that in order to properly identify the treatment effect in an RD setting it is preferable, if not necessary, to cut away the data not within the bandwidth around the cutoff. This leaves a quasi-randomization of the remaining observations around the cutoff, thereby approximating an RCT, which is then subject to a local linear model (or alternative specification, e.g., non-parametric modeling) to estimate the treatment effect. This is in direct conflict with the concept of network models, which show that the connectivity structure of the network is important to accurate inference from any model.

The root issue of incorporating these cross-sectional interactions lies with the structure of the adjacency or weight matrix. This matrix defines the connectivity structure (how information is transmitted between the observations) and includes interactions between those agents within the bandwidth around the cutoff and those that are not. Removing the observations outside the bandwidth introduces bias in the parameter estimates, since the outcomes of those in the bandwidth may be dependent upon the outcomes or characteristics of the discarded observations (Footnote 5).

We solve this problem through a multi-stage estimation procedure which incorporates elements of residualization and numeric integration nested within a Bayesian sampler. We do this in three general stages:

  1. Residualize the outcome by using the appropriate spatial specification (e.g., Eq. 7). Note that the treatment variable is not included in this stage and as a result is captured in the residuals. Conditional upon these parameters, filter the network effects out of the residuals.

  2. Estimate the treatment effect using the filtered residuals on a weighted sub-sample determined by the chosen kernel and bandwidth.

  3. Conditional upon the estimated treatment effect, re-estimate the original spatial specification using the entire sample.

The first stage requires us to residualize the outcome by estimating the parameters of interest in the model outlined by Eq. 7; think of this as netting out the variation from characteristics which are not our treatment while at the same time estimating the scalar parameter governing strength of the network. Residualization is by no means novel in and of itself; it has become quite popular over recent years (Sales and Hansen 2014; Chernozhukov et al. 2017; Terrier and Ridley 2018). Since the residuals are a function of an unknown parameter, \(\rho \), and an observed connectivity structure, \(\varvec{G}\), we employ numerical integration to marginalize \(\rho \) as a nuisance parameter (Footnote 6). To do this we assume that the connectivity matrix, \(\varvec{G}\), is strictly orthogonal to the treatment, Q. That is, treatment on observation i will impact the outcome for observation i but does not influence whether observation j is treated, and impacts the outcome of j only through changes in the outcome of i.

With this in mind, consider the following:

$$\begin{aligned} p(\beta ,\phi ,\sigma ^2, \rho |Y,X,\varvec{G}) \propto \pi (\beta ,\phi ,\sigma ^2)\pi (\rho )p(Y,X,\varvec{G}|\beta ,\phi ,\sigma ^2,\rho ), \end{aligned}$$
(11)

where \(p(Y,X,\varvec{G}|\beta ,\phi ,\sigma ^2,\rho )\) is the likelihood, \(\pi (\beta ,\phi ,\sigma ^2)\) and \(\pi (\rho )\) represent suitable priors over the listed parameters, and \(p(\beta ,\phi ,\sigma ^2, \rho |Y,X,\varvec{G})\) is the joint posterior distribution. Assuming the process is linear, or is appropriately modeled as such, and utilizing the common Normal Inverted-Gamma (NIG) structure we can write (11) as,

$$\begin{aligned} p(\beta ,\phi ,\sigma ^2,\rho )&\propto \sigma ^{-n-1}|A| \nonumber \\&\times \exp \bigg [-\frac{1}{2\sigma ^2}\big ((AY - X\beta - \varvec{G}X\phi )'(AY - X\beta - \varvec{G}X\phi )\big )\bigg ]\pi (\rho ), \end{aligned}$$
(12)

where \(A = (I_N - \rho \varvec{G})\) and \(\pi (\rho )\) is the prior distribution on \(\rho \) (Footnote 7). Typically, this is either a Uniform or Beta distribution (see LeSage and Pace 2009 for a discussion). Note we have included the intercept as a column of ones in X. This is the posterior form for Eq. 7. Let \(B = [\beta ' ,\phi ']'\), \(X_{\varvec{G}} = [\iota _N \ X \ \varvec{G} X]\), \(\pi (B) \sim N(b_0, V_0)\), \(\pi (\sigma ^2) \sim IG(a_0,c_0)\), and \(\pi (\rho ) \sim Beta(a_1,a_2)\). This leads to a sampler based on the following conditional distributions,

$$\begin{aligned} p(B_{(m)}|\sigma ^2_{(0)}, \rho _{(0)})&\sim N(D_Bd_B,\sigma ^2D_B) \nonumber \\ D_B&= (X'_{\varvec{G}}X_{\varvec{G}} + V_0^{-1})^{-1} \nonumber \\ d_B&= X'_{\varvec{G}}(I - \rho \varvec{G})Y + V_0^{-1}b_0 \end{aligned}$$
(13)
$$\begin{aligned} p(\sigma ^2_{(m)}|\rho _{(0)}, B_{(m)})&\sim IG(a,c) \nonumber \\ a&= a_0 + N/2 \nonumber \\ c&= c_0 + (AY - X_{\varvec{G}}B)'(AY - X_{\varvec{G}}B)/2 \nonumber \\ A&= (I - \rho \varvec{G}) \end{aligned}$$
(14)
$$\begin{aligned} p(\rho _{(m)}|B_{(m)}, \sigma ^2_{(m)})&\propto |I - \rho \varvec{G}| \exp \bigg (-\frac{1}{2\sigma ^2}(AY - X_{\varvec{G}}B)'(AY - X_{\varvec{G}}B)\bigg )\pi (\rho ) \end{aligned}$$
(15)

Note that Eq. 15 is of unknown form leading to either a draw by integration and inversion (aka “Griddy Gibbs”) or a Metropolis–Hastings (M–H) step (LeSage and Pace 2009). So far, we have broken no new ground and these results are well known; it is here where we depart from the standard approach.
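One sweep through the three conditional distributions can be sketched as follows. This is an illustrative implementation under several assumptions rather than the code used for the results in this paper: the function names, the simulated ring network, the diffuse prior values, and the grid for \(\rho \) are hypothetical, the prior on \(\rho \) is taken to be flat so that \(\pi (\rho )\) drops out of the kernel, and the draw for \(\rho \) inverts a discrete CDF on the grid, which is a coarse version of the integration and inversion approach.

```python
import numpy as np

def log_det_grid(G, grid):
    """log|I - rho G| at each grid value of rho (computed once, reused every sweep)."""
    I = np.eye(G.shape[0])
    return np.array([np.linalg.slogdet(I - r * G)[1] for r in grid])

def draw_B(Y, GY, XG, rho, sigma2, b0, V0_inv, rng):
    """Eq. 13: B | sigma2, rho ~ N(D d, sigma2 D)."""
    D = np.linalg.inv(XG.T @ XG + V0_inv)
    d = XG.T @ (Y - rho * GY) + V0_inv @ b0
    return rng.multivariate_normal(D @ d, sigma2 * D)

def draw_sigma2(Y, GY, XG, B, rho, a0, c0, rng):
    """Eq. 14: sigma2 | B, rho ~ IG(a, c), drawn as the reciprocal of a gamma variate."""
    e = Y - rho * GY - XG @ B
    a = a0 + Y.size / 2
    c = c0 + e @ e / 2
    return 1.0 / rng.gamma(shape=a, scale=1.0 / c)

def draw_rho(Y, GY, XG, B, sigma2, grid, logdets, rng):
    """Eq. 15 on a grid (flat prior assumed): log kernel -> discrete CDF -> inversion."""
    e0 = Y - XG @ B
    # (AY - XG B)'(AY - XG B) is quadratic in rho, so it can be evaluated on the whole grid.
    sse = e0 @ e0 - 2 * grid * (e0 @ GY) + grid**2 * (GY @ GY)
    logp = logdets - sse / (2 * sigma2)
    p = np.exp(logp - logp.max())          # subtract the max for numerical stability
    cdf = np.cumsum(p) / p.sum()
    pos = min(np.searchsorted(cdf, rng.uniform()), grid.size - 1)
    return grid[pos]

# Hypothetical data: row-normalised ring network, one covariate plus its network lag.
rng = np.random.default_rng(2)
N = 300
idx = np.arange(N)
G = np.zeros((N, N))
G[idx, (idx + 1) % N] = 0.5
G[idx, (idx - 1) % N] = 0.5
X = rng.normal(size=(N, 1))
XG = np.column_stack([np.ones(N), X, G @ X])
B_true, rho_true = np.array([1.0, 2.0, 0.5]), 0.4
Y = np.linalg.solve(np.eye(N) - rho_true * G, XG @ B_true + rng.normal(size=N))
GY = G @ Y

grid = np.linspace(-0.99, 0.99, 199)       # assumes a row-normalised G
logdets = log_det_grid(G, grid)
b0, V0_inv, a0, c0 = np.zeros(3), 1e-4 * np.eye(3), 2.0, 1.0   # diffuse priors (assumed)

B, sigma2, rho = np.zeros(3), 1.0, 0.0
keep = []
for m in range(600):
    B = draw_B(Y, GY, XG, rho, sigma2, b0, V0_inv, rng)
    sigma2 = draw_sigma2(Y, GY, XG, B, rho, a0, c0, rng)
    rho = draw_rho(Y, GY, XG, B, sigma2, grid, logdets, rng)
    if m >= 200:                            # discard burn-in draws
        keep.append(rho)
rho_mean = np.mean(keep)
```

Precomputing the log-determinants over the grid is a common design choice because that term depends only on \(\rho \) and \(\varvec{G}\), not on the current draws of B and \(\sigma ^2\).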

Having obtained estimates for B, \(\sigma ^2\), and \(\rho \) we residualize the outcome by subtracting the conditional mean, \((I - \rho \varvec{G})^{-1}X_{\varvec{G}}B\). However, we cannot simply use this residualized outcome since it is still a function of \(\rho \) and \(\varvec{G}\). Consider the following,

$$\begin{aligned} \epsilon _{(m)} = Y - (I - \rho _{(m)}\varvec{G})^{-1}X_{\varvec{G}}B_{(m)} \end{aligned}$$
(16)

which implies that,

$$\begin{aligned} \epsilon _{(m)} \sim N(0,(I-\rho _{(m)}\varvec{G})^{-1}(I-\rho _{(m)}\varvec{G})^{-1'}\tau ). \end{aligned}$$
(17)

Using this, we filter \(\epsilon \) by numerically integrating over \(\rho \) and B. That is to say, for each draw (m) of \(\rho \) and B in the sampler we create a new vector of filtered residuals by,

$$\begin{aligned} {\tilde{\epsilon }}_{(m)} = (I-\rho _{(m)}\varvec{G})(Y- (I-\rho _{(m)}\varvec{G})^{-1}X_{\varvec{G}}B_{(m)}), \end{aligned}$$
(18)

which results in,

$$\begin{aligned} \tilde{\epsilon }_{(m)} \sim N(0,I\tau ), \end{aligned}$$
(19)

where \(m = 1,\ldots ,M\) and M is a suitable number of post burn-in iterations through the sampler.
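The residualization and filtering steps in Eqs. 16 and 18 can be sketched for a single posterior draw as follows. This is an illustrative sketch under stated assumptions: the function name `filtered_residuals` and its argument layout are hypothetical, and `XG` stands for the stacked regressor matrix \(X_{\varvec{G}}\).

```python
import numpy as np

def filtered_residuals(Y, XG, G, rho, B):
    """Return (eps, eps_tilde) for one posterior draw (rho, B)."""
    A = np.eye(G.shape[0]) - rho * G
    eps = Y - np.linalg.solve(A, XG @ B)   # Eq. 16: residual, still correlated through A^{-1}
    eps_tilde = A @ eps                    # Eq. 18: filtered residual
    return eps, eps_tilde
```

Because \((I-\rho \varvec{G})(Y - (I-\rho \varvec{G})^{-1}X_{\varvec{G}}B) = (I-\rho \varvec{G})Y - X_{\varvec{G}}B\), the filtered residual can also be computed without solving a linear system, which matters when this step is repeated for every draw of the sampler.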

Here we have residualized the outcome as mentioned in Lee and Lemieux (2010) and then, recognizing that the residuals are correlated, transformed them through our draw of the dependence parameter. Subjecting \(\epsilon \) to a Moran’s I test for cross-sectional correlation would tend to reject the null of no correlation, while the opposite is true for \(\tilde{\epsilon }\). It is important to remember that, since \(\rho \) is an unknown parameter whose conditional distribution is of unknown form, the filtering of residuals with each draw from the joint posterior ensures uncertainty from the initial parameter estimates will propagate through the remainder of the sampler. At this point it should be clear that we have distilled this larger problem involving cross-section dependence through network connections down to a simpler structure that fits within the traditional RD framework.

Now that we have a residual, \(\tilde{\epsilon }\), which is orthogonal to \(\varvec{G}\), X, \(\varvec{G}X\), and \(\rho \) we can consider establishing a bandwidth around the cutoff \(z_0\). Here, for the sake of brevity and computational efficiency, we use methods from Imbens and Kalyanaraman (2012) to calculate the bandwidth around \(z_0\). Note that we do this M times over the course of our sampler with M different vectors of residuals; as a result our estimate of the bandwidth will change at each iteration. These changes come from the posterior draw of the relevant residualization and filtering parameters creating a new set of residuals for each iteration. This creates a frontier of marginally relevant observations which move into and out of the estimation over the sampler iterations. Since these bandwidths are data driven, and each time our data is changing based on uncertainty around the residualizing parameters, we are marginalizing the bandwidth as a nuisance parameter much like we did with the network itself. Additionally, this allows us to plot the posterior distribution of the bandwidth and construct posterior density intervals that quantify how much the bandwidth changes over the sampler.

Again, we assume that the resulting treatment can be modeled using a local-linear structure and using a NIG structure, see Eqs. (4) and (5), for both observations above and below the cutoff we sample from the following conditional distributions:

(20)
(21)
(22)
(23)

where \(\bar{\gamma }\) and its below-cutoff counterpart represent the vectors of parameters, \(\gamma \) and \(\zeta \), above and below the cutoff first introduced in Eqs. (4) and (5). We use \(\bar{\varvec{Z}}\) to denote the matrix \([\bar{\iota } \ \bar{Z}]\), where \(\bar{\iota }\) is a vector of ones for observations above the cutoff, with the analogous matrix defined for observations below. For this stage we use kernel weighted observations based on the distance from the cutoff using Imbens and Kalyanaraman (2012). Draws from these conditional distributions produce a filtered local average treatment effect (LATE).Footnote 8
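The second stage can be illustrated with a simplified, non-Bayesian stand-in: instead of drawing from the NIG conditionals, the sketch below computes the kernel-weighted least-squares intercepts on each side of the cutoff and takes their difference. The function name `local_linear_late`, the triangular kernel, and the fixed bandwidth `h` are assumptions made for illustration; in the sampler the bandwidth is recomputed at every iteration and the intercepts are drawn rather than point-estimated.

```python
import numpy as np

def local_linear_late(resid, z, z0, h):
    """Difference in kernel-weighted local-linear intercepts at the cutoff z0."""
    def intercept(mask):
        d = z[mask] - z0
        w = np.sqrt(1.0 - np.abs(d) / h)              # square root of triangular kernel weights
        Zmat = np.column_stack([np.ones(d.size), d])  # [iota, Z - z0]
        coef, *_ = np.linalg.lstsq(Zmat * w[:, None], resid[mask] * w, rcond=None)
        return coef[0]
    inside = np.abs(z - z0) < h                       # keep observations within the bandwidth
    return intercept(inside & (z >= z0)) - intercept(inside & (z < z0))
```

Scaling both sides of the regression by the square root of the kernel weights is equivalent to weighted least squares, so separate slopes are fit on either side of the cutoff and only the intercepts enter the estimated jump.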

Since there is a great deal of consternation with respect to appropriate inference from RD type models (Calonico et al. 2014; Gelman and Imbens 2018) we would like to take a few moments and point out what is rather transparent in a Bayesian sense but not so much when using frequentist estimation methods. The LATE is the difference between two conditional Gaussian distributions, which produces a new Gaussian distribution with mean \(\mu _1 - \mu _2\) and variance \(\sigma ^2_1 + \sigma _2^2 - 2\mathrm {Cov}(\cdot )\), where \(\mathrm {Cov}(\cdot )\) is the covariance. In Sect. 3 we show, through simulations, that the intervals produced for \(\gamma \) by the process outlined herein are nearly equivalent to those found by using interval estimates from Calonico et al. (2014).

Obtaining a filtered estimate of the LATE puts us in a similar position to current RD methodology, and indeed if that is all one was interested in then stopping here seems appropriate. However, as mentioned earlier, we are interested in being able to evaluate a policy which may produce spillovers relevant to its evaluation. To accomplish this we return to the original model in our sampler and, conditional upon the data and the drawn LATE, sample from the following conditional distributions:

(24)
(25)
(26)

Recall that \(B = [\beta ' ,\phi ']'\), Q is the vector of treatment indicators outlined in Eq. (1), and the estimates for both \(\beta \) and \(\phi \) are conditional upon the drawn values of the LATE. This is particularly relevant since the estimates which produce \(\gamma \) come from a smaller sample size, and as mentioned earlier, the variance of the LATE is equal to the sum of its component variances. By conditioning upon these draws from Eqs. (4) and (5) we carry that uncertainty into the full sample estimates of other parameters.

The full algorithm can be written as:Footnote 9

  1. Establish parameters on prior distributions, arbitrary starting values for each parameter, and the number of iterations with a suitable burn-in period.

  2. Sample from \(p(B_{(1)}|\sigma ^2_{(0)}, \rho _{(0)}) \sim N(D_Bd_B,\sigma ^2_{(0)}D_B)\).

  3. Sample from \(p(\sigma ^2_{(1)}|\rho _{(0)}, B_{(1)}) \sim IG(a,c)\).

  4. Sample from

    $$\begin{aligned} p(\rho _{(1)}|B_{(1)}, \sigma ^2_{(1)})&\propto |I - \rho _{(1)}\varvec{G}| \nonumber \\&\times \exp \bigg (-\frac{1}{2\sigma ^2_{(1)}}(AY - X_{\varvec{G}}B_{(1)})'(AY - X_{\varvec{G}}B_{(1)})\bigg ) \end{aligned}$$
    (27)

    using “Griddy Gibbs” or M–H.

  5. Create the vector of filtered residuals via \(\tilde{\epsilon }_{(1)} = (I - \rho _{(1)}\varvec{G})(Y - (I - \rho _{(1)}\varvec{G})^{-1}X_{\varvec{G}}B_{(1)})\).

  6. Calculate the bandwidth using the chosen method, conditional on \(\tilde{\epsilon }_{(1)}\).

  7. Sample from \(p\big (\bar{\varvec{\gamma }}|\bar{\tau }_{(0)}, \bar{\epsilon }_{(1)}, (\bar{Z} - z_0)\big ) \sim N(D_{\bar{\gamma }}d_{\bar{\gamma }}, \bar{\tau }D_{\bar{\gamma }})\) for the parameters above the cutoff.

  8. Sample from the corresponding conditional distribution for the parameters below the cutoff.

  9. Sample from \(p\big (\bar{\tau }_{(1)}|\bar{\varvec{\gamma }}, \bar{\epsilon }_{(1)}, (\bar{Z} - z_0)\big ) \sim IG(\bar{a},\bar{c})\) for the variance above the cutoff.

  10. Sample from the corresponding conditional distribution for the variance below the cutoff.

  11. Sample from the full-sample conditional distribution for B, conditional upon the data and the drawn LATE.

  12. Sample from the full-sample conditional distribution for \(\sigma ^2\), conditional upon the data and the drawn LATE.

  13. Sample from the full-sample conditional distribution for \(\rho \), conditional upon the data and the drawn LATE, using “Griddy Gibbs” or M–H.

  14. Return to step 2 until M draws are complete, where M is a suitable post burn-in value.

The result of this sampler is the ability to evaluate the LATE estimate within the context of cross-sectional interactions. We want to highlight an important assumption that the reader shouldn’t forget. The estimate produced using Eqs. (20) and (22) is a local average, not an average, treatment estimate. By conditioning the full sample estimates on the LATE we are using the LATE as a proxy for the ATE, specifically so that we can evaluate the partial derivatives of the Durbin model.

The partial derivative with respect to the treatment effect can be written as,

$$\begin{aligned} \Delta y / \Delta q = S_{\gamma }(\varvec{G}) = (I_n - \tilde{\rho }\varvec{G})^{-1}(I_n\gamma ). \end{aligned}$$
(28)

This partial derivative is a dense matrix which outlines the long-run equilibrium effects of changes in treatment status. Note that this partial derivative is different from that outlined in Eq. 9. Since Q is a deterministic function of Z it stands to reason that any network based lag effect of treatment would be contained in \(\varvec{G}Z\) rather than \(\varvec{G}Q\). For example, if Z were test scores as in Thistlethwaite and Campbell (1960), and Q was the resulting scholarship award contingent upon Z, then the test scores of neighbors would be important, not the scholarship receipt. While one might want to include \(\varvec{G}Z\) in the second stage of the sampler we would recommend against that for three reasons. First, there is a deterministic link between Q and Z through the threshold, and while \(z_i\) determines agent i’s eligibility for treatment it is not the case that \(z_j\) would impact \(q_i\). Second, \(\varvec{G}Z\) would not appear in the partial derivative above because what is carried through to the third stage is the difference in intercept from the local-linear equations, not the additional parameters. Finally, recall that the second stage of the sampler is using only observations within the bandwidth and thus \(\varvec{G}Z\) cannot be used in its entirety anyway.

From this matrix we construct scalar summaries to characterize the effects (LeSage and Pace 2009). Total effects, \(n^{-1}\iota _n'(S_\gamma (\varvec{G})\iota _n)\), can be decomposed into the direct and indirect effects as outlined in Eq. 10. Direct effects are the average of the main diagonal, \(n^{-1}tr(S_\gamma (\varvec{G}))\), where \(tr(\cdot )\) is the trace operator; this average represents the direct effect of the treatment on the treated. It should be noted that, since Eq. 28 can be written as a Taylor expansion, the main diagonal includes feedback from higher order terms. The indirect effects are the average of cumulated off-diagonal terms, \(n^{-1}\iota _n'(S_\gamma (\varvec{G})\iota _n) - n^{-1}tr(S_\gamma (\varvec{G}))\). This means the scalar summary expressions reflect an average of cumulative spillovers from treatment on treated and untreated units alike, where cumulation is over neighboring observations and averaging takes place across all sample observations. Current RD frameworks assume that these indirect effects are zero through the no-interference assumption.
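The scalar summaries can be computed directly from Eq. 28. The following is a minimal sketch for a single draw of \(\rho \) and \(\gamma \); the function name `effect_summaries` is hypothetical, and in the sampler these quantities would be evaluated at every retained draw to obtain their posterior distributions.

```python
import numpy as np

def effect_summaries(G, rho, gamma):
    """Direct, indirect, and total effects implied by S = (I - rho G)^{-1} gamma."""
    n = G.shape[0]
    S = gamma * np.linalg.inv(np.eye(n) - rho * G)   # Eq. 28
    total = S.sum() / n           # n^{-1} iota' S iota
    direct = np.trace(S) / n      # n^{-1} tr(S)
    return direct, total - direct, total
```

When \(\varvec{G}\) is row-normalized, so that \(\varvec{G}\iota _n = \iota _n\), the total effect reduces to \(\gamma /(1-\rho )\). With \(\gamma = 2\) this gives a total effect of 4 at \(\rho = 0.50\) and \(4/3\) at \(\rho = -0.50\), while at \(\rho = 0\) the indirect effect vanishes and the total effect equals \(\gamma \).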

In this section we outlined how to obtain causal estimates of a treatment effect in the presence of cross-sectional interactions commonly found in network models. In the next section we will present Monte Carlo study results, which show that the proposed method acts as a generalized RD framework, not only allowing for the evaluation of richer partial derivatives but also producing estimates with reduced bias and mean square error at \(\rho = 0\).

Simulations

To study the efficacy of the proposed model specification we performed a Monte Carlo study where the primary data generating process can be written as:

$$\begin{aligned} Y&= (I - \rho \varvec{G})^{-1}(X\beta + \varvec{G}X\phi + \gamma Q + \epsilon ), \end{aligned}$$
(29)
$$\begin{aligned} q_i&= {\left\{ \begin{array}{ll} 1 &{}\quad z_i \ge z_0 \\ 0 &{}\quad z_i < z_0 \end{array}\right. } \end{aligned}$$
(30)
$$\begin{aligned} \epsilon&\sim N(0,\sigma ^2). \end{aligned}$$
(31)

In each simulation the sample size is \(N = 1,500\) and the number of covariates, K, is set to four. The network is constructed by distributing the sample across a plane and using Voronoi tessellation to construct an undirected adjacency matrix. We use a queen contiguity design (that is, connections are determined through shared borders and vertices), and while the results presented herein use only this structure, additional simulations have been done using other common structures (e.g. rook or bishop contiguity, K-nearest neighbor, minimum distance, etc.) as well as directed graph structures more commonly found in network analysis; code for this is available upon request. The information matrix, X, is drawn from a multivariate normal distribution such that \(X \sim \mathrm {MVN}(\varvec{0},I\sigma ^2)\) with \(\beta _1 = \cdots = \beta _k = 1\) and \(\phi _1 = \cdots = \phi _k = 0\). The running variable is distributed as \(Z \sim N(70,16)\) with \(z_0 = 73\) resulting in \(21.4\%\) of the observations being treated. We set \(\gamma = 2\) for all simulations. Figure 1 provides a visual reference for one data set generated under the outlined parameters. Note that this data set is one which exhibits no cross-sectional dependence (i.e. both \(\rho = 0\) and \(\theta = 0\)).
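One draw from the data generating process in Eqs. 29 to 31 can be sketched as below. This is an illustration under several assumptions that are not taken from the paper: the function name `simulate_rd_sdm` is hypothetical, \(N(70,16)\) is read as a variance of 16 (standard deviation 4), the covariates are given unit variance, the adjacency matrix `G` is supplied by the caller rather than built from a Voronoi tessellation, and the error variance is held fixed instead of being tuned to a signal-to-noise target.

```python
import numpy as np

def simulate_rd_sdm(G, rho, gamma=2.0, k=4, z_mean=70.0, z_sd=4.0,
                    z0=73.0, sigma=1.0, seed=0):
    """Simulate one data set from Y = (I - rho G)^{-1}(X beta + G X phi + gamma Q + eps)."""
    rng = np.random.default_rng(seed)
    n = G.shape[0]
    X = rng.normal(size=(n, k))               # covariates (unit variance assumed)
    beta, phi = np.ones(k), np.zeros(k)       # beta_j = 1, phi_j = 0 as in the text
    Z = rng.normal(z_mean, z_sd, size=n)      # running variable
    Q = (Z >= z0).astype(float)               # Eq. 30: sharp treatment assignment
    eps = rng.normal(0.0, sigma, size=n)      # Eq. 31
    A = np.eye(n) - rho * G
    Y = np.linalg.solve(A, X @ beta + G @ X @ phi + gamma * Q + eps)   # Eq. 29
    return Y, X, Z, Q, eps
```

Solving the linear system rather than forming the inverse keeps the draw inexpensive when the same network is reused across many values of \(\rho \).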

Fig. 1
Example of simulated data set

The dependence parameter is partitioned into 91 segments over the interval \((-0.90,0.90)\) with 1008 simulations completed for each value of \(\rho \). Since \(\varvec{G}\) is fixed, the domain of \(\rho \) remains fixed over the interval \((-2.005,1.000)\); our simulations are over a smaller interval because in practice negative values are rare, let alone large negative values. For each simulation we fix \(X,\ \varvec{G}, \ R\) and Q, drawing a new \(\epsilon \) with \(\sigma ^2\) such that the signal-to-noise ratio is maintained between 0.70 and 0.85 as calculated in LeSage and Pace (2017).
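The fixed domain of \(\rho \) follows from the eigenvalues of \(\varvec{G}\): for a row-normalized matrix the admissible interval is \((1/\lambda _{\min }, 1/\lambda _{\max })\) with \(\lambda _{\max } = 1\) (see LeSage and Pace 2009). A minimal sketch of that calculation is given below, assuming \(\varvec{G}\) is row-normalized; the function name `rho_domain` and the tolerance used to discard complex eigenvalues are hypothetical choices.

```python
import numpy as np

def rho_domain(G, tol=1e-10):
    """Admissible interval for rho from the real eigenvalues of G."""
    eig = np.linalg.eigvals(G)
    real = eig.real[np.abs(eig.imag) < tol]   # keep purely real eigenvalues only
    return 1.0 / real.min(), 1.0 / real.max()
```

Within this interval \(I - \rho \varvec{G}\) is nonsingular, which is why a grid for \(\rho \) placed inside it can be reused unchanged for every simulated data set that shares the same network.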

With these MC simulations we have two primary goals. First, we want to demonstrate that, over the domain of \(\rho \), the proposed method is well-behaved with respect to recovering the true treatment parameter.Footnote 10 For comparison, we will repeat the simulations using methods outlined by Imbens and Kalyanaraman (2012), Calonico et al. (2014) (both conventional and robust), and Chib and Greenberg (2014); the latter of which is a Bayesian nonparametric take on sharp and fuzzy RD structures.Footnote 11 Second, and more importantly, we want to illustrate the difference in partial derivatives between the proposed and alternative methods. In all cases we employ a set of uninformative, but proper priors. Table 1 outlines the treatment parameter at \(\rho = (-0.50, 0.00, 0.50)\) using each of the five methods.

Table 1 Comparison of parameter estimates: a selected look

It is important to note several subtle differences in this table between the established methods and the proposed. First, we choose to characterize the posterior distribution of the proposed through its mean. This choice is made because the methodology being evaluated is a sharp RD structure, and it is expected that \(\gamma \) is distributed in a Gaussian manner. This may not always be the case. For instance, in a Fuzzy RD structure the distribution of \(\gamma \) is a ratio of normal distributions, and it would likely be more appropriate to characterize that posterior through its mode. In both cases the value presented is the average of point estimates or posterior mean values where applicable. Additionally, while the values are similar, the standard deviation of the marginal posterior distribution is not equivalent to the standard error of the frequentist point estimate. Under very specific circumstances (e.g. known variance, flat priors) these two values can be equivalent, however in practice they generally are very different. Finally, under the proposed estimation method a bandwidth is calculated with every iteration of the algorithm. The value reported in Table 1 is the average bandwidth calculated over all simulations.Footnote 12

Since the non-spatial model is a special case of an SDM, the first block in Table 1 outlines estimates where the true DGP matches the assumptions of existing RD models. Here we can see that, despite \(\rho = 0\), the model uncertainty is such that the standard deviation of the posterior for treatment is slightly larger than that of Calonico et al. (2014). The downward bias comes from the marginalization of initial parameters used to residualize, and it is clear from these results that even at \(\rho = 0\) both the estimator bias and mean-squared error (MSE) are lowest in the proposed specification as compared to the alternative(s). Moving away from the case where \(\rho = 0\) we see an increase in both bias and MSE for each of the alternative estimators. This bias is inversely related to the sign of \(\rho \) such that if \(\rho > 0\) the bias is negative and if \(\rho < 0\) it is positive. Moreover, the data-driven bandwidths produced by the estimators vary as \(\rho \) changes, indicating a bias in the bandwidth determination that is introduced from the spatial dependence. Overall, Table 1 indicates that the proposed estimator produces estimates with less bias and mean-squared error while producing consistent posterior standard deviations and bandwidths over the domain.

For the frequentist estimators we take the average of standard errors to construct the confidence interval for those simulations, outlined in Table 2. At \(\rho = 0\), and indeed for each of the four values, this confidence interval contains the true value on average. Despite containing the true value on average, these intervals are getting wider as \(\rho \) moves away from zero. The intervals, while similar in value for \(\rho = 0\), differ through their interpretation. Intervals using methods outlined by Chib and Greenberg (2014) and the proposed method are the \(95\%\) highest posterior density (HPD), which is the shortest of possible Bayesian intervals (Casella and Berger 2002). Unlike the confidence interval, an HPD is a probability statement specifically indicating that the true value lies within the interval with probability 0.95. Going forward we will refer to these intervals in a comparative, and interchangeable fashion, though we encourage the reader to see Casella and Berger (2002) and Gelman et al. (2013) for a full discussion of the differences.

Table 2 Interval estimates: a selected look

It is with this in mind that we would like to point out the similarities, at \(\rho = 0\), between the constructed CIs and HPDs found in Table 2. Note that, according to Calonico et al. (2014), the main difference between the CI of conventional RD estimation and their robust estimator is the bias correction in the bandwidth, and new standard error calculations based on that correction. Their method ultimately produces a smaller bandwidth with larger standard errors that have better coverage properties. Indeed, our simulations confirm these results and estimators from Calonico et al. (2014) do in fact have simulated coverage closer to \(95\%\) when \(\rho = 0\). However, over the domain of \(\rho \), coverage for estimates from both the Calonico et al. (2014) and Imbens and Kalyanaraman (2012) methods varies widely, from a low of \(80\%\) to a high of \(100\%\).

Of course, the parameter itself in a Durbin model is of little importance since it is only one input into the partial derivative. It may then be useful to consider comparing the direct effects. By assumption there are no spillovers in typical RD estimation methods, meaning that the direct effects are explicitly assumed to be equivalent to the total effects. Figure 2e, f outline the bias and MSE if the estimates from non-spatial estimation methods are considered to be direct effects themselves. A similar story emerges under this view; the bias and MSE are both U-shaped, however many of the curves have shifted down, with absolute bias being smaller in magnitude. Interestingly, for estimates produced using methods outlined by Imbens and Kalyanaraman (2012), at no point in the domain examined is the bias zero, whereas, when comparing the parameters directly, the bias disappeared at \(\rho \approx -0.50\). Meanwhile, the crossing for the other alternative estimation methods still occurs at \(\rho \approx 0\), showing that they have an unbiased estimate of the treatment with no spatial dependence.

Fig. 2
Bias and MSE treatment effects. Note Here we have plotted the bias and mean squared error for each facet of the treatment parameter (e.g., direct and indirect effects). For each estimator, CCT Conventional (red solid line), CCT Robust (blue short dashed line), IK (orange dotted line), CG (purple long dashed line), and the proposed (black dotted dashed line) we plot the corresponding measure over the domain of \(\rho \)

Table 3 Partial effects estimates

Table 3 shows that, if \(\rho \ne 0\), the direct effect, as estimated using methods put forth by both Calonico et al. (2014) and Imbens and Kalyanaraman (2012), is close to the true value but exhibits noticeable bias with relatively wide confidence intervals (as compared to \(\rho = 0\)). More striking is the estimate of the total effect. For \(\rho = -0.50\), the estimates of the total effect from both the Calonico et al. (2014) and Imbens and Kalyanaraman (2012) estimators overstate the welfare effect of the treatment by \(\approx 64\%\). For \(\rho = 0.50\) the opposite is true: the estimators of Calonico et al. (2014) and Imbens and Kalyanaraman (2012) understate the total welfare effect by \(\approx 100\%\). Remember that for conventional estimators there are no indirect effects by construction. As a result, the total effect is the direct effect. This is particularly pertinent to policy decisions based on the effect size since inaccurate assessment of effect size can create inefficiencies in public spending on that treatment.

Empirical example: U.S. house elections

To ground the proposed method in real data we turn to what has become a common example used in the RD econometrics literature: election results from U.S. legislative bodies. These studies often look at party incumbency effects in the U.S. House of Representatives (Lee 2008; Broockman 2009; Butler 2009; Caughey and Sekhon 2011; Cho et al. 2012; Calonico et al. 2014) or U.S. Senate (Albouy 2013; Chib and Greenberg 2014). Here we look at results from elections in the U.S. House of Representatives from 1946 until 1996 and, similar to both Lee (2008) and Caughey and Sekhon (2011), we look at the effect of a close (Democrat) win, as measured by margin of victory (DifDPct), in time period t on vote share (Democrat) in time period \(t + 1\) (DPctNxt).Footnote 13

While additional information is not strictly necessary (Lee and Lemieux 2010) we use some of the data collected by Caughey and Sekhon (2011) as controls.Footnote 14 Specifically we use a dummy indicator for the party affiliation of the Secretary of State (SoSDem), the margin of victory in presidential elections averaged over the decade (DifPVDec), the percentage of government workers in a district (GovWkPct), the percentage of population considered to be living in an urban setting (UrbanPct), the percentage of Black voters in the district (BlackPct), and the percentage of population which is foreign born (ForgnPct). Table 4 outlines the descriptive statistics of the data.

Table 4 Summary statistics: U.S. house elections 1945–1995

We obtain the weight matrix by combining the above data with latitude and longitude from historical congressional district maps (Lewis et al. 2013) using a queen contiguity design. This is done for each congress separately, so the complete weight matrix is of a block diagonal form. We chose this structure for its simplicity rather than its strict validity. It is important to note that while U.S. House districts are adjusted following the Decennial Census there is a great deal of correlation in the neighboring districts across the election years. Many times a district will have the same neighbors throughout the sample.Footnote 15 A Moran’s I test, with a null hypothesis of no correlation, produces a statistic of 51.9902 indicating substantial evidence against the null. This is further buttressed by a naive look at model residuals using simple OLS and an SDM model estimated using maximum likelihood. Since the OLS estimator is a special case of the SDM where \(\rho \) and \(\phi \) equal zero, the residuals produced by each model would be nearly identical in the absence of dependence. While highly correlated (0.865), the residuals differ enough to provide evidence of cross-sectional dependence. This, combined with the results of the Moran’s I, indicates that a Durbin model is likely a correct specification.Footnote 16

Fig. 3
a Here the raw data is plotted to show a clear discontinuity at zero without binning the data. Vote share in time period \(t+1\) is on the Y-axis with the plus (red) icons referring to Republican vote share and dot (blue) referring to Democrat vote share. b Here the data is binned in a more traditional RD plot to show a clear discontinuity at zero. Note that we have sized the points relative to the number of observations in the bin. c Here the residualized outcome (using OLS) is plotted to show a clear discontinuity at zero without binning the data. Vote share in time period \(t+1\) is on the Y-axis with the plus (red) icons referring to Republican vote share and dot (blue) referring to Democrat vote share. d Here the residualized outcome (using OLS) is binned in a more traditional RD plot to show a clear discontinuity at zero. Note that we have sized the points relative to the number of observations in the bin. e Here the residualized outcome (at the conditional mean for all posterior parameter estimates using an SDM) is plotted to show a clear discontinuity at zero without binning the data. Vote share in time period \(t+1\) is on the Y-axis with the plus (red) icons referring to Republican vote share and dot (blue) referring to Democrat vote share. f Here the residualized outcome (at the conditional mean for all posterior parameter estimates using an SDM) is binned in a more traditional RD plot to show a clear discontinuity around zero. Note that we have sized the points relative to the number of observations in the bin

Figure 3a–f plots the outcome and running variable in traditional RD type plots. Along the left column we plot the raw data for the outcome (Fig. 3a), the non-spatial residuals (Fig. 3c), and the spatial residuals (Fig. 3e). The discontinuity is apparent in both Fig. 3a and e, though far less convincing in Fig. 3c. The plots themselves provide no informative power regarding the significance (or lack thereof) of subsequent estimates but the difference in plots is such that the jump is clearer. Along the right column we have plotted the binned values over the running variable. The story repeats itself in these plots, with Fig. 3b and f displaying a clear jump at zero. Moreover, in all three cases the area around the cutoff looks decidedly linear despite the non-linearity of the full sample.

As a reminder, while the original study by Lee (2008) was completed using a fourth order polynomial in the margin of victory, our pseudo-replication will be limited to a local linear specification. Our decision is supported by recent literature which casts doubt on higher order polynomial use in the RD setting (Gelman and Imbens 2018) as well as graphic evidence from Fig. 3a–f. Table 5 provides the results of our proposed specification and a comparative set of specifications.Footnote 17 For all results, the dependent variable is the residualized, rather than raw, outcome so as to keep the results as consistent as possible for comparison. The conventional estimates, those in the top block, are consistent with those found in Lee (2008) where winning a close election in time t provides a roughly \(7.5\%\) boost to vote share in time \(t+1\).Footnote 18

Table 5 Effect of close house elections in \(\%\): residualized outcome (DPctNxt)

While the point estimates in the first block of Table 5 are similar, the standard errors (or standard deviation in the case of CG) vary significantly and produce different confidence intervals. The widest of intervals in this block are produced by Calonico et al. (2014) methods, which are consistent with previously published results. The second through fourth blocks of this table outline the empirical results using the proposed specification with variation in how the bandwidth is calculated. In our simulations we used the method of bandwidth calculation put forth by Imbens and Kalyanaraman (2012) primarily for its computational efficiency. In this particular example the choice of bandwidth under the proposed method is irrelevant as all three options produce nearly identical results both in point estimates and standard deviation. Results from the original study (Lee 2008) show between a \(7.7\%\) and \(8.1\%\) average effect while Caughey and Sekhon (2011) shows approximately \(9\%\) using a similar specification. While the results of the proposed method are similar with respect to the parameter estimate, it is important to note that the most direct comparison is the parameter estimate of the non-spatial specifications to the direct effects from the spatial models. With this in mind we see that the estimated direct effect, \(9.2\%\), is larger than any of the non-spatial specifications. This is consistent with the simulations provided earlier given the estimated \(\rho \approx 0.4\).

Results using the proposed method indicate that the total effect of winning a close election in district A, in time period t, is approximately a \(14.6\%\) increase in Democrat vote share in time \(t+1\). The direct effect of the treatment on the treated is approximately \(9.2\%\) with the remaining effects stemming from averaging the cumulated effects of neighbor spillovers. One possible explanation for such an increase is that, given a close win, the resources in subsequent campaigns can be shifted away from that district to neighboring districts due to the anticipated incumbency advantage.

In this section we have shown that elections in the U.S. House of Representatives from 1946 to 1996 do exhibit spatial dependence. We then compared the local-linear estimations for the treatment effect put forth by Imbens and Kalyanaraman (2012); Calonico et al. (2014); Arai and Ichimura (2018) to those of the proposed methods. We find significant and material indirect effects from close (Democrat) wins which support geographic clustering by voters. We vary the bandwidth calculation in our proposed method and show that the standard deviation of the marginal posterior distribution for the treatment has less variation than the frequentist standard errors. This leads to interval estimates which are roughly equivalent across each of the three alternatives and is informative in the sense that additional data admitted through a wider bandwidth are not materially impacting the results.

Conclusion

Regression discontinuity design has become a favored technique for identifying a causal relationship from secondary data. As these types of data continue to evolve and include measures of both space and network position it is important to consider how these cross-sectional interactions affect the RD structure. In this paper we have outlined one method for estimating, in an RD framework, the effect of cross-sectional dependence on treatment. Failure to account for this dependence produces estimates of treatment which are potentially both biased and inefficient. We use a combination of residualization, Metropolis–Hastings guided marginalization, and Bayesian sampling methods to filter out the cross-sectional dependence and get a clean look at the local average treatment effect through a local-linear specification.

We provide simulated evidence that the omission of cross-sectional interactions from the standard RD framework can lead to inferential problems associated with the aforementioned bias and inefficiencies. Moreover, we showed that in the special case of \(\rho = 0\), that is a standard RD model, the proposed estimation method produces estimates with slightly lower bias and mean-squared-error than existing methods. This improvement comes from the incorporation of uncertainty both from the observations admitted to the estimation of treatment and the residualization parameters. We finished by demonstrating the proposed estimation method using the common RD example of close elections in the U.S. House of Representatives. We showed that the direct effect of the treatment on the treated is \(\approx 9\%\) while the average indirect effect from the treatment is \(\approx 6\%\). While the mechanisms which cause these indirect effects are not outlined in the data, one such avenue is a deployment of ground resources around a closely won district rather than within it in anticipation of the incumbency effect.

Notes

  1.

    Note that this is different from border discontinuity, which is often used as an identification strategy in contexts where geography is important (see Black 1999 for one example). Recent work by Kolak and Anselin (2020) emphasizes that correct specification of the spatial process is critical in this context. That type of RD design is beyond the scope of this paper; we focus instead on treatment assignment that is determined not spatially but through some other continuous running variable (e.g., income thresholds or vote share).

  2.

    It is not clear that giving each student a personalized exam would prevent this channel from producing cross-sectional dependence. Students may still try to cheat and, rather than improving their score by copying from a neighbor, may instead make it worse. In such a situation we would have negative cross-sectional dependence.

  3.

    In fact, as stated in Lee and Lemieux (2010), constraining the slope on either side of the cutoff to be equal is counter to the entire spirit of regression discontinuity. Rather than two regression equations, one for either side of the cutoff, a constrained slope on the running variable would be produced through a pooled regression, \(Y = \alpha _l + \tau *Q + f(Z - z_0) + \epsilon \) (see Section 4.3 p318). They show that if the constraint is enforced then data from the right side of the cutoff would be used to estimate the pertinent parameter on the left and vice versa.

  4.

    We recognize that this is a very restrictive assumption and that it is unlikely, at the individual level, to hold with any degree of certainty. Endogenous weight matrices have seen a great deal of research work lately and we encourage readers to see Kelejian and Piras (2014), Hsieh and Lee (2016), Han and Lee (2016), Hsieh and Lin (2017) and Shi and Lee (2018) among others for examples.

  5.

    It is important to remember that we are not saying the probability of an agent being treated is conditional upon another agent receiving treatment. Rather, the untreated agent's outcome depends on the characteristics and outcome of the treated agent.

  6.

    We recognize that new work addresses the inclusion of covariates in the RD design (Frölich and Huber 2019; Calonico et al. 2019) though we stop short of implementing this since our objective is primarily to exploit the structure of the error term in an effort to marginalize the network.

  7.

    We are not required to assume linearity; this has been done primarily as a convenience. In practice, any likelihood function can be substituted in place of the normal distribution and the conditional posteriors derived. This means that the error term need not be normally distributed, though we do require it to be independent and identically distributed.

  8.

    We would like to make two points regarding this estimate. First, because we have filtered out the network effects, this estimate is a simple parameter estimate and does not give any indication of the spillovers that may take place due to the treatment. Second, we do not include the spatially lagged running variable term, for two reasons: (i) the object of interest is \(\gamma \) for each equation, not the additional parameter estimates; and (ii) since we condition the third stage upon this particular value, the impact of a spatially lagged running variable, if any, would not be included in the third stage or subsequent partial derivatives.

  9.

    We are happy to provide both an example file and full code, written in Matlab, upon request.

  10.

    For brevity, we present only the results for the parameter of interest; the additional estimates (e.g., \(\beta \) or \(\phi \)) are available upon request.

  11.

    For brevity, we do not cover the details of each estimator; we encourage readers to consult the originating papers for more information. The key point for this comparison is that none of these RD estimators allows for spillover effects and thus, by construction, each assumes that \(\rho = 0\).

  12.

    Again, we have chosen to characterize this value by its posterior mean; however, this distribution tends to be skewed and/or multi-modal, and it may be more appropriate to report the mean, mode, and interval estimates for those who are interested.

  13.

    It is important to note that while we have chosen to look at the effect of close Democrat wins we only do this to link our analysis to the previous literature. The model looking at close Republican wins would be a near mirror image of this model and provide qualitatively the same results.

  14.

    We would like to thank the authors for making this data available through replication files hosted at https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/16357&version=1.0.

  15.

    These adjustments are what led Lee (2008) to omit the years ending in '0' and '2'. We have found, however, that including these years does not materially change the results, and we can provide supplementary tables upon request.

  16.

    It is important to note that the SDM nests both the Spatial Autoregressive (SAR) model and OLS as special cases. Using the log-marginal likelihood, we can calculate the model probabilities as in LeSage and Pace (2009). Comparing the SDM, SAR, and spatial error model (SEM) using the first-stage residualization clearly indicates the SDM as the preferred option.

  17.

    Conventional Imbens and Kalyanaraman (2012) was estimated using functions written by the authors in Matlab while both Calonico et al. (2014) estimates and the Chib and Greenberg (2014) estimates were produced using their respective packages in R.

  18.

    Interestingly, Caughey and Sekhon (2011) find close to a \(9.25\%\) increase using the same data; however, that estimate was based on the raw rather than the residualized outcome. The conventional estimates we obtain using the raw outcome variable are quantitatively similar to this value.

References

  1. Albouy D (2013) Partisan representation in Congress and the geographic distribution of federal funds. Rev Econ Stat 95(1):127–141

  2. Anderson ML (2014) Subways, strikes, and slowdowns: the impacts of public transit on traffic congestion. Am Econ Rev 104(9):2763–2796

  3. Angrist JD, Lavy V (1999) Using Maimonides’ rule to estimate the effect of class size on scholastic achievement. Q J Econ 114(2):533–575

  4. Anselin L (1988) Spatial econometrics: methods and models. Kluwer Academic Publishers, Dordrecht

  5. Anselin L, Arribas-Bel D (2013) Spatial fixed effects and spatial dependence in a single cross-section. Pap Reg Sci 92(1):3–17

  6. Arai Y, Ichimura H (2018) Simultaneous selection of optimal bandwidths for the sharp regression discontinuity estimator. Q Econ 9(1):441–482

  7. Athey S, Imbens GW (2017) The state of applied econometrics: causality and policy evaluation. J Econ Perspect 31(2):3–32

  8. Bin O, Poulter B, Dumas CF, Whitehead JC (2011) Measuring the impact of sea-level rise on coastal real estate: a hedonic property model approach. J Reg Sci 51(4):751–767

  9. Black SE (1999) Do better schools matter? Parental valuation of elementary education. Q J Econ 114(2):577–599

  10. Brasington DM (2017) School spending and new construction. Reg Sci Urban Econ 63(Supplement C):76–84

  11. Broockman DE (2009) Do congressional candidates have reverse coattails? Evidence from a regression discontinuity design. Polit Anal 17(4):418–434

  12. Butler DM (2009) A regression discontinuity design analysis of the incumbency advantage and tenure in the US House. Elect Stud 28(1):123–128

  13. Calonico S, Cattaneo MD, Farrell MH, Titiunik R (2019) Regression discontinuity designs using covariates. Rev Econ Stat 101(3):442–451

  14. Calonico S, Cattaneo MD, Titiunik R (2014) Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica 82(6):2295–2326

  15. Card D, Giuliano L (2016) Can tracking raise the test scores of high-ability minority students? Am Econ Rev 106(10):2783–2816

  16. Casella G, Berger RL (2002) Statistical inference, vol 2. Duxbury, Pacific Grove

  17. Cattaneo MD, Frandsen BR, Titiunik R (2015) Randomization inference in the regression discontinuity design: an application to party advantages in the US Senate. J Causal Inference 3(1):1–24

  18. Caughey D, Sekhon JS (2011) Elections and the regression discontinuity design: lessons from close US House races, 1942–2008. Polit Anal 19(4):385–408

  19. Chernozhukov V, Goldman M, Semenova V, Taddy M (2017) Orthogonal machine learning for demand estimation: high dimensional causal inference in dynamic panels. arXiv preprint arXiv:1712.09988

  20. Chib S, Greenberg E (2014) Nonparametric Bayes analysis of the sharp and fuzzy regression discontinuity designs. Technical report, Washington University St Louis, Olin School of Business

  21. Cho WKT, Gimpel JG, Shaw DR et al (2012) The tea party movement and the geography of collective action. Q J Polit Sci 7(2):105–133

  22. Cho WKT, Rudolph TJ (2008) Emanating political participation: untangling the spatial structure behind participation. Br J Polit Sci 38(2):273–289

  23. Cook TD, Wong VC (2008) Empirical tests of the validity of the regression discontinuity design. Annales d’Economie et de Statistique 10:127–150

  24. Cutts D, Webber DJ (2010) Voting patterns, party spending and relative location in England and Wales. Reg Stud 44(6):735–760

  25. De la Cuesta B, Imai K (2016) Misunderstandings about the regression discontinuity design in the study of close elections. Annu Rev Polit Sci 19:375–396

  26. Eggers AC, Fowler A, Hainmueller J, Hall AB, Snyder JM Jr (2015) On the validity of the regression discontinuity design for estimating electoral effects: new evidence from over 40,000 close races. Am J Polit Sci 59(1):259–274

  27. Frölich M, Huber M (2019) Including covariates in the regression discontinuity design. J Bus Econ Stat 37(4):736–748

  28. Gelman A, Imbens G (2018) Why high-order polynomials should not be used in regression discontinuity designs. J Bus Econ Stat 25:1–10

  29. Gelman A, Stern HS, Carlin JB, Dunson DB, Vehtari A, Rubin DB (2013) Bayesian data analysis. Chapman and Hall/CRC, Cambridge

  30. Hainmueller J, Kern HL (2008) Incumbency as a source of spillover effects in mixed electoral systems: evidence from a regression-discontinuity design. Elect Stud 27(2):213–227

  31. Han X, Lee L-F (2016) Bayesian analysis of spatial panel autoregressive models with time-varying endogenous spatial weight matrices, common factors, and random coefficients. J Bus Econ Stat 34(4):642–660

  32. Hidano N, Hoshino T, Sugiura A (2015) The effect of seismic hazard risk information on property prices: evidence from a spatial regression discontinuity design. Reg Sci Urban Econ 53:113–122

  33. Hsieh C-S, Lee LF (2016) A social interactions model with endogenous friendship formation and selectivity. J Appl Economet 31(2):301–319

  34. Hsieh C-S, Lin X (2017) Gender and racial peer effects with endogenous network formation. Reg Sci Urban Econ 67:135–147

  35. Imbens G, Kalyanaraman K (2012) Optimal bandwidth choice for the regression discontinuity estimator. Rev Econ Stud 79(3):933–959

  36. Imbens GW, Rubin DB (2015) Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, Cambridge

  37. Jacob BA, Lefgren L (2004) Remedial education and student achievement: a regression-discontinuity analysis. Rev Econ Stat 86(1):226–244

  38. Kelejian HH, Piras G (2014) Estimation of spatial models with endogenous weighting matrices, and an application to a demand model for cigarettes. Reg Sci Urban Econ 46:140–149

  39. Kelejian HH, Prucha IR (1998) A generalized spatial two-stage least squares procedure for estimating a spatial autoregressive model with autoregressive disturbances. J Real Estate Finance Econ 17(1):99–121

  40. Kelejian HH, Prucha IR (1999) A generalized moments estimator for the autoregressive parameter in a spatial model. Int Econ Rev 40(2):509–533

  41. Kim J, Elliott E, Wang D-M (2003) A spatial analysis of county-level outcomes in US presidential elections: 1988–2000. Elect Stud 22(4):741–761

  42. Kim J, Goldsmith P (2009) A spatial hedonic approach to assess the impact of swine production on residential property values. Environ Resource Econ 42(4):509–534

  43. Kolak M, Anselin L (2020) A spatial perspective on the econometrics of program evaluation. Int Reg Sci Rev 43(1–2):128–153

  44. Lacombe DJ, Holloway GJ, Shaughnessy TM (2014) Bayesian estimation of the spatial Durbin error model with an application to voter turnout in the 2004 presidential election. Int Reg Sci Rev 37(3):298–327

  45. Lazrak F, Nijkamp P, Rietveld P, Rouwendal J (2014) The market value of cultural heritage in urban areas: an application of spatial hedonic pricing. J Geogr Syst 16(1):89–114

  46. Lee DS (2008) Randomized experiments from non-random selection in US House elections. J Econom 142(2):675–697

  47. Lee DS, Lemieux T (2010) Regression discontinuity designs in economics. J Econ Lit 48(2):281–355

  48. LeSage J (2014) What regional scientists need to know about spatial econometrics. Rev Reg Stud 44(1):13–32

  49. LeSage J, Pace RK (2009) Introduction to spatial econometrics. Chapman & Hall/CRC, Boca Raton

  50. LeSage JP, Pace RK (2017) Spatial econometric Monte Carlo studies: raising the bar

  51. Lewis JB, DeVine B, Pitcher L, Martis KC (2013) Digital boundary definitions of United States congressional districts, 1789–2012

  52. Ludwig J, Miller DL (2007) Does Head Start improve children’s life chances? Evidence from a regression discontinuity design. Q J Econ 122(1):159–208

  53. Macartney H, Singleton JD (2017) School boards and student segregation. Technical report, National Bureau of Economic Research

  54. Meng L (2013) Evaluating China’s poverty alleviation program: a regression discontinuity approach. J Public Econ 101:1–11

  55. Mihaescu O, vom Hofe R (2012) The impact of brownfields on residential property values in Cincinnati, Ohio: a spatial hedonic approach. J Reg Anal Policy 42(3):223

  56. Moulton JG, Waller BD, Wentland S (2016) Who benefits from targeted property tax relief? Evidence from Virginia elections

  57. Pace RK, LeSage JP, Zhu S (2011) Spatial dependence in regressors and its effect on estimator performance

  58. Pettersson-Lidbom P (2008) Do parties matter for economic outcomes? A regression-discontinuity approach. J Eur Econ Assoc 6(5):1037–1056

  59. Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70(1):41–55

  60. Rubin DB (1978) Bayesian inference for causal effects: the role of randomization. Ann Stat 6(1):34–58

  61. Sales A, Hansen BB (2014) Limitless regression discontinuity: causal inference for a population surrounding a threshold. arXiv:1403.5478

  62. Shi W, Lee L-F (2018) A spatial panel data model with time varying endogenous weights matrices and common factors. Reg Sci Urban Econ 72:6–34

  63. Terrier C, Ridley MW (2018) Fiscal and education spillovers from charter expansion

  64. Thistlethwaite DL, Campbell DT (1960) Regression-discontinuity analysis: an alternative to the ex post facto experiment. J Educ Psychol 51(6):309

  65. Van der Klaauw W (2002) Estimating the effect of financial aid offers on college enrollment: a regression-discontinuity approach. Int Econ Rev 43(4):1249–1287

  66. vom Hofe R, Parent O, Grabill M (2019) What to do with vacant and abandoned residential structures? The effects of teardowns and rehabilitations on nearby properties. J Reg Sci 59(2):228–249

  67. Wong SK, Yiu C, Chau K (2013) Trading volume-induced spatial autocorrelation in real estate prices. J Real Estate Finance Econ 46(4):596–608

Author information

Corresponding author

Correspondence to Gary Cornwall.

Additional information

G. Cornwall: We would like to thank the Taft Research Center for providing generous research support. The authors would like to acknowledge James P. LeSage, Olivier Parent, Jeff Mills, Huiben Weng, Justine Mallatt, Marina Gindelsky, Scott Wentland, and participants of the Midwest Econometrics Group, Southern Regional Science Association Conference, and Southern Economic Association Meetings for their helpful comments. The views expressed here are those of the authors and do not represent those of the U.S. Bureau of Economic Analysis or the U.S. Department of Commerce. First version 05/19/2017.

About this article

Cite this article

Cornwall, G., Sauley, B. Indirect effects and causal inference: reconsidering regression discontinuity. J Spat Econometrics 2, 8 (2021). https://doi.org/10.1007/s43071-021-00014-3

Keywords

  • Bayesian
  • Durbin models
  • Regression discontinuity

JEL Codes

  • C01
  • C11
  • C3
  • C31