1 Introduction

The number of available methods for causal inference has seen enormous growth in the last three decades (see Abadie and Cattaneo 2018, for a recent overview). Although progress has been tremendous, applied research does not always keep up. On the one hand, flexible semi- and non-parametric methods based on the propensity score (PS) are widely applied when estimating causal effects under the selection-on-observables assumption (e.g., see Austin and Stuart 2015; Thoemmes and Kim 2011). On the other hand, much of the literature using instrumental variable (IV) estimation to overcome bias due to unobserved factors relies on two-stage least squares (2SLS). This is despite the fact that 2SLS may yield inconsistent estimates of treatment effects when effects are heterogeneous and covariates are predictive of the instrument (Abadie 2003).

This paper concentrates on the most common case in which the researcher aims to estimate causal effects of some treatment using a single binary IV. First, the paper reviews available results from the literature on the implications of using standard linear-in-covariates 2SLS estimation under effect heterogeneity. Most importantly for this paper, the literature shows that 2SLS yields a ratio of conditional variance-weighted averages of covariate-specific effects. If effects are indeed heterogeneous and related to the PS, then 2SLS yields inconsistent estimates. Estimators based on the PS provide a consistent (Frölich 2007), readily available and intuitive alternative. Hence, the paper briefly describes some basic IV estimators using the PS as well as the novel efficient covariate balancing approach by Heiler (2021). By re-estimating the returns to college using these approaches, exploiting college proximity as an instrument in the data of Card (1995), the paper shows that the threat of obtaining inconsistent estimates when using 2SLS is not merely hypothetical. 2SLS yields systematically larger effect estimates than more flexible estimators based on the PS. Further inspection shows that this difference is mainly due to the implicit conditional-variance weighting performed by 2SLS.

This case study has been widely used to teach Economics students around the world about the use of IV methods to overcome bias due to unobserved confounders as well as the importance of effect heterogeneity. Moreover, the case study has been widely used in a variety of papers, see for example Tan (2006), Huber and Mellace (2015), Kitagawa (2015), Mourifié and Wan (2017), Andresen and Huber (2021), Sloczynski (2021), Sloczynski et al. (2022) and Blandhol et al. (2022). Most of these papers are concerned with instrument validity, an issue that is discussed but not of main interest in this paper. The only study known to the author that compares parametric estimators of the returns to college with more flexible estimators for this case study is Sloczynski et al. (2022). They too find sizable gaps in estimates. However, in contrast to this paper, they do not offer an explanation for this phenomenon.

The remainder of this paper is organized as follows: Sect. 2 reviews identification and estimation using IVs, Sect. 3 applies 2SLS and comparison methods based on the PS to the data. Section 4 concludes.

2 Identification and estimation using instrumental variables

Assume we have an i.i.d. sample for i = 1,…,N units, where for each unit we observe some exogenous characteristics \({\text{X}}_{\text{i}}\), a binary treatment variable \({\text{D}}_{\text{i}}\), an outcome \({\text{Y}}_{\text{i}}\) and a single binary instrument \({\text{Z}}_{\text{i}}\). Furthermore, assume that there is an unobserved confounder \({\text{U}}_{\text{i}}\) that has an impact on both the treatment variable \({\text{D}}_{\text{i}}\) and the outcome \({\text{Y}}_{\text{i}}\). In the language of classical least squares regression, this creates an omitted variable bias and the selection-on-observables assumption fails (Wooldridge 2010, Chap. 4). To stick to the returns-to-college example used throughout this paper, conditioning on observed characteristics such as labor market experience or region of residence is insufficient to remove bias from standard regression or matching estimates of the effects of college attendance on wages if unobserved ability has an impact on the college decision and labor market earnings (Blackburn and Neumark 1993).

Under such circumstances, one can use IV techniques to estimate causal effects by exploiting variation in the treatment variable \({\text{D}}_{\text{i}}\) through the instrument \({\text{Z}}_{\text{i}}\). When effects are heterogeneous, IV methods identify local treatment effects, i.e. average effects for specific sub-populations influenced by the instrument. For this identification result to hold, the instrument needs to be exogenous, i.e. the instrument has to be as good as randomly assigned after conditioning on covariates and there must not be a direct effect of the instrument on the outcome. Moreover, the instrument must influence the treatment decision in a monotonic way. Imbens and Angrist (1994) introduce what Sloczynski (2021) calls “strong monotonicity”, which is the assumption that the instrument weakly increases or decreases the treatment probability for everyone. Under this assumption, IV methods identify the local average treatment effect (LATE, Imbens and Angrist 1994), also called the complier average causal effect (CACE), i.e. the average treatment effect of individuals who act in line with the instrument. If defiers, i.e. individuals who act in the opposite direction of compliers, exist, and one is willing to assume that the sign of the effect of the instrument on treatment is determined solely by covariates (“weak monotonicity”, Sloczynski 2021), the CACE may be recovered by averaging effects for individuals with covariate values estimated to behave in the direction of compliers. Moreover, a more general effect, the mover average causal effect (MACE, Kolesár 2013), i.e. the average treatment effect for compliers and defiers, is identified.

Using the standard potential outcomes framework (Imbens and Angrist 1994; Rubin 1974), define \({\text{D}}_{\text{i}}\left(1\right)\) and \({\text{D}}_{\text{i}}\left(0\right)\) as the potential treatment states if the unit was assigned \({\text{Z}}_{\text{i}}=1\) or \({\text{Z}}_{\text{i}}=0\). If the instrument indeed has no direct impact on the outcome, one may write potential outcomes as \({\text{Y}}_{\text{i}}\left({d}_{i}\right)\), with \({\text{Y}}_{\text{i}}\left(1\right)\) and \({Y}_{i}\left(0\right)\) being the outcomes that would be observed under treatment and without. Assuming the instrument raises the chance of receiving treatment on average, the strong monotonicity assumption implies that for compliers \({D}_{i}\left(1\right)>{D}_{i}\left(0\right)\), i.e. they receive treatment if assigned \({\text{Z}}_{\text{i}}=1\) and they do not if assigned \({\text{Z}}_{\text{i}}=0\). Based on these definitions and assumptions, the standard CACE can be written as

$${{\Delta }}^{CACE}=E\left[{Y}_{i}\left(1\right)-{Y}_{i}\left(0\right)|{D}_{i}\left(1\right)>{D}_{i}(0)\right] .$$
(1)

The MACE is defined as \({{\Delta }}^{MACE}=E\left[{Y}_{i} \left(1\right)-{Y}_{i} \left(0\right)|{D}_{i}\left(1\right) \ne {D}_{i}(0)\right]\) and can be recovered by using a reordered instrument \({\text{Z}}_{\text{i}}^{\text{R}}\), i.e. an adapted instrument which is reversed for defiers, defined as

$${Z}_{i}^{R}={Z}_{i}I\left({\delta }^{D}\left({X}_{i}\right)\ge 0\right)+\left(1-{Z}_{i}\right)I\left({\delta }^{D}\left({X}_{i}\right)<0\right),$$
(2)

where \({\delta }^{D}\left({X}_{i}\right)=E\left[{D}_{i}(1)-{D}_{i}(0) \mid {X}_{i}\right]\) is the covariate-specific average effect of the instrument on the treatment decision and \(I(\cdot )\) is the indicator function (Sloczynski 2021).
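The reordering in Eq. (2) can be sketched in a few lines of Python. This is only a minimal illustration, assuming discrete covariate cells and using the cell-wise difference in treatment rates between instrument groups as the estimate of \({\delta }^{D}\left({X}_{i}\right)\). The function name `reorder_instrument` and the argument `cell` are hypothetical and not taken from the paper.

```python
import numpy as np

def reorder_instrument(z, d, cell):
    """Flip Z inside covariate cells whose estimated first stage is negative."""
    z, d, cell = np.asarray(z), np.asarray(d), np.asarray(cell)
    z_r = z.copy()
    for c in np.unique(cell):
        m = cell == c
        # Cell-specific first stage: difference in treatment rates by instrument.
        delta_d = d[m & (z == 1)].mean() - d[m & (z == 0)].mean()
        # Cells lacking one instrument group give nan and are left unchanged.
        if delta_d < 0:
            z_r[m] = 1 - z[m]
    return z_r
```

After this step the reordered instrument encourages treatment in every cell, at least according to the estimated first stages.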

For the empirical analysis of the paper, the exogeneity assumption is assumed to hold. Moreover, it is assumed that at least the weak monotonicity assumption holds as well. While the failure of the monotonicity assumption makes the interpretation of estimands difficult if not impossible, differences in estimates of these quantities are still interesting to inspect in order to understand the estimators’ behavior under effect heterogeneity.

For the following exposition of estimation methods, assume that strong monotonicity holds. While Frölich (2007) shows that the CACE is non-parametrically identified under exogeneity and strong monotonicity, most applied research still uses 2SLS to estimate effects using an IV. Typically, researchers model the outcome and treatment equations as linear functions of the instrument and covariates. That is, they build regression models of the form

$$Y_{i} = \alpha _{Y} + \beta _{Y}^{\prime } X_{i} + \gamma ^{Y} Z_{i} + \varepsilon _{i}^{Y}$$
(3)
$$D_{i} = \alpha _{D} + \beta _{D}^{\prime } X_{i} + \gamma ^{D} Z_{i} + \varepsilon _{i}^{D} ,$$
(4)

where it is (implicitly) assumed that slope-coefficients are constant and that \(\varepsilon _{i}^{Y}\) and \(\varepsilon _{i}^{D}\) are well-behaved error terms. The corresponding 2SLS estimator can be written as \({\widehat{{\Delta }}}_{2SLS}={\widehat{\gamma }}^{Y}/{\widehat{\gamma }}^{D}\), i.e. the ratio of the reduced form (3) OLS coefficient \({\widehat{\gamma }}^{Y}\) and the first stage (4) OLS coefficient \({\widehat{\gamma }}^{D}\) on \({Z}_{i}\). Under effect heterogeneity and standard regularity conditions, Sloczynski (2021) shows that \({\widehat{{\Delta }}}_{2SLS}\) converges to

$$\text{plim}\,{\widehat{{\Delta }}}_{2SLS}=\frac{E\left[{\delta }^{Y}\left({X}_{i}\right){\delta }^{D}\left({X}_{i}\right)Var\left({Z}_{i} \mid {X}_{i}\right)\right]}{E\left[{\delta }^{D}\left({X}_{i}\right)Var\left({Z}_{i} \mid {X}_{i}\right)\right]},$$
(5)

where \({\delta }^{Y}\left({X}_{i}\right)=E\left[{Y}_{i}(1)-{Y}_{i}(0) \mid {X}_{i}, {D}_{i}(1)>{D}_{i}(0)\right]\) is the average covariate-specific effect of treatment on the outcome for compliers. Hence, 2SLS yields a conditional-variance weighted average of covariate-specific effects for compliers. As \(Var\left({Z}_{i} \mid {X}_{i}\right)=P\left({Z}_{i}=1 \mid {X}_{i}\right)\left(1-P\left({Z}_{i}=1 \mid {X}_{i}\right)\right)\), weights attain a maximum when the PS \(P\left({Z}_{i}=1 \mid {X}_{i}\right)=0.5\) (e.g., see Angrist and Pischke 2008). An important but typically under-appreciated consequence of this weighting is that \(\text{plim}\,{\widehat{{\Delta }}}_{2SLS} \ne {{\Delta }}^{CACE}\) if \({\delta }^{Y}\left({X}_{i}\right)\) and \({\delta }^{D}\left({X}_{i}\right)\) depend on the PS. Depending on the correlation structure at hand, this may lead to substantial inconsistencies when using 2SLS.
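A small numerical example may help to see how the weighting in Eq. (5) drives a wedge between the 2SLS limit and the CACE. The sketch below uses a purely hypothetical population with two covariate cells; all numbers are illustrative assumptions and not estimates from the data. It relies on the standard result that, under strong monotonicity, the CACE equals the complier-share weighted average of the cell-specific effects.

```python
import numpy as np

# Hypothetical two-cell population (all numbers are illustrative assumptions).
share = np.array([0.5, 0.5])     # P(X = x)
ps = np.array([0.5, 0.9])        # P(Z = 1 | X = x)
delta_d = np.array([0.2, 0.2])   # first-stage effect, i.e. complier share per cell
delta_y = np.array([0.4, 0.1])   # complier treatment effect per cell

var_z = ps * (1 - ps)            # Var(Z | X = x)

# CACE: complier-share weighted average of the cell-specific effects.
cace = np.sum(share * delta_d * delta_y) / np.sum(share * delta_d)

# Probability limit of 2SLS as in Eq. (5): additionally weighted by Var(Z | X).
plim_2sls = np.sum(share * delta_d * var_z * delta_y) / np.sum(share * delta_d * var_z)

print(cace, plim_2sls)
```

In this example the CACE is 0.25, whereas the 2SLS limit is roughly 0.32, because the cell with a PS of 0.5 has the larger effect and receives more weight under conditional-variance weighting.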

As an alternative, this paper considers IV estimators of \({{\Delta }}^{CACE}\) based on the PS as derived by Frölich (2007) as well as a recent extension by Heiler (2021). These estimators do not restrict effect heterogeneity. As a consequence, they are consistent even when effects are not homogeneous (Frölich 2007; Heiler 2021).

The IV-matching estimator based on the PS pairs up each unit from the groups defined by the instrument with one, multiple or weighted averages of units from the opposite group based on the PS in order to infer the missing counterfactuals. For a unit observed with \({Z}_{i}=0\), let \(\widehat{{Y}_{i}\left(1\right)}\) and \(\widehat{{D}_{i}\left(1\right)}\) denote the counterfactual outcome and treatment under \({Z}_{i}=1\) as estimated by matching. Analogously, for a unit observed with \({Z}_{i}=1\), let \(\widehat{{Y}_{i}\left(0\right)}\) and \(\widehat{{D}_{i}\left(0\right)}\) denote the estimated counterfactual outcome and treatment under \({Z}_{i}=0\). Note that in these expressions the argument refers to the value of the instrument rather than the treatment. Based on these definitions, the IV-matching estimator can be written as

$${\widehat{\delta }}_{MAT}=\frac{\sum _{i=1}^{N}\left[{Z}_{i}\left({Y}_{i}-\widehat{{Y}_{i}\left(0\right)}\right)+\left(1-{Z}_{i}\right)\left(\widehat{{Y}_{i}\left(1\right)}-{Y}_{i}\right)\right]}{\sum _{i=1}^{N}\left[{Z}_{i}\left({D}_{i}-\widehat{{D}_{i}\left(0\right)}\right)+\left(1-{Z}_{i}\right)\left(\widehat{{D}_{i}\left(1\right)}-{D}_{i}\right)\right]}.$$
(6)

To estimate the PS, a standard logit regression is used. Moreover, kernel matching (KM) is employed as it has been shown to be among the top-performing PS-based matching methods in several simulation studies under the selection-on-observables paradigm (e.g., see Frölich 2004; Busso et al. 2014). More specifically, the matching procedure is implemented using an Epanechnikov kernel with a bandwidth chosen via weighted cross-validation (Galdo et al. 2008). To avoid extrapolation, common support is imposed via the min-max criterion by Dehejia and Wahba (1999) as is standard in the PS-based literature (Caliendo and Kopeinig 2008).
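The kernel-matching version of Eq. (6) can be sketched as follows. This is a simplified illustration and not the implementation used in the paper: the PS is taken as given, the bandwidth `h` is fixed by assumption instead of being chosen via weighted cross-validation, and no common support restriction is applied. All function names are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, zero outside |u| <= 1."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def kernel_mean(v_donor, ps_donor, ps_target, h):
    """Kernel-weighted mean of donor values at each target propensity score."""
    w = epanechnikov((ps_target[:, None] - ps_donor[None, :]) / h)
    return (w @ v_donor) / w.sum(axis=1)  # nan if a target has no donor within h

def matched_contrast(v, z, ps, h):
    """Numerator (v = Y) or denominator (v = D) of Eq. (6)."""
    z1, z0 = z == 1, z == 0
    v0_hat = kernel_mean(v[z0], ps[z0], ps[z1], h)  # counterfactual under Z=0 for Z=1 units
    v1_hat = kernel_mean(v[z1], ps[z1], ps[z0], h)  # counterfactual under Z=1 for Z=0 units
    return np.sum(v[z1] - v0_hat) + np.sum(v1_hat - v[z0])

def iv_kernel_matching(y, d, z, ps, h=0.1):
    """Ratio of the matched reduced-form and first-stage contrasts."""
    return matched_contrast(y, z, ps, h) / matched_contrast(d, z, ps, h)
```

Because both contrasts use the same matching weights, the estimator returns the true coefficient whenever the outcome is an exact linear function of the treatment, which is a convenient sanity check.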

The (un-normalized) inverse probability weighting (IPW) IV-estimator can be written as

$${\widehat{\delta }}_{IPW}=\frac{\sum _{i=1}^{N}\left[\frac{{Z}_{i}{Y}_{i}}{{\widehat{P}}_{i}}-\frac{\left(1-{Z}_{i}\right){Y}_{i}}{1-{\widehat{P}}_{i}}\right]}{\sum _{i=1}^{N}\left[\frac{{Z}_{i}{D}_{i}}{{\widehat{P}}_{i}}-\frac{\left(1-{Z}_{i}\right){D}_{i}}{1-{\widehat{P}}_{i}}\right]},$$
(7)

where \({\widehat{P}}_{i}\) is an estimate of the PS \(P\left({Z}_{i}=1|{X}_{i}\right)\). IPW has been shown to be semiparametrically efficient in the IV context (Donald et al. 2014). To estimate the PS, two approaches are used. First, the same logit estimate as for KM is employed. To ensure better performance, the weights of this estimator are normalized, as un-normalized weights may yield unreliable results (Frölich 2004; Busso et al. 2014). Akin to KM, the min-max criterion is used to ensure common support. In the context of IPW, this sort of trimming may be even more important, as IPW with a PS close to zero or one may lead to invalid statistical inference when using the non-parametric bootstrap, as is done for all estimators considered. See Heiler and Kazak (2021) for derivations and alternative bootstrap approaches or Sasaki and Ura (2022) for trimming-based methods.
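The normalized IPW estimator and the min-max trimming rule can be sketched as follows. This is a minimal illustration under the assumption that an estimated PS `p_hat` is already available. The function names are hypothetical, and the min-max rule is implemented here as the intersection of the PS ranges of the two instrument groups.

```python
import numpy as np

def min_max_support(z, p_hat):
    """Flag units whose PS lies inside the PS range of both instrument groups."""
    lo = max(p_hat[z == 1].min(), p_hat[z == 0].min())
    hi = min(p_hat[z == 1].max(), p_hat[z == 0].max())
    return (p_hat >= lo) & (p_hat <= hi)

def ipw_iv(y, d, z, p_hat):
    """Normalized variant of Eq. (7): weights sum to one within each instrument group."""
    w1 = z / p_hat
    w0 = (1 - z) / (1 - p_hat)

    def contrast(v):
        return np.sum(w1 * v) / np.sum(w1) - np.sum(w0 * v) / np.sum(w0)

    return contrast(y) / contrast(d)
```

A typical use would first compute `keep = min_max_support(z, p_hat)` and then call `ipw_iv(y[keep], d[keep], z[keep], p_hat[keep])`.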

Second, as IPW methods may be overly sensitive to the specification of the estimated PS (Schafer and Kang 2008), this paper also uses the novel efficient covariate balancing (ECB) procedure by Heiler (2021) to estimate the PS. This approach specifies a loss function tailored to the estimation of treatment effects using IVs and algorithmically minimizes covariate imbalances, leading to improved bias and variance properties in finite samples compared to standard IPW methods (see Heiler 2021 for details). This approach has several advantages. First, akin to IPW, ECB is semiparametrically efficient. Second, ECB is doubly robust if covariates are specified flexibly. Third, the ECB method tends to shrink the PS, which may alleviate the need to implement heuristic trimming approaches such as the min-max criterion. Due to the last property, IPW based on the ECB is implemented without further common support restrictions.

Ultimately, choosing an IV estimator involves a trade-off: Standard 2SLS is more easily applied than matching or weighting but 2SLS may be inconsistent under effect heterogeneity. Moreover, recent simulation evidence by Sloczynski et al. (2022) suggests that more flexible estimators may even be competitive in terms of mean squared error compared to standard 2SLS. However, more research on the relative performance of IV estimation methods under realistic data-generating processes is necessary to provide better guidance to researchers.

3 Re-estimating the returns to college exploiting college proximity

This section provides empirical evidence on the relevance of potential inconsistencies in 2SLS estimates when an instrument is only valid conditional on covariates. This is done by re-estimating the wage returns to college exploiting college proximity as an instrument using the data originally analyzed by Card (1995).

3.1 Data and descriptives

The data stem from the National Longitudinal Survey of Young Men, which interviewed men aged 14–24 in 1966 with follow-up surveys until 1981. The dataset contains information on 1976 log-earnings, years of education, and an indicator for growing up in a local labor market with an accredited 4-year college as well as covariates. The latter consist of potential experience, indicators for the 1966 census region, an indicator for being black, and living in the south as well as in an urban area in 1966 and 1976. Following Sloczynski (2021), a subset of the original data is analyzed with at least five observations in each covariate cell given by the interactions of the five indicators for being black, living in the south and in an urban area in 1966 and 1976. This restriction results in a sample of 2988 individuals instead of the 3010 originally analyzed by Card (1995).

The main idea of the instrumental variable set-up is that children who grew up near a college may live with their parents throughout their studies and thus face lower costs of post-secondary education, which should increase the likelihood of going to college independent of their ability. Accordingly, the treatment variable is defined as “some college”, i.e. having strictly more than 12 years of education.

Table 1 provides some select descriptive statistics for the sample, split by whether individuals grew up near a college (\({\text{Z}}_{\text{i}}=1\)) or not (\({\text{Z}}_{\text{i}}=0\)).

Table 1 Descriptive statistics

The descriptive statistics reveal quite sizable differences in terms of covariate distributions between groups defined by the binary instrument. The most striking difference can be seen in the likelihood of living in an urban area: 80% of individuals who grew up near a 4-year college lived in an urban area in 1966, whereas the same is only true for 33% among individuals who grew up without a college nearby. Similarly, individuals who lived in the south in 1966 are under-represented among individuals who grew up near a 4-year college: of those who did (not) grow up near a 4-year college, 33 (60) percent lived in the south. Moreover, differences in racial composition and experience are also non-negligible. All of these differences are highly statistically significant as indicated by the small p values obtained from equality of means tests. As these variables tend to show quite strong associations with the outcome of interest, it is unlikely that the instrument is valid without conditioning on covariates. Hence, an unconditional comparison of college attendance rates and log-wages across instrument groups is unlikely to be informative about the true effect of college attendance on earnings.

3.2 Specification and estimation methods

The returns to college will be estimated using two different sets of covariates. First, the main specification of Card (1995)—referred to as the baseline specification—will be used. This specification consists of potential experience in linear and squared form, indicators for the 1966 census region, an indicator for being black, and living in the south in 1976 as well as indicators for living in an urban area in 1966 and 1976. Second, following Sloczynski (2021), a saturated, i.e. fully interacted, specification based on the indicator for being black, living in the south in 1966 and 1976 and living in an urban area in 1966 and 1976 is used. Sloczynski (2021) adopted this flexible specification because Kitagawa (2015) provided evidence in favor of the validity of the instrument after conditioning on these covariates.

To estimate effects of college attendance on wages, the previously discussed methods are used. That is, naïve OLS, standard 2SLS, and more flexible estimators based on the PS are applied. The latter consist of KM with an Epanechnikov kernel using a bandwidth chosen via weighted cross-validation (Galdo et al. 2008), IPW based on a logit estimate of the PS as well as ECB (Heiler 2021). When estimating effects based on the logit estimate of the PS, common support is imposed via the min–max criterion by Dehejia and Wahba (1999) as is standard in the PS-based literature (Caliendo and Kopeinig 2008).

As Sloczynski (2021) raises doubts about the validity of the strong monotonicity assumption, all estimators are also applied using the reordered instrument. Following Sloczynski (2021), this adjusted instrument is obtained by estimating first stage effects non-parametrically for each covariate cell of the saturated specification and then reversing the instrument for individuals estimated to be defiers such that \({Z}_{i}^{R}\) encourages treatment for everyone. This changes the target parameter from CACE to MACE. In order to take care of this additional estimation step when performing statistical inference, standard errors are estimated using the non-parametric bootstrap not just for the PS-based estimators but also for 2SLS when using the reordered instrument. Standard errors are obtained using 999 replications; inference is based on the normal approximation.
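The inference step described above can be sketched as a generic non-parametric bootstrap. This is an illustrative outline under stated assumptions, not the code used in the paper: `estimator` is a hypothetical callable that takes the resampled arrays as keyword arguments and repeats every estimation step on each resample (for example reordering the instrument and estimating the PS), and the critical value of 1.96 for a 95% interval is an assumption.

```python
import numpy as np

def bootstrap_se(estimator, data, n_boot=999, seed=0):
    """Standard deviation of the estimator across resamples drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(next(iter(data.values())))
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample units, keeping rows together
        draws[b] = estimator(**{k: v[idx] for k, v in data.items()})
    return draws.std(ddof=1)

def normal_ci(estimate, se, z_crit=1.96):
    """Confidence interval based on the normal approximation."""
    return estimate - z_crit * se, estimate + z_crit * se
```

Passing the whole pipeline as `estimator` is what lets the bootstrap reflect the uncertainty from the additional reordering step.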

3.3 Implementing matching and weighting

Before turning to actual estimates, it is imperative to check overlap and common support in terms of the PS as well as covariate balancing after matching or weighting (Caliendo and Kopeinig 2008). Figure 1 shows histograms of estimated PS with a bin size of 2.5%. Visual inspection suggests sufficient overlap between instrument groups, independent of the specification and estimation procedure used. Moreover, the PS distributions appear to be sufficiently bounded away from zero or one, which is important for the non-parametric bootstrap employed to be valid (Heiler and Kazak 2021; Sasaki and Ura 2022). Applying the min-max criterion for KM and IPW based on the baseline specification of the logit PS leads to the exclusion of 36 individuals. This equals roughly 1.2% of the sample and thus, one should not be overly concerned that estimated effects are no longer representative of the target estimand. Regarding covariate balance, Table 2 shows the pseudo-\({R}^{2}\) from a logit regression before and after matching or weighting for each specification used. All balancing approaches yield a substantial reduction in imbalance from around 20% to less than 1%. As intended, ECB delivers exact balance, independent of the specification used. Moreover, p values of likelihood-ratio tests suggest that after matching or weighting, covariates are no longer statistically associated with the instrument. Hence, these statistics suggest adequate covariate balance in order to move on to the outcome analysis.

Fig. 1 Propensity score distributions. This figure shows histograms of estimated propensity scores using either a logit regression or the efficient covariate balancing procedure by Heiler (2021). The baseline specification consists of experience, experience squared and indicators for being black, living in the south, urban in 1966 and 1976 as well as census region of residence. The saturated specification consists of group dummies for the fully-interacted set of dummy variables for being black, living in the south in 1966 and 1976 as well as residence in an urban region in 1966 and 1976.

Table 2 Balancing

3.4 Comparing parametric and more flexible estimates of the returns to college

Focusing on the baseline specification in the first two columns of Table 3, one can see that the 2SLS estimate of roughly 0.6 log-points is more than twice as large as the OLS estimate of about 0.24 log-points. This is in line with the findings of Card (1995) using multi-valued years of education as the treatment variable instead of a binary variable as in this case. Similar results are found using the saturated specification: 2SLS yields a point estimate of 0.57 log-points and the naïve OLS estimate is even smaller than when using the baseline specification. Card (1995) attributes the sizable gap in estimates between 2SLS and OLS to possibly higher returns to education among individuals with a relatively poor background as they are the most likely to be induced to receive additional education by the instrument. This may explain why effects are expected to be larger, but estimates appear to be unreasonably large. Sloczynski (2021) argues that the large estimate may be caused by the existence of defiers. Indeed, his results, which are replicated in Table 3, show that when accounting for the existence of defiers by using the reordered instrument, the 2SLS estimate drops substantially to around 0.29 log-points. Nonetheless, the estimated effect is still considerably larger than the effect of roughly 20% suggested by other research on the returns to college (see for example Hoekstra 2009; Smith et al. 2020; Zimmerman 2014).

Table 3 Main Results

Turning to the more flexible estimates based on the PS in columns three to five of Table 3, one can see that matching and weighting estimators yield substantially smaller point estimates of the returns to college than 2SLS. Estimates range from 0.28 to 0.32 log-points for the baseline specification. Estimates using the saturated specification are essentially identical due to their non-parametric nature, independent of whether KM or IPW with a logit or ECB PS is used. These estimates suggest roughly a 0.27 log-point gain in wages from college attendance. If one uses the reordered instrument instead, matching and weighting estimates drop to roughly 0.2 log-points, which is fairly close to the estimates suggested by the literature. Furthermore, Table 3 shows that these smaller point estimates of returns to college are due both to smaller reduced form estimates and to larger first stage effects when using PS-based estimators compared to 2SLS. Overall, the results suggest that the implicit conditional-variance weighting of 2SLS may have a substantial impact on resulting effect estimates when estimating the returns to college using college proximity. 2SLS estimates are somewhere between 50 and 100% larger than more flexible PS-based estimates. These differences are rather sizeable, underscoring the potential value in using more robust PS-based estimators when estimating effects using an IV set-up.

3.5 Inspecting effect heterogeneity

To further illustrate the impact of the conditional-variance weighting by 2SLS, Table 4 compares 2SLS estimates with effect estimates using PS-based estimators as well as the estimates one would obtain if one weighted PS-based estimators with an estimate of the conditional variance of the instrument, i.e. mimicking the asymptotic behavior of 2SLS. This is done for the full sample as well as for two subsamples. For the sake of brevity, results are shown only for the saturated specification with the reordered instrument. Results for the other specifications—which are similar to the ones presented here—can be found in Tables 5 and 6 in the “Appendix”.

Panel A of Table 4 first replicates estimates for 2SLS and the PS-based estimators for the full sample found in Table 3: 2SLS yields an estimate of college returns of 0.289 log-points, whereas PS-based estimators suggest returns of 0.192 log-points. Conditional-variance weighted KM and the other PS-based approaches yield an estimate of 0.289 log-points, which is identical to the 2SLS estimate. Thus, in the fully saturated specification, the difference between 2SLS and more flexible estimators can be entirely attributed to the conditional-variance weighting performed by 2SLS. When using a non-saturated specification, this property breaks down. However, results still clearly show that variance weighting has a major impact on resulting estimates (see Table 5).
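The exact equivalence in the saturated case can be checked numerically. The sketch below is an illustration under stated assumptions and not the paper's implementation: it assumes that, with a saturated specification, PS-based estimates reduce to cell-size weighted ratios of cell-wise mean differences, and it computes 2SLS with cell dummies by demeaning the instrument within cells (the Frisch-Waugh-Lovell result). All function names are hypothetical.

```python
import numpy as np

def cell_estimates(y, d, z, cell):
    """Per cell: size, share with Z=1, reduced-form and first-stage mean differences."""
    rows = []
    for c in np.unique(cell):
        m = cell == c
        dy = y[m & (z == 1)].mean() - y[m & (z == 0)].mean()
        dd = d[m & (z == 1)].mean() - d[m & (z == 0)].mean()
        rows.append((m.sum(), z[m].mean(), dy, dd))
    return np.array(rows).T  # rows: n, p, dy, dd

def weighted_wald(n, p, dy, dd, variance_weights):
    """Cell-size weighted ratio, optionally reweighted by p(1-p) as 2SLS does."""
    w = n * p * (1 - p) if variance_weights else n
    return np.sum(w * dy) / np.sum(w * dd)

def tsls_saturated(y, d, z, cell):
    """2SLS with a full set of cell dummies, via within-cell demeaning of Z."""
    z_tilde = z.astype(float)
    for c in np.unique(cell):
        m = cell == c
        z_tilde[m] -= z[m].mean()
    return (z_tilde @ y) / (z_tilde @ d)
```

With `variance_weights=True` the weighted ratio coincides with `tsls_saturated` up to floating-point error, whereas `variance_weights=False` gives the cell-size weighted counterpart.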

Table 4 Effect heterogeneity—saturated specification with reordered instrument

As pointed out by one of the reviewers, results by Sloczynski (2021) imply that 2SLS is expected to yield similar estimates to more flexible estimators when the (reordered) instrument groups are roughly of the same size, i.e. when \(\text{P}({\text{Z}}_{\text{i}}=1)\approx 0.5\) or \(\text{P}({\text{Z}}_{\text{i}}^{\text{R}}=1)\approx 0.5\). To inspect this implication, Panels B and C of Table 4 estimate effects for individuals who grew up in an urban environment with \(\text{P}\left({\text{Z}}_{\text{i}}^{\text{R}}=1|\text{urban}\right)=0.79\) or in a more rural area with \(\text{P}\left({\text{Z}}_{\text{i}}^{\text{R}}=1|\text{rural}\right)=0.44\). Indeed, 2SLS estimates are much more similar to PS-based estimates in the rural sample (0.342 and 0.347 log-points) than in the urban sample (0.249 and 0.137 log-points). Again, these differences are completely accounted for by the conditional-variance weighting. Hence, it appears that 2SLS is expected to yield estimates close to more flexible estimators when instrument groups are of roughly equal size because the conditional-variance weighting plays less of a role in that case.

4 Conclusion

By re-examining the Card (1995) data on college proximity and the returns to college, this paper shows that potential inconsistencies in 2SLS estimates of local treatment effects documented in the theoretical literature are not merely a hypothetical threat when effects are heterogeneous. For the data at hand, 2SLS yields systematically larger effects than more flexible estimators based on the PS with differences amounting to roughly 50 to 100%. It is shown that this is because standard linear-in-covariates 2SLS yields a conditional variance-weighted average effect, putting more weight on units with a PS close to a coin flip. In line with theoretical predictions by Sloczynski (2021), the results suggest that 2SLS estimates can be expected to be more trustworthy when sample shares of instrument groups are roughly of equal size. Moreover, the paper shows that this is because the effects of conditional-variance weighting tend to be less severe when group sizes are similar. Overall, the results show that the presumption that 2SLS yields point estimates close to more flexible estimators based on the PS as argued by Angrist and Pischke (2008) does not apply in general and that one should be suspicious of 2SLS estimates when group sizes differ substantially and covariates are predictive of the instrument. In that case it may be best to use semi- or non-parametric estimation techniques instead. At the very least, one should use these methods to assess the sensitivity of estimates regarding implicit parametric assumptions made when using linear-in-covariates 2SLS.