1 Introduction

The euro represents a historic policy experiment. There is little precedent for such a large number of wealthy countries multilaterally surrendering control of their currencies. In their assessment of the challenges facing the eurozone in the depths of the European sovereign debt crisis, O’Rourke and Taylor (2013) show that most prior currency unions make for a poor comparison to the European Monetary Union (EMU). It should not then be surprising that estimates of the impact of the EMU on trade differ so dramatically from those of studies of these earlier unions. Assignment into the EMU (the policy treatment, to borrow language from experimental studies) is not remotely random. An important consideration is how closely the control group (non-EMU country-pairs) compares on observable characteristics to the pairs that receive the policy treatment. In recent work, Rose (2017) surveys the literature on EMU trade effects, finding that whether a study finds large or small estimates comes down largely to sample choice. Rose (2017) argues that more data should be better, which would suggest that the trade effect is big. I show that sample selection is indeed crucial to determining the size of EMU coefficients, but that studies which reduce the sampleFootnote 1 may at times offer an improvement in the comparability of these groups. I propose a data-driven propensity score method of rebalancing the EMU and non-EMU samples to better match on observed characteristics. This not only provides an empirically driven way to select the sample, but also utilizes a robust weighting estimator that leverages gains from matching estimators, while still using well-studied and theoretically grounded gravity equations of trade to estimate treatment effects.

The effect of currency unions on trade has been hotly debated. Empirical work stems from Rose (2000), whose gravity equation estimates suggest that a common currency more than doubles bilateral trade between countries. This was, by the author’s own admission, unreasonably large but surprisingly robust (Rose 2002). A cottage industry sprang up to minimize the currency union effect, with Nitsch (2002) suggesting a weaker effect, and Glick and Rose (2002) showing that a smaller, though still extremely large, estimate survives limiting estimation to the time-series within-pair variation. More recently, attention has focused on estimating these effects for the euro area, with Glick and Rose (2016) updating their earlier currency union estimates to include the reasonably long time series of euro data. They estimate a roughly 54% increase in trade from EMU membership. This is much larger than the effect found by many others in the literature. A meta-analysis carried out in Polák (2019) finds estimates of an EMU trade effect in the 2%-6% range, with more recent evidence pointing to little, if any, impact at all. Rose (2017) suggests that much of the difference among EMU trade estimates comes from sample selection, with effects that increase as new years are added to the sample, and that are smaller when limiting the sample to subsets of rich countries (roughly 12% gains) rather than using the full bilateral export data as in Glick and Rose (2016). Notably, my preferred estimates are closely in line with Kelejian et al. (2012), who estimate the effect of the euro on trade on a relatively homogeneous sample while accounting for the spatial and persistent nature of trade.

Larch et al. (2019) reproduce the results of Glick and Rose (2016) using the Poisson pseudo-maximum likelihood (PPML) estimators suggested in Silva and Tenreyro (2006) and Silva and Tenreyro (2011), who show that log specifications of the gravity equation produce biased estimates in the presence of heteroskedasticity. Larch et al. (2019) show that this heteroskedasticity is particularly important in the context of EMU trade estimates. Notably for my work, they find that the inclusion of small countries in the dataset has a sizeable impact on estimates of the EMU effect, amplifying the difference between PPML and OLS results. They find, similar to Rose (2017), that estimates of the currency union and EMU effects vary substantially when altering the sample (OECD, upper income, etc.). My results echo this conclusion, while adding to their findings a data-driven method of sample selection, chosen to reduce the differences in observable characteristics between treated and control sub-populations. In addition, I find that the PPML estimator provides much more stable estimates when the underlying sample changes, further motivating its use relative to log-gravity estimators when problems of sample selection are large.

I am not the first to apply propensity score methods to the currency union effect on trade, nor even to the case of the EMU. Persson (2001) shows that using matching estimators to estimate the average treatment effect of currency unions significantly reduces their estimated impact. Similar matching estimators are used by Chintrakarn (2008) in the context of the euro area, again reducing estimates. Their work relies on estimation of selection into treatment groups, using logit and probit models of the probability of being in a currency union (or the EMU) and then comparing the conditional mean of treated observations with that of control groups with a similar likelihood of being treated. While these estimators work to address the selection problem that is endemic in macro policy estimates of trade, they fail to leverage the well-studied gravity equations used in work such as Glick and Rose (2016). I instead combine the propensity score methods used in work such as Persson (2001) and Chintrakarn (2008) with calculations of the conditional mean using the gravity equation approaches that are standard in the trade literature. My methodology belongs to a class of doubly robust estimators, which model both selection into treatment and the policy effect on outcomes. Such methods are described in detail in Imbens (2004), Lunceford and Davidian (2004), and Wooldridge (2007), with an application in the context of macroeconomic policy (fiscal shocks) in Jordà and Taylor (2016).

Millimet and Tchernis (2009) apply my preferred estimator, inverse propensity score weighting with regression adjustment (IPWRA), to the EMU. Their work is primarily interested in providing guidance on specification of the first-stage model, showing that over-fitting such models can be beneficial. Their EMU application studies the period of 1999-2002 and focuses on 22 developed countries. They find an impact in line with similar panel OLS estimates, such as Micco et al. (2003), with a roughly 12% increase in trade due to EMU membership. My estimation updates their results to a much larger trade dataset, while framing IPWRA as a way to mitigate the problem of selection into EMU membership. This latter point is important because the IPWRA procedure is not only a useful estimator: its first stage also provides a clean empirical way to demonstrate the appropriateness of the sample used in estimation, something ignored in much of the trade literature. Kopecky (2023) uses similar propensity score methods to study disaggregated estimates of currency union trade effects.Footnote 2 At present, sample choice across this literature appears to be completely arbitrary, with Micco et al. (2003), and many others, restricting their estimation sample to developed countries, while other authors make an appeal similar to that of Rose (2017): more data is better.

The desire to use the largest amount of data is understandable. However, problems of selection bias in estimates of currency unions are well known. It was precisely this issue that motivated Persson (2001) to use matching estimators. While Rose (2001) makes compelling arguments for why the pure matching approach has flaws, the estimator I suggest utilizes the strengths of both these propensity scores and traditional gravity approaches. Knowing the potential issues of selection, increasing the sample size to include observations that have little in common with EMU members appears to bias estimates upwards. This works in the same way that including large pools of healthy, low risk, individuals as a comparison group may bias observational estimates on the efficacy of a medical treatment downward. Their better health makes them less likely to receive the treatment, but lack of treatment has no causal bearing on their health outcomes. The experimental ideal would use randomization to compare those receiving the medical treatment to individuals who need such treatment, but do not receive it. Of course this is not possible with observational data. It is precisely such contexts for which the IPWRA style estimators (and earlier matching estimators) were developed.Footnote 3 While I do not claim that this estimator removes all potential issues of selection in the context of the EMU, improving the comparability of treatment and control observations should only work to reduce such problems as much as possible given the observable characteristics available in standard bilateral trade data.

Beyond its academic and methodological interest, the answer to this question is critical for policy. The global financial crisis highlighted many of the challenges and costs associated with eurozone membership. Aizenman (2018) highlights some of these challenges for the euro and for currency unions in developing markets. The recent inflationary episode will likely prove another hurdle. Trade is only one benefit of the euro to member states, but it will be important for policy going forward to understand which of the diverse set of estimates of the trade impact of the euro and other currency unions accurately reflects the scale of that benefit. My work suggests trade impacts that are small but positive, and that remain statistically and economically meaningful.

2 Data and Methodology

I use the CEPII gravity dataset from Conte et al. (2021), constructing a measure of EMU currency union membership consistent with existing measures from Glick and Rose (2016), but excluding some small French territories that are included in their analysis.Footnote 4 In all results below, I drop non-EMU currency union pairs to ensure that my baseline comparison group is not polluted by country-pairs in a different currency union, which may similarly affect their trade. Compared to Rose (2017), my full dataset includes more years (covering 1948-2019), but fewer country-pairs. When restricting to log-exports, as in their analysis, I have 29,394 country-pairs compared to their 34,104. The reduction in country-pair observations arises in part because I drop non-EMU currency union observations, but also because the source data from Conte et al. (2021) have slightly different coverage than the IMF-DOTS data used in their analysis. As a result, initial estimates using the Glick and Rose (2016) model are close to, but slightly different from, their results.

My starting point for estimation is the gravity specification suggested in Head and Mayer (2014). This is the “theory-consistent” method of specifying this empirical relationship, accounting for the multilateral resistance terms that are important in structural gravity equations. To accomplish this I include a full set of exporter-year and importer-year dummy variables. In addition to these, I include a time-invariant set of country-pair fixed effects, which have been shown by many to be important in the context of currency unions, and combined make up the preferred specification in Glick and Rose (2016). This is given by:

$$\begin{aligned} ln(X_{ijt}) = \gamma EMU_{ijt} + \beta RTA_{ijt} + \lambda _{it} + \psi _{jt} + \phi _{ij} + \epsilon _{ijt} \end{aligned}$$
(1)

where \(X_{ijt}\) are exports from country i to country j at time t, \(EMU_{ijt}\) is a dummy variable representing EMU membership for a country-pair in year t, \(RTA_{ijt}\) is a control for regional trade agreements, \(\lambda _{it}\) are exporter-time fixed effects, \(\psi _{jt}\) are importer-time fixed effects, and \(\phi _{ij}\) are time-invariant country-pair fixed effects. While I could include a large set of standard gravity controls, nearly all are inestimable when using exporter/importer-year and pair fixed effects, so I omit them from discussion here. Including the few with enough variation to remain in this specification has no bearing on estimates of \(\hat{\gamma }\).

Silva and Tenreyro (2006) provide a now well-known critique of Eq. (1), showing that under heteroskedasticity, the log-linearized model of trade leads to biased estimates. They show in both Silva and Tenreyro (2006) and Silva and Tenreyro (2011) that estimates of Eq. (1) on trade data suffer substantially from this bias, and suggest instead using a Poisson pseudo-maximum likelihood (PPML) estimator without taking logs. This methodology has the advantage of not only dealing with the bias introduced by the log-linear approximation, but also allowing for the inclusion of zero trade flows. While I will continue to use the log gravity specification to link my work to the results of Rose (2017), I also provide estimates of the equivalent PPML specification:

$$\begin{aligned} X_{ijt} = e^{\gamma EMU_{ijt} + \beta RTA_{ijt} + \lambda _{it} + \psi _{jt} + \phi _{ij}} + \epsilon _{ijt} \end{aligned}$$
(2)

where controls and fixed effects are defined identically to those above.
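The percentage trade effects quoted throughout (for example "roughly 54%" or "roughly 12%") relate to the coefficient \(\gamma\) on the EMU dummy in Eqs. (1) and (2) through the standard transformation \(100\times (e^{\gamma }-1)\). The short Python sketch below illustrates this conversion only; the function names and the coefficient values in the loop are hypothetical and are not estimates from this paper.

```python
import math


def coefficient_to_percent(gamma: float) -> float:
    """Percent change in trade implied by a dummy coefficient gamma
    in a log-linear (Eq. 1) or exponential-mean (Eq. 2) gravity model."""
    return 100.0 * (math.exp(gamma) - 1.0)


def percent_to_coefficient(percent: float) -> float:
    """Inverse mapping: the coefficient that implies a given percent change."""
    return math.log(1.0 + percent / 100.0)


if __name__ == "__main__":
    # Hypothetical coefficient values, chosen only for illustration.
    for gamma in (0.05, 0.10, 0.40):
        pct = coefficient_to_percent(gamma)
        print(f"gamma = {gamma:.2f} -> {pct:.1f}% change in trade")
```

Under this mapping, a finding that a common currency "more than doubles" trade corresponds to a coefficient above \(\ln 2\).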

2.1 Inverse Propensity Score Weighting: A Doubly Robust Estimator

Propensity scores feature heavily in the rigorous debate around the size of the currency union effect on trade. In his original rebuttal of the eye-popping estimates of Rose (2000), Persson (2001) showed that matching estimators substantially reduce the impact of the currency union effect on trade. His methodology uses two estimators: nearest-neighbor matching, and stratification. Both are two-step estimators that rely on a first stage estimate of the probability of selection into a currency union, and then generate estimates for the average treatment effect using a difference in means among these groups. The nearest neighbor method pairs each treated observation with its most comparable control group, while stratification bins treated and control observations according to their probability of treatment and takes a weighted difference in means within each bin. The goal of these methods is to deal with concerns of selection, well understood in the context of currency unions and the euro area.Footnote 5 Similar methods are used in a more recent estimate of the euro area in Chintrakarn (2008), who again suggests a downward revision of eurozone estimates after applying matching estimators.

In an initial response to Persson (2001), Rose (2001) correctly critiques the probit model used for selection into currency unions as a poor fit for the data. Histograms of treated (currency union) probabilities reveal that most of the mass of the predicted likelihood of being in a currency union falls near the bottom of the distribution. Matching estimators such as those used in Persson (2001) rely on the identification of the model of selection into treatment for the second-stage difference-in-means estimators. Identification hinges on converting observational data into something closer to a randomized controlled trial, under the assumption that rebalancing treatment and control observations will ameliorate selection issues. However, the lack of fit and the poor out-of-sample predictions of currency unions should give pause to those relying on the soundness of this identification, especially when comparing their results to the well-studied and theoretically motivated gravity equation. The case that Rose (2001) makes against their use is rather convincing, particularly when comparing to the modern specifications of gravity that control for time-varying country-specific multilateral resistance terms, as Glick and Rose (2016) do.

However, such critiques of the chosen matching estimators do not address the concerns of selection that motivated the work of Persson (2001) and Kenen (2002). Perhaps it is possible to have our cake and eat it too? A different class of estimators uses propensity scores to construct doubly robust estimators, which replace the difference-in-means second stage with other second-stage estimates of conditional means. I make use of inverse propensity score weighting with regression adjustment (IPWRA). This process involves specifying a first-stage model of selection into treatment (the EMU), but instead of a difference-in-means approach, I then estimate a second-stage model using regression specifications with propensity scores serving as weights. This allows estimates to be rebalanced using first-stage estimates similar to those of Persson (2001), while still using the well-studied gravity equations to estimate conditional means across treated and control sets. The doubly robust nature of the IPWRA estimator is shown and discussed in great detail in work such as Imbens (2004), Lunceford and Davidian (2004), and Wooldridge (2007): the estimator yields consistent estimates of the average treatment effect if either model is correctly specified. I follow similar notation to Jordà and Taylor (2016), who develop this estimator in a macroeconomic context.Footnote 6 The IPWRA estimator is given by:

$$\begin{aligned} \widehat{ATE}_{IPWRA}=\frac{1}{n^{*}_{1}}\sum _{}\left[ \frac{EMU_{ijt}\left( m_{1}\left( Z_{ijt},\hat{\beta }\right) \right) }{\hat{p}_{ijt}}\right] - \frac{1}{n^{*}_{0}}\sum _{}\left[ \frac{(1-EMU_{ijt})\left( m_{0}\left( Z_{ijt},\hat{\beta }\right) \right) }{1-\hat{p}_{ijt}}\right] \end{aligned}$$
(3)

where \(EMU_{ijt}\) is a dummy representing whether a country-pair is in the eurozone at time t, and \(\hat{p}_{ijt}\) is the first-stage estimate of the probability of treatment. A general term for any second-stage estimate of conditional means for treated (1) and control (0) groups is given by: \(m_{0/1}\left( Z_{ijt},\hat{\beta }\right)\).Footnote 7 I estimate this conditional mean using Eqs. (1) or (2), where \(\hat{\beta }\) now represents the vector of all estimated regression coefficients. In principle, \(\hat{\beta }\) can be estimated separately for treated and control populations, as described in Sloczyński and Wooldridge (2018). However, this is not possible while using the fully specified theory-consistent gravity equation described by Head and Mayer (2014), as specifying the conditional mean across treatment and controls requires a high dimension of fixed effects, larger than the number of EMU observations in the sample. As such, I am limited in this application to the assumption that the coefficients used to estimate these conditional means are identical across groups, an assumption that is of course also made in traditional gravity specifications. In keeping with suggestions made in Hirano and Imbens (2001) and Imbens (2004), these weighted averages are normalized, with \(n^{*}_{1} = \sum \frac{EMU}{\hat{p}}\) and \(n^{*}_{0} = \sum \frac{1-EMU}{1-\hat{p}}\), to ensure that the probability weights sum to one.

In practice, estimation of Eq. (3) is quite straightforward. Upon obtaining first stage estimates of \(\hat{p}\) inverse propensity score weights are assigned as \(\frac{1}{\hat{p}}\), to treated (EMU) observations, and \(\frac{1}{1-\hat{p}}\) to controls (non-EMU). The normalization described above simply divides these weights by the sum of all weights to ensure that they sum to one. The second stage requires any conditional mean estimate \(m_{1/0}(Z_{ijt},\hat{\beta })\), which for my purposes is the coefficient estimate of the EMU in the gravity equation. Note that I assume that \(\hat{\beta }\) is the same for treated and control groups. This assumption allows me to run a single gravity equation with the EMU dummy using the inverse propensity scores as weights.Footnote 8 It is possible to allow for separate estimation of the conditional means, (ie \(m_{1}(Z_{ijt},\hat{\beta }_{1}), m_{0}(Z_{ijt},\hat{\beta }_{0})\)). Doing so would mean estimating the gravity equation separately on EMU/non-EMU observations and then weighting the difference of conditional means of the outcome.Footnote 9
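The weighting step described above can be sketched in a few lines of Python. This is a minimal illustration under strong simplifying assumptions, not the estimation code behind the results: the function names and toy data are hypothetical, the weights are normalized within the treated and control groups following the definitions of \(n^{*}_{1}\) and \(n^{*}_{0}\), and the second stage is reduced to a regression on a constant and the EMU dummy (no fixed effects or other controls), in which case the weighted least squares coefficient equals the difference in weighted means.

```python
def ipw_weights(treated, pscore):
    """Normalized inverse propensity weights.

    Treated observations receive 1/p, controls 1/(1-p); each group's
    weights are then rescaled to sum to one (n1* and n0* in Eq. 3).
    """
    raw = [1.0 / p if d == 1 else 1.0 / (1.0 - p)
           for d, p in zip(treated, pscore)]
    n1 = sum(w for w, d in zip(raw, treated) if d == 1)
    n0 = sum(w for w, d in zip(raw, treated) if d == 0)
    return [w / n1 if d == 1 else w / n0 for w, d in zip(raw, treated)]


def weighted_dummy_coefficient(y, treated, weights):
    """Weighted difference in means between treated and controls.

    With group-normalized weights this equals the WLS coefficient on the
    treatment dummy in a regression of y on a constant and the dummy.
    """
    mean1 = sum(w * v for w, v, d in zip(weights, y, treated) if d == 1)
    mean0 = sum(w * v for w, v, d in zip(weights, y, treated) if d == 0)
    return mean1 - mean0


if __name__ == "__main__":
    # Toy data (hypothetical): outcome, EMU dummy, first-stage p-scores.
    y = [2.3, 2.0, 1.4, 1.1, 0.9]
    emu = [1, 1, 0, 0, 0]
    p_hat = [0.8, 0.4, 0.5, 0.2, 0.1]
    w = ipw_weights(emu, p_hat)
    print(weighted_dummy_coefficient(y, emu, w))
```

In the full application the same weights would instead enter the gravity regression of Eq. (1) or (2) with its complete set of fixed effects.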

In the related literature on free trade agreements, Baier and Bergstrand (2007) discuss the difficulties of estimating causal effects using instrumental variable and control function approaches in panel data, suggesting that panel fixed effects methods analogous to Eq. (1) should address many of the endogeneity concerns plaguing this literature. In the context of the euro effect on trade, the wide range of estimates from similar panel fixed effects models, documented in meta-analyses such as Rose (2017) and Polák (2019), suggests there are likely unaddressed endogeneity concerns here. One contribution of my work is to introduce the “doubly-robust” estimator in Eq. (3), which has been widely used elsewhere, to the trade literature. This provides a means of combining the potential benefits of the theory-consistent gravity approachFootnote 10 with information gained from estimates of selection into the EMU itself. Indeed, in later work, Baier and Bergstrand (2009) use traditional matching estimators for the effect of regional trade agreements. The method used here could combine such estimates with the gravity equation approach in Baier and Bergstrand (2007) to protect against mis-specification of either model. If the panel fixed effects model is correctly specified, then this approach should not affect results.

2.2 Modeling Selection Into Currency Unions

I estimate selection into currency unions using a logistic specification controlling for a wide range of country and pair-specific controls. I include in these a number of standard gravity equation variables, as well as other characteristics that may be important for the formation of bilateral currency union agreements. My first stage specification is given in Eq. (4).

$$\begin{aligned} {\begin{matrix} EMU_{ijt} = \theta _{0} &{} + \theta _{1}\ln (Y_{it}\times Y_{jt}) + \theta _{2}\ln (y_{it}\times y_{jt}) + \theta _{3}\ln Dist_{ijt} \\ &{} + \theta _{4}\ln |Y_{it} - Y_{jt}| + \theta _{5} \ln |y_{it} - y_{jt}| + \theta _{6}\left| g_{it} - g_{jt} \right| \\ &{} + \theta _{7}Population_{it} + \theta _{8}Population_{jt} + \theta _{9}Z_{ijt} + \epsilon _{ijt} \end{matrix}} \end{aligned}$$
(4)

In Eq. (4), in addition to the first three terms, which are the standard gravity equation measures of size of output (\(Y_{it/jt}\)), incomes (\(y_{it/jt}\)), and distancesFootnote 11 (\(Dist_{ijt}\)) between the two countries, I also include the differences in output, per-capita GDP, and GDP growth rates (\(g_{i/j}\)). These are important because the standard gravity terms may do a poor job of capturing the fact that many trading partnerships occur between relatively rich and poor (or large and small) economies, in ways that may be systematically different for the average eurozone economy. Further, I include in my baseline specification a number of additional controls, including the populations of origin and destination countries, as well as a rich set of pair-specific binary characteristics commonly used in the gravity literatureFootnote 12 captured by \(Z_{ijt}\). These controls may be important for capturing the various geopolitical motivations for forming a currency union. Moreover, Brookhart et al. (2006) show that inclusion of variables related to the outcome of interest, even if unrelated to first-stage treatment, decreases the variance of the estimated first stage without increasing bias. Further, Millimet and Tchernis (2009) show via Monte Carlo simulation that there are potential benefits of over-specifying the propensity score estimator, so I include squared terms of each of the continuous regressors in Eq. (4) along with their linear terms.
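To make the construction of the first-stage regressors concrete, the following Python sketch builds the continuous terms of Eq. (4) and their squares for a single pair-year, and shows the logistic link that maps a linear index into a predicted probability. It is an illustration only: all function and variable names are hypothetical, the binary pair characteristics in \(Z_{ijt}\) are omitted, and treating the two population terms as continuous regressors that also receive squared terms is an assumption of this sketch.

```python
import math


def first_stage_features(Y_i, Y_j, y_i, y_j, dist, g_i, g_j, pop_i, pop_j):
    """Continuous regressors of Eq. (4) plus their squares, as a dict.

    Y: GDP, y: GDP per capita, g: GDP growth rate, dist: bilateral distance.
    """
    linear = {
        "ln_gdp_product": math.log(Y_i * Y_j),
        "ln_gdppc_product": math.log(y_i * y_j),
        "ln_dist": math.log(dist),
        # The log of an absolute difference is undefined when the two
        # values are equal; such pairs would need separate handling.
        "ln_abs_gdp_diff": math.log(abs(Y_i - Y_j)),
        "ln_abs_gdppc_diff": math.log(abs(y_i - y_j)),
        "abs_growth_diff": abs(g_i - g_j),
        "pop_i": pop_i,
        "pop_j": pop_j,
    }
    # Over-specification: add the square of every continuous term.
    squared = {name + "_sq": value ** 2 for name, value in linear.items()}
    return {**linear, **squared}


def logistic(index):
    """Logistic link: maps a linear index into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-index))


if __name__ == "__main__":
    # Made-up values for one country-pair in one year.
    row = first_stage_features(2.0, 1.0, 4.0, 2.0, 10.0, 0.03, 0.01, 5.0, 3.0)
    for name, value in row.items():
        print(f"{name}: {value:.4f}")
```

The predicted probability for a pair-year would then be `logistic` applied to the fitted linear index of these features, with coefficients obtained from any standard logit routine.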

The first stage specification outlined above allows both weighting and sample truncation, but remains somewhat ad hoc. My emphasis, as I will show below, is on improving the comparability between treated and control observations in a way that is data driven, and that also easily links to the existing estimates on the euro trade effect. While I limit myself to factors commonly observed in trade data, other models using more detailed political/historical/demographic data may provide a better fit for the first stage selection process and improve upon the validity of my estimators. In part I wish to demonstrate, using widely available trade variables how differences already documented in Rose (2017) are closely related to the comparability of control groups.

3 First Stage Estimation and Sample Selection

To generate average treatment effects of the eurozone given by the IPWRA estimator of Eq. (3), I must first specify the probability of selection into the EMU given by a first-stage estimation of Eq. (4). I report first-stage estimates for a number of models, beginning with one that uses only the traditional gravity equation variables, then a specification that uses all of the controls described in the discussion of Eq. (4) excluding second-order terms, and then adding these second-order terms to obtain my baseline specification. I consider two extensions of this baseline, first adding regional trade agreements and GATT membership at origin and destination, then adding to this European Union membership. The results of this first-stage estimation are reported in Table 1.

Table 1 First Stage Models

The model generally finds sensible coefficients, with EMU members having positive coefficients for the log product of GDP and of GDP per-capita, suggesting they have larger output and are richer than the average pair in the sample, but also negative coefficients on their absolute differences, which suggests they are closer in relative economic size and well-being than the average trading partners. In Table 1 I exclude the non-linear squared terms present in models 3-5 to conserve space. The extended models (in columns 4 and 5) drop the vast majority of the data, because the absence of an RTA or of EU membership perfectly predicts “failure” (i.e., all EMU members have these attributes). Thus inclusion of one, or both, drops all observations for which the variable equals zero.

Because fit is improved, and because Millimet and Tchernis (2009) suggest that IPWRA estimators may perform better with an over-specified model, I use probabilities estimated using both the linear and squared terms of the variables in Eq. (4). Since I wish to compare my estimators to the log gravity estimates of Rose (2017), I keep a baseline weighted estimate that is relatively close in sample size, and thus use the model from column 3 in Table 1 for my baseline estimates below. As I will show below, the estimates on my sample with non-zero predicted propensity scores in this model are quite similar to those in Rose (2017), making for a convenient baseline comparison for the rest of my results.

With such a large dataset, and euro membership quite rare, the model struggles to fit this first stage particularly well, with the mean predicted probability of membership among EMU pairs at just 36.8%. A flaw of relying entirely on the strength of the matching estimator, as in work such as Persson (2001) and Chintrakarn (2008), is that these models have weaker fit, and less theoretical justification, than the gravity equation. An important strength of the IPWRA estimator is that it uses information from this first stage to rebalance treatment and control observations, while also relying on the gravity equations in Eqs. (1) or (2) to calculate the conditional means across those groups. The first-stage estimates in columns 4 and 5 do a better job of identifying the factors that determine EMU membership, and a much better job in the specification that drops non-EU members. The inclusion of European Union membership as a control increases the mean predicted probability of EMU members to 0.61. While I will not use these estimates in my baseline specification, as they are difficult to compare directly to work such as Glick and Rose (2016) that uses larger datasets, I will show that such estimation may improve comparability, but that estimates are similar to those from my preferred specification using the baseline first stage.

It is important that there is sufficient overlap in propensity score estimates, such that there are comparison observations across treatment and control groups. While the majority of weight of control population is near zero, I show in Fig. 1 that there is overlap across the full distribution of treated observations in my baseline (column 3) first stage model. Although I use separate axes for readability, there are more non-EMU observations with a probability greater than the mean of 0.368 than EMU observations.

Fig. 1 Overlap: K-density plots of treated and control propensity scores

Truncating the sample to observations whose propensity scores lie within the range spanned by the EMU observations, [0.001, 0.893], results in a sample of 34,971 (there are 4,690 pair-year observations between euro members). Such truncation is commonly used to limit the impact of outliers. As can be seen in Eq. (3), very low probability treated observations (EMU pairs that look much more like controls) and very high probability controls (non-EMU pairs that look like treated pairs) can receive large weights at the extreme tails of the probability distribution. Many studies making use of IPWRA and other propensity score estimators therefore consider truncating the sample to limit such influence. This is often done using an arbitrary rule of thumb, such as 1% or 5% cutoffs. Imbens (2004) suggests that the potential threat of outliers shrinks with sample size, providing \(1/\left[ N\times (1-\hat{p}_{max})\right]\) and \(1/\left[ N\times (\hat{p}_{min})\right]\) as bounds for the influence of observations at the top and bottom of the probability distribution. Using the sample that lies along the range of the treatment observations ([0.001, 0.893]) limits the potential effect of outlier treatment and control observations on the estimator to less than 3%, well within the bounds used in other studies.
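The truncation rule and the influence bounds can be illustrated with a short Python sketch. The function names and the toy p-scores are hypothetical; the bound calculation at the end plugs in the figures quoted above, under the assumption that \(N\) is the truncated sample size of 34,971.

```python
def truncate_to_treated_range(treated, pscore):
    """Keep observations whose p-score lies within the range spanned by
    the treated observations. Returns kept indices and the range."""
    treated_scores = [p for d, p in zip(treated, pscore) if d == 1]
    p_min, p_max = min(treated_scores), max(treated_scores)
    keep = [i for i, p in enumerate(pscore) if p_min <= p <= p_max]
    return keep, p_min, p_max


def influence_bounds(n, p_min, p_max):
    """Bounds on the influence of observations at the top and bottom of
    the p-score distribution: 1/[N(1-p_max)] and 1/[N p_min]."""
    top = 1.0 / (n * (1.0 - p_max))
    bottom = 1.0 / (n * p_min)
    return top, bottom


if __name__ == "__main__":
    # Toy example (hypothetical p-scores).
    emu = [1, 1, 0, 0, 0]
    p_hat = [0.2, 0.8, 0.05, 0.5, 0.9]
    print(truncate_to_treated_range(emu, p_hat))

    # Figures quoted in the text; N = 34,971 is assumed to be the
    # relevant sample size for the bound.
    top, bottom = influence_bounds(34971, 0.001, 0.893)
    print(f"top bound: {top:.5f}, bottom bound: {bottom:.5f}")
```

With these inputs the larger of the two bounds is the one at the bottom of the distribution, at a little under 0.03, which is consistent with the "less than 3%" figure.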

More important in this context, and for the discussion of the proper choice of sample in Rose (2017), in large trade datasets the opposite problem to that of influential outliers appears to affect Eq. (3). Rather than being driven by low-probability treated or high-probability control outliers, estimates on the full sample are strongly influenced by the large number of non-EMU observations with an extremely low probability of treatment. These receive smaller weights in Eq. (3), but the weighting penalty only slightly diminishes this group’s outsized effect on the estimators, as I will show. This problem is larger in unweighted estimates of Eqs. (1) and (2). Another way to show this is by censoring p-scores (keeping the observation but setting an upper/lower threshold for their weights), which has little effect relative to IPWRA estimates on the full data. The fact that truncating these observations changes the results dramatically, while imposing arbitrary lower/upper probability limits on the full sample does not, implies that the large mass of low-probability observations alters estimates in spite of their low weights, not because of them. For clarity of exposition I do not include such censored results below.
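The distinction between censoring and truncating can be made concrete with a minimal Python sketch. The helper names and example values are hypothetical, and censoring is implemented here by clipping the estimated probability itself, which in turn caps the implied inverse probability weight.

```python
def censor_pscores(pscore, lower, upper):
    """Censoring: keep every observation, but clip its p-score to
    [lower, upper], which caps the implied inverse probability weight."""
    return [min(max(p, lower), upper) for p in pscore]


def truncate_pscores(pscore, lower, upper):
    """Truncation: drop observations with p-scores outside [lower, upper]."""
    return [p for p in pscore if lower <= p <= upper]


if __name__ == "__main__":
    p_hat = [0.0001, 0.2, 0.95]  # hypothetical first-stage estimates
    print(censor_pscores(p_hat, 0.001, 0.893))    # same number of observations
    print(truncate_pscores(p_hat, 0.001, 0.893))  # extreme observations removed
```

Censoring leaves the composition of the sample unchanged, whereas truncation removes the extreme observations altogether, which is why the two can lead to very different estimates.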

I agree with Rose (2017) that sample choice is critical in the estimation of EMU trade effects. Before presenting estimates of the EMU I show that how the sample is selected points towards preferring estimates made on a smaller subset of data. I consider five samples, built in various ways using the propensity scores from column 3 of Table 1. These are: the full sample including zero and missing trade flows,Footnote 13 the full sample with non-zero trade (ie using the log specification), that for which predicted probabilities of treatment are non-zero (ie the largest sample where the IPWRA is estimable), a truncated sample with probabilities limited to the range of the treatment group, and finally an ad hoc sample of upper income countries. This last group is consistent with the definition in Rose (2017), and includes only origin and destination countries with real GDP per-capita greater than \(\$12,736\).

Figure 2 shows the difference in means between EMU (treatment) and non-EMU (control) observations across each of these samples for many of the controls used in Eq. (4). I provide both the unweighted and weighted differences in means using propensity score estimates when possible. In an ideal situation, such as a properly run randomized control trial, these groups should be identical along observable characteristics and these differences should be zero. This exercise highlights the value that the first stage propensity score process has in improving the comparability of these two groups.
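A balance check of this kind can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the procedure used to build Fig. 2: the covariate values and propensity scores are made up, and controls are reweighted by p/(1-p), an ATT-style weight assumed here for concreteness.

```python
def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def diff_in_means(x, treated, weights=None):
    """Treatment mean minus control mean, optionally weighted."""
    if weights is None:
        weights = [1.0] * len(x)
    t = [(v, w) for v, d, w in zip(x, treated, weights) if d]
    c = [(v, w) for v, d, w in zip(x, treated, weights) if not d]
    mean_t = weighted_mean([v for v, _ in t], [w for _, w in t])
    mean_c = weighted_mean([v for v, _ in c], [w for _, w in c])
    return mean_t - mean_c


# Hypothetical covariate (e.g. log GDP) and first-stage propensity scores.
x = [10.0, 12.0, 11.0, 5.0]
treated = [True, True, False, False]
pscore = [0.6, 0.7, 0.5, 0.1]

# Assumed weights: 1 for treated pairs, p / (1 - p) for controls.
weights = [1.0 if d else p / (1.0 - p) for d, p in zip(treated, pscore)]

unweighted = diff_in_means(x, treated)
weighted = diff_in_means(x, treated, weights)
print(unweighted, weighted)
```

Because the control that looks least like the treated group also has the lowest score, reweighting shrinks the gap toward zero in this toy example.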

Fig. 2 Difference in Means (Treatment - Control), Various Controls

Figure 2 shows that the set of EMU countries is not remotely comparable to the mean control in the full sample. The first and third graphs suggest that eurozone partner GDPs and GDP per-capita are substantially larger than the mean in the sample. The absolute value differences in the second, fourth, and fifth sub-figures reflect the fact that EMU trading partners are more diverse in terms of relative output and income than the average pair, but on much closer growth trajectories. Unsurprisingly, EMU members are much closer together geographically. They also have relatively smaller populations, are less likely to share a common (official) language, and are much more likely to have a common religion.

The absolute difference in means for the full sample is nearly always largest,Footnote 14 with the non-missing trade sample already improving the comparability between the two groups. However, in the full log sample, which is smaller but of a similar magnitude and composition to that used in Glick and Rose (2016), differences are still extremely far from zero. Because these differences are precisely estimated, many of the error bands are not visible, but they are strongly significant. The improvement when moving from the full sample to non-zero, non-missing trade is somewhat intuitive given that zero and missing flows are concentrated in small economies. Comparing the largest p-score samples, weighting improves comparability, but in some cases only marginally. Truncating the sample to include only the non-EMU observations that fall within the range of estimated p-scores of EMU pairs brings these differences much closer to zero, with weighting further closing the remaining gap for all but one control (the difference in GDP per capita). Though these small differences are in some cases still significant, the truncated sample provides much stronger comparability between the EMU sample and the comparison group used in estimation.

It is interesting to compare these model driven sample means with the ad hoc, “upper income” sample used in Rose (2017). For the control used to restrict the sample (GDP per capita) this actually outperforms the comparability achieved using propensity scores. The model in Eq. (4) works to minimize gaps along all of these variables, many of which substantially outperform the ad hoc measure, at the expense of fitting the per-capita GDP comparisons as closely. The final sample, which combines the ad hoc sample restriction with the logistic model propensity scores, does improve the fit (though not generally as well as the p-score truncation), but I caution that this actually further truncates the sample, as a non-trivial share of the upper-income countries were allocated scores of zero and thus drop out when the inverse-propensity score weights are included. Authors wishing to use such ad hoc sample reduction techniques should check that they improve the balance of treated/control observations, and consider whether a data driven approach might further improve it.

One can consider a reduction in sample using IPWRA estimates as a data-driven approach to the sample selection, providing the best possible comparability between treated and control, conditional on observable characteristics, while also improving comparability through the weighting procedure itself. Of course, such attempts at constructing a re-randomized treatment and control set ex-post are only as reliable as the observable characteristics used in the first-stage selection model. It is still possible that unobserved characteristics not captured in this process may bias estimates. In the doubly robust estimator used in Eq. (3), or estimates that rely on the propensity score model to only truncate the sample, this only creates a problem if they are also not captured by the rich set of origin-year, destination-year, and dyadic fixed effects used in the second stage gravity equation. While this is still potentially true, my results should improve upon the existing estimates of the EMU treatment effect in the aggregate trade data by providing a nearly balanced treatment and control group prior to estimating the marginal effect using conventional gravity measures.

4 Results: IPWRA Gravity Estimates of Trade

I now present results for Eqs. (1) and (2) on the samples described above, showing when possible the improvements made by weighting as in Eq. (3). Because I wish to understand which changes come from weighting and which come from sample selection, I include the OLS/PPML estimates without the IPWRA adjustment for each of these sub-samples. I begin with the OLS estimates of the log gravity specification in Table 2.

Table 2 Log Gravity Estimates

The estimate using the full sample, at 0.48, is larger than the 0.43 coefficient from Glick and Rose (2016).Footnote 15 This effect is essentially unchanged when running the same specification on the smaller sample that omits observations assigned a propensity score of exactly zero by the first stage model.Footnote 16 This is perhaps not too surprising given that these samples have fairly similar differences in means across treatment and control in Fig. 2. When applying the IPWRA estimator to weight the regression outcomes by probability of selection, this estimate is reduced to 0.40, but remains large and quite close to the Glick and Rose (2016) estimate for the EMU. Estimates on the smaller, truncated sample, using only observations within the range of propensity scores observed for the EMU ([0.001, 0.893]), decrease these effects substantially to 0.076. On the truncated sample, the weighted coefficient is close to, but roughly a standard error below, the unweighted estimate. The propensity score is still important for the unweighted sample given that it was used to select inclusion, but this selection appears to matter much more than the weighting. For the ad hoc upper income sample, I estimate a 0.13 EMU effect, quite similar to the equivalent estimate of 0.11 in Rose (2017). Estimates using this sample selection are nearly twice as large as those on the truncated propensity score sample. Adding the propensity score weights to this sample, to estimate the IPWRA treatment effect, reduces this effect to be indistinguishable from zero, though notably a large fraction of this sub-sample has zero estimated propensity scores and therefore drops out of this estimation. Repeating the estimation using the sample in column 7 without weighting provides an estimate of 0.03, so, as with the truncated sample, most of this difference appears to come from the selection effect of dropping observations with extremely poor fit in the first stage model, rather than from the weighting estimator itself.
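For readers who prefer percentage terms, a coefficient on a dummy variable in a log-linear gravity equation is conventionally converted to a percent change in trade as 100(exp(β) - 1). The short sketch below applies that standard conversion to the point estimates quoted above; the dictionary labels are shorthand introduced here, not names used in Table 2.

```python
import math


def pct_effect(beta):
    """Percent change in trade implied by a log-point dummy coefficient."""
    return 100.0 * (math.exp(beta) - 1.0)


# Point estimates quoted in the text (labels are illustrative shorthand).
coefficients = {
    "Glick and Rose (2016)": 0.43,
    "full sample, unweighted": 0.48,
    "full sample, IPWRA": 0.40,
    "truncated sample": 0.076,
    "upper income, unweighted": 0.13,
}

for label, beta in coefficients.items():
    print(f"{label}: {beta:.3f} log points -> {pct_effect(beta):.1f}%")
```

Under this conversion the 0.43 coefficient corresponds to roughly a 54% increase in trade, which matches the figure usually quoted for Glick and Rose (2016). For small coefficients the percent effect is close to 100 times the coefficient itself.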

Repeating this exercise for the PPML estimates of the gravity equation provides substantially different results, shown in Table 3. The first column uses the full sample, assuming zero trade flows for missing values where the macroeconomic controls (GDP and GDP per capita) are non-missing.Footnote 17 The resulting estimate of the EMU effect is 0.085, larger than the equivalent estimate of 0.052 in Larch et al. (2019), which again I attribute to using a different sample and not jointly controlling for other currency unions (instead dropping their observations). When reducing the sample to observations with positive propensity scores, this falls only slightly to 0.069, and again to 0.05 with the truncated propensity score sample. These PPML estimates are all quite close to those from the log-gravity estimation using the truncated sample.

Table 3 PPML Gravity Estimates

While my preferred truncated sample IPWRA estimates remain robust in this PPML estimation using heteroskedasticity-robust standard errors, Egger and Tarlea (2015) suggest using multi-way exporter, importer, and year clustering in gravity equation estimations. Consistent with the PPML results of Larch et al. (2019), who cite both Huber/White and multi-way clustered errors of this type, none of my results using the PPML estimator reaches even a \(p < 0.15\) threshold when implementing these standard errors. I report the more forgiving heteroskedasticity-robust standard errors here as they are those used in Glick and Rose (2016) and Rose (2017), with which I wish to draw the closest comparison. I caution that in addition to all PPML estimates, the log gravity estimates on the truncated sample also lose statistical significance using exporter-importer-year multi-way clustering.Footnote 18

An interesting result in Table 3 is the consistency of estimates across samples in columns 2-5. Once the observations with a zero estimated probability of treatment are dropped, the PPML estimator appears quite robust to sample selection, with all estimates extremely close to my preferred log-gravity specification in column 5 of Table 2. It is also notable that the ad hoc sample, which in general was a poorer fit than the truncated p-score sample in Fig. 2, has lower estimates. I interpret the results in this table as suggesting that the PPML estimate is considerably more robust to the selection issues that affect log-gravity specifications. While the full and ad hoc samples differ dramatically, the large reduction in sample when moving from the full propensity score sample to the truncated sample results in only small differences in my estimates. The differences between standard PPML and the IPWRA framework are also small. Researchers truncating in arbitrary ways should be careful to at least justify such choices based on the comparability of treatment and control, and consider whether they might be throwing away good control observations (while including some bad ones), as this exercise suggests.

4.1 Robustness to First Stage Specification

Above I present baseline results using the third model in Table 1. This is my preferred specification for the paper for two primary reasons. The first is that it maximizes first stage fit, as measured by pseudo-\(R^{2}\). The second, which matters for the motivation of this study, is that it retains enough data in the non-truncated sample to demonstrate that my log gravity results closely mirror the large EMU effects found in work such as Glick and Rose (2016).

A logical question would be whether or not the above results are highly sensitive to the choices made in this first stage model. To answer this I replicate my main propensity score results for each of the models in Table 1 to show how the implied trade effect differs across them. These are reported in Table 4 in the same order they are presented in Table 1 (the baseline estimates are thus the third and included for ease of comparison). To conserve space, I exclude the full sample estimates (which are by definition unchanged) as well as the ad hoc upper income sample selection. For the latter the unweighted sample estimates would be identical across specifications and the weighted estimates in column 7 of Tables 2 and 3 differ both due to weighting and due to further sample selection, making the source of variation difficult to compare across these groups.

Table 4 EMU Effect on Trade: First Stage Robustness

The first result to note in Table 4 is that log gravity estimates are much more sensitive to these first stage specifications than PPML estimates. This exercise should serve as a strong advertisement for the robustness of the PPML estimator. The second specification, using the same set of variables as my baseline but with only linear controls, keeps much more data on truncation than my other specifications. This is because the model fits much lower values for some euro members, and I keep the criteria described above for truncating based on the support of the treated observations. Since the minimum p-score for this model is 0.0001, a large number of relatively poorly fitting controls are now retained. As a result, these are the only results on the truncated sample that differ substantially from the baseline. The preferred truncated estimates in the PPML, on the other hand, are quite similar across all models, with weighting seeming to consistently pull the point estimate in the direction of the 5% EMU trade effect found in the baseline.

I report the overlap and differences in means for treated and control observations for each of these in Appendix A. However, it is worth devoting some space to exploring these for the model that includes European Union membership. Recall from Table 1 that while this model had lower fit, as measured by pseudo-\(R^{2}\), it had by far the highest mean probability for the treated sub-population. In Fig. 3 I report the k-densities of the estimated probability of treatment across the treatment and control groups using this sample. The plots are for the range of propensity scores among the EMU population, which for this specification is [0.011, 0.995]. This is an ideal first-stage overlap figure, as it demonstrates overlap across both groups while remaining clearly capable of discerning EMU members from non-members. In many ways this addresses the main concern that Rose (2001) raised in response to the Persson (2001) matching estimator.

Fig. 3 Overlap: K-density plots of treated and control propensity scores (EU first stage)

Turning to the differences in means, it is clear in Fig. 4 that this model brings the differences between these two groups quite close to zero, particularly with weighting and truncation, where only two of the ten controls shown have means that are significantly different from each other at the 5% level. Part of the reason for the improved fit is that this sample is limited to EU countries, with non-EMU members of the broader union sharing many characteristics with their eurozone counterparts. In some ways this model can be seen as a hybrid of simpler ad hoc measures and my data driven approach. By including a perfect predictor, like EU membership, the sample becomes highly constrained, but in a way that provides actual benefits in terms of improving the comparability between the treated and control groups. The propensity score model still works to improve this fit in two ways. The first is through weighting, as can be seen by the shifts toward zero from the unweighted versions (non-zero and truncated) when propensity score weights are added. The second is by removing poor comparisons through truncation, demonstrated in the shift toward zero from the non-zero p-score (unweighted) estimates to the p-score truncated (unweighted) group.

Fig. 4 Difference in Means (Treatment - Control), EU First Stage

5 Conclusions

Eurozone membership was not assigned by a researcher, but was the culmination of intense policy deliberation. While it may not be possible to fully rectify selection effects when estimating this policy’s effect on trade, propensity score methods offer a way of improving the credibility of estimators by taking such differences between EMU and non-EMU pairs into account. Log gravity equation estimates of the EMU treatment effect are extremely sensitive to sample changes. I argue that this sensitivity largely reflects improvements in the control sample, bringing their observable characteristics closer to those of EMU countries. Ad hoc measures do a good job fitting on a given statistic, as in my example using a GDP per capita cutoff to determine selection into the estimation sample, but cannot balance the potential trade-off between improving comparability along many dimensions. I argue that data driven propensity score approaches provide a better way of improving the comparability of these groups, while not relying on potentially arbitrary decision making by researchers. Use of the doubly robust IPWRA estimator provides a method for re-weighting the average treatment effects from gravity equations to further improve this fit, but in practice has small quantitative implications relative to the effects of truncation.

Results using samples where observable EMU and non-EMU characteristics are most comparable suggest a small and positive impact of the currency union on trade. This is roughly a 5% increase that is quite similar across my preferred log and PPML specifications. Limiting the sample to estimate these effects among European Union members might increase this slightly to nearly 7%. These estimates are in line with the survey of the literature by Polák (2019), and help explain why the large estimates of Glick and Rose (2016) are likely biased upwards. While Rose (2017) identifies the correct reason for differences in trade estimates of the euro, the evidence presented here suggests that his conclusion that more data should be better is likely mistaken.

More broadly, my results suggest that greater care should be taken in sample selection in trade and other macro-policy environments. Observational studies can leverage models of first stage treatment to ensure better balance between countries that are exposed to a particular policy and those that are not. While this does not completely bridge the gap in causal identification between observational studies and the experimental ideal, it provides an empirically driven step in the right direction. The PPML estimator is much more robust to changes in the sample, though it still reflects statistically meaningful shifts when moving from the full sample to those that drop the most extreme outliers and when using the IPWRA adjustment. This suggests that research using log approximations should be particularly careful of these sample selection issues.

In my preferred estimates the EMU effect is small, but still meaningful. Future work might seek to better model the selection process, complementing trade data with richer controls to better capture the geopolitical decision making process. Theoretical grounding in the literature on optimum currency areas stemming from Mundell (1961) may also prove a fruitful avenue for improving on the estimates presented here. Empirically, these methods may be useful in quantifying the role that the euro has played in trade between eurozone countries and other partners, as Martínez-Zarzoso (2019) explores for EMU trade with the CFA franc countries. Another potential application is exchange regime choice and currency networks, as studied in Bleaney and Tian (2012), to understand the role that such selection has in estimates of the effect of exchange rate regimes on other macroeconomic outcomes. Better understanding such additional channels through which membership in the EMU affects finance and trade will be important in getting a full picture of the economic benefits enjoyed by member states.