When Context Matters: Assessing Geographical Heterogeneity of Get-Out-The-Vote Treatment Effects Using a Population Based Field Experiment

The presence of heterogeneity in treatment effects can create problems for researchers employing a narrow experimental pool in their research. In particular it is often questioned whether the results of a particular experiment can be extrapolated outside the specific location of the study. In this article, we use a population-based field experiment in order to test the extent to which treatment effects for impersonal mobilisation techniques (direct mail and telephone) are sensitive to where they are carried out (geography) and the context of the election in which they were conducted. We find that on the whole it does not much matter where an experiment is conducted: the treatment effects are to all intents and purposes geographically uniform. This has important implications for the external validity of get-out-the-vote field studies more generally, especially where single locations are used. However, there is one important exception to this: experiments carried out in high turnout locations at high salience elections may show larger effects than those carried out in low turnout areas.


Introduction
Get-out-the-vote (GOTV) field experiments have a long and important history in political science, going back to Eldersveld's (1956) study and, before that, to Gosnell's (1926). More recently, Gerber, Green and colleagues (Green 2000a, b, 2001; Gerber et al. 2003; Green 2004; Green and Gerber 2008) have used randomised control trials to show that face-to-face mobilisation has a strong effect on voter turnout and is far more effective than less personal methods, such as telephoning and direct mail (see also McNulty 2005). In a short space of time the number of these experiments has increased dramatically, covering different populations (adults, young people, different ethnic groups); mobilisation methods (door-to-door, phone-banks, direct mail, leafleting, election-day mobilisation, robocalls, email, radio broadcasts, TV adverts, print media, and street signs); variations in delivery (timing, tone, quality); partisan and non-partisan interventions; and bilingual or multilingual modes of delivery (see Green and Gerber 2008). 1 Green et al. (2010: 3-4) note that in respect to direct mail alone "from 1999 through 2009, a total of 93 independent experiments were conducted, encompassing 127 treatments reported in 40 distinct studies". An increasingly important line of enquiry is the heterogeneity of treatment effects. A number of studies have explored the conditions under which treatment effects vary from population to population, from study to study and by treatment design (Imai and Strauss 2011; Arceneaux and Nickerson 2009). The presence of heterogeneity in treatment effects creates potential problems for researchers employing a narrow experimental pool in their research. This underlies a persistent critique of experimental studies: a lack of generalizability or external validity (Mutz 2011).
Whilst field experiments enjoy the advantage over laboratory experiments that the treatments are tested in realistic settings, it is often questioned whether the results of a particular experiment can be extrapolated outside the specific location and to a generalised situation (Druckman and Kam 2011). Mutz (2011) has argued that the traditional goal of internal validity need not be sacrificed in the search for external validity if researchers adopt population-based experimental designs. However, because of the difficulty in carrying out large-scale field experiments across large areas and over time, most GOTV studies have been focused on a single area at a single election (or a small group of geographically proximate locations) and on a single group of the electorate (notable exceptions include Green et al. 2003; Nickerson 2006; Bennion and Nickerson 2010).
Meta-studies potentially allow researchers to compare treatment effects across studies and draw inferences about generalised effects. However, the sheer variety of these kinds of experiments, encompassing variations in design and mobilisation methods as well as target population, militates against generalisation. A meta-analysis may suffer from a high degree of heterogeneity in various elements of design (Crombie and Davies 2009; DerSimonian and Laird 1986), a problem for which there is no easy fix. When comparing studies, it is difficult (if not impossible) to separate variation caused by the use of different mobilisation methods from variation caused by unit or geographical heterogeneity. The challenge is to design a study that can make population inferences in the presence of heterogeneity.
Whilst a narrow experimental pool does not necessarily threaten causal inference, if heterogeneity in treatment effects does exist, then to achieve valid causal inference it is necessary to (a) sample some variation on the key moderating variables; and (b) allow the treatment effect to vary, for example by including the interaction of these moderating variables with the treatment effect (Druckman and Kam 2011). But what are these key moderating variables? They could be individual or unit characteristics, such as political sophistication or demographic characteristics. Here we focus on geographical or contextual factors. Elections are highly heterogeneous across space and it is likely that treatment effects vary across different types of areas, for example those with high prevailing levels of turnout compared to those with lower levels, or marginal as opposed to safe seats. Whilst single location studies have the potential for examining variability in treatment effects across different categories of elector, such as high versus low propensity voters (e.g. Niven 2001), only studies with variance on all the relevant dimensions of electoral context are capable of identifying the potentially crucial role of local electoral context.
Given that we may theoretically expect heterogeneity across different political contexts within a single country, then ideally we need a nationally representative sample of voters across a sample of electoral districts and across different elections. In this article, unlike any other previous GOTV studies of which we are aware, we use such a design to test the extent to which treatment effects for impersonal mobilisation techniques (direct mail and telephone) are sensitive to where they are carried out (geography) and the context of the election in which they were conducted. One of the considerable advantages of a nationally representative multifactorial design is that it renders possible the examination of the heterogeneity of treatment effects across space. Of course there are other dimensions of heterogeneity which we cannot capture with a nationally representative population-based experiment including variation by country, over time (beyond the two elections sampled) and for different types of intervention. However, because the study is based on a nationally representative sample of electors drawn from a random sample of electoral districts (wards) we are able to explicitly test whether the effectiveness of an impersonal nonpartisan intervention varies across different political contexts measured on a number of different dimensions. These are selected because of their potential theoretical relationship with treatment effects and are described in the following section.

Underlying Level of Turnout
At the individual level, it has been noted that electors with a high underlying propensity to vote are less likely to be swayed by a leaflet or phone call (Hillygus 2005). Conversely, those with a low underlying propensity to vote may be difficult to persuade to change their mind (Niven 2001). Integrating these ideas, Arceneaux and Nickerson (2009) predict a curvilinear relationship between the individual-level underlying propensity to vote (or level of interest) and the efficacy of intervention, with the point of optimum efficacy depending on the salience of the election. Thus, in low saliency elections it is relatively high propensity voters who are more likely to be on the cusp of their personal voting (or indifference) threshold. Extending this to the aggregate (constituency) level, we might expect that, in medium or high salience elections, areas with middling levels of turnout are more likely to be productive for campaigners than those with very high or very low levels. In areas with very high levels of turnout, the average propensity to vote is likely to be exceptionally high and many voters would vote regardless of the intervention, except in low salience elections when more voters may be close to their voting threshold. By contrast, in very low turnout areas it is likely that electors, on average, are less susceptible to mobilization. In these areas, the average latent propensity to vote is lower and, given that the treatment is likely to raise this propensity by only a small amount, the proportion raised above a critical threshold is likely to be low, except when the election salience is very high. In the words of those who advocate the curvilinear argument, "GOTV efforts are likely to mobilize voters who fall in the middle of the voting propensity spectrum" (Arceneaux and Nickerson 2009: 3).
By extension GOTV campaigns may be likely to mobilise those living in areas of mid-level turnout, though this may vary according to the saliency of the election (for example, contrasting a European and General Election as we are able to do here). In other words, we extend the logic of curvilinear contingent theory of turnout of Arceneaux and Nickerson (2009) to apply to geographical electoral districts, and more specifically the relationship between mobilization efficacy, the underlying or prevailing level of turnout and election salience.
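The threshold logic above can be illustrated with a small numerical sketch. It is not part of the study: it assumes, purely for illustration, that latent propensity to vote in an area is normally distributed, that an elector votes when propensity exceeds a threshold that is higher in low salience elections, and that the treatment adds a small constant boost. The function name, area means, thresholds and boost are all hypothetical.

```python
from statistics import NormalDist

# Hypothetical area types, described by the mean latent propensity to vote.
AREAS = [("low turnout", -1.5), ("mid turnout", 0.0), ("high turnout", 1.5)]


def mobilised_share(mean_propensity, threshold, boost, sd=1.0):
    """Share of electors pushed over the voting threshold by a small boost.

    Assumes latent propensity ~ Normal(mean_propensity, sd); an elector votes
    when propensity exceeds `threshold`; treatment adds `boost` to everyone.
    """
    dist = NormalDist(mu=mean_propensity, sigma=sd)
    turnout_control = 1.0 - dist.cdf(threshold)
    turnout_treated = 1.0 - dist.cdf(threshold - boost)
    return turnout_treated - turnout_control


# A higher threshold stands in for a lower salience election (an assumption).
for label, threshold in [("high salience", 0.0), ("low salience", 1.0)]:
    effects = {area: round(mobilised_share(mean, threshold, boost=0.1), 3)
               for area, mean in AREAS}
    print(label, effects)
```

Under these assumptions the simulated effect peaks in mid-turnout areas when the threshold is low, and shifts towards high-turnout areas when the threshold is raised, which is the pattern the curvilinear contingent argument predicts.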

Electoral Competitiveness
The competitiveness of the electoral contest has a bearing on where a party or candidate campaigns. Parties target campaign resources where the contest is close, as it is in these marginal seats that party activism is likely to have the highest potential impact. A large body of literature shows that local party campaigns are effective at mobilising party supporters (Denver and Hands 1997; Johnston and Pattie 2006; Fieldhouse and Cutts 2008; Cutts 2006). Any non-partisan GOTV campaign must, therefore, vie with other campaigns for the attention of voters. Where party campaigns are intense, voters who are most likely to be persuadable by mobilisation techniques may be mobilised by parties regardless of the intervention being studied. In other words, the more marginal the seat, the more intense the party activism, and the greater the likelihood that the experimental GOTV treatment will be "drowned out" by other interventions, since the control group will be likely to receive a large amount of election information that has nothing to do with the experiment.
There are also alternative reasons why the electoral competitiveness of the seat could drown out non-partisan GOTV effects. Those electors living in seats where the contest is highly competitive are likely to be aware of the seat status, and as a consequence, more likely to have heightened levels of political awareness and have greater local political knowledge. Of course, this in itself may be a function of intensive party campaigning, but also other factors such as the media (old and new) and more politicised social networks. The decision about whether to participate or not is also more likely to be made in the knowledge that, unlike many electoral contests in other places, it could have a bearing on the final outcome.
In this study there is a range of geographical areas which make it possible to explore this relationship. Here we use a marginality variable (identifying those seats where the margin is less than 10 %), which not only captures the intensity of campaigns carried out by political parties but also reflects the higher levels of political knowledge and interest among those electors living in seats where the electoral contest is more competitive. Margin also has an additional advantage over the use of a campaign measure such as party campaign spending, insofar as it is easier to replicate in other contexts.

Party Control
As well as differing in respect to the prevailing level of turnout and the level of competitiveness, parliamentary constituencies vary in a number of other politically relevant ways that may affect the efficacy of GOTV treatments. In general, such factors reflect the character of the constituency in relation to the prevailing political cleavages of the nation (Agnew 1987). The most important of these include the socio-economic and demographic profile of the seat, its local political culture and history, and the personal profile and support of local candidates. Given that, by their very nature, these are all correlated with the popularity of each of the major political parties, party incumbency provides a useful proxy for these sources of variation. Thus, for example, the social profile of constituencies (whether they are predominantly working class or middle class) is highly correlated with the identity of the incumbent party. Moreover, in any given election the nature of the campaign may be shaped by whether the defending incumbent is from the governing party or the opposition. For example, for any given level of competitiveness, because of the relative unpopularity of the government at the time of the 2010 general election, sitting Labour MPs were more likely to be under threat of losing their seat than those of opposition parties. In order to capture these differences and to test for potential biases among experiments carried out exclusively in government controlled or opposition controlled seats, we split the sample according to whether the incumbent MP was from the Labour Party (the governing party going into both elections) or an opposition party.

The Electoral Context
Electoral turnout varies according to the electoral context (Marsh 2002; Franklin 2004; Fieldhouse et al. 2007). As noted above, Arceneaux and Nickerson (2009) argue that the point of optimum efficacy of a treatment will depend on the salience of the election. Although plausible, there is limited hard evidence that the salience of the election is systematically related to the size of treatment effects across experiments. Green et al. (2010), for example, find no significant variation in treatment effects by salience of election across 41 experiments carried out in the US. In this study we are able to compare treatment effects for a second order (European) election with a first order (general) election.
Following from the above, we test the following null hypotheses:
H0(1): Treatment effects do not vary significantly between electoral wards (sampling units);
H0(2): Treatment effects do not vary with the prevailing level of turnout in the ward;
H0(3): Treatment effects do not vary with the marginality/competitiveness of the electoral district (constituency);
H0(4): Treatment effects do not vary with the party of the defending candidate;
H0(5): Treatment effects do not vary with the type of the election (general versus European).

The Study
The study was designed to examine the effect of non-partisan mobilisation, through telephone canvassing and direct mail, on voter turnout in the European elections in England on June 4th 2009 and the UK General Election on May 6th 2010 (see Fieldhouse et al. 2013). In a multistage design, twenty-seven local authority districts were randomly sampled and three electoral wards were randomly selected from each sampled district. The sample of wards provided a close match to England as a whole on a range of social and political characteristics. 2 Using a database based on electoral registers and telephone records, 40,000 individuals were sampled from these eighty-one wards. By design all sampled wards contain individuals from treatment and control groups in randomly distributed proportions. The sample was restricted to one random person per household to avoid clustering, and to ensure households did not receive double treatments. This sample was further stratified according to telephone accessibility and therefore included two separate sub-samples made up of 26,500 telephone accessible electors (any record with a valid landline or mobile) and 13,500 telephone inaccessible electors (anyone with no telephone contact information). Each sampled telephone accessible individual was randomly assigned to one of three treatment groups (telephone, mail, or mail and telephone) and each telephone inaccessible individual to the mail or control group. Because of the different treatment combinations available and their different effectiveness, in the following analyses we split by (or control for) telephone accessibility. After the randomisation was complete, any electors in the sample (treatment or control groups) that were not registered or not eligible to vote were removed, leaving a sample of 25,293 in 2009. This reduction reflects redundancy in the sampling frame, particularly arising from non-registration (since we include only registered electors in the analysis).
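The sampling and assignment steps described above can be sketched in code. This is a minimal illustration under stated assumptions, not the procedure actually used: the sampling frame, field names and function names are hypothetical, arms are drawn with equal probability (the allocation proportions are not given here), and a control arm is assumed for telephone accessible electors as well as inaccessible ones.

```python
import random


def draw_wards(districts, rng, n_districts=27, wards_per_district=3):
    """Two-stage sample: random districts, then random wards within each."""
    sampled = rng.sample(sorted(districts), n_districts)
    return [w for d in sampled for w in rng.sample(districts[d], wards_per_district)]


def one_per_household(register, rng):
    """Keep one randomly chosen elector per household to avoid clustering."""
    households = {}
    for elector in register:
        households.setdefault(elector["household"], []).append(elector)
    return [rng.choice(members) for members in households.values()]


def assign_arm(elector, rng):
    """Random assignment; the available arms depend on telephone accessibility."""
    if elector["has_phone"]:
        arms = ["telephone", "mail", "mail+telephone", "control"]  # control assumed
    else:
        arms = ["mail", "control"]
    return rng.choice(arms)  # equal probabilities are an assumption


rng = random.Random(2009)

# Hypothetical frame: 60 districts with 8 wards each.
districts = {f"district_{i}": [f"ward_{i}_{j}" for j in range(8)] for i in range(60)}
wards = draw_wards(districts, rng)

# Hypothetical register: two electors per household, spread over sampled wards.
register = [{"id": n, "household": n // 2,
             "ward": wards[(n // 2) % len(wards)],
             "has_phone": rng.random() < 0.66}
            for n in range(2000)]
sample = one_per_household(register, rng)
assignment = {e["id"]: assign_arm(e, rng) for e in sample}
print(len(wards), "wards;", len(sample), "electors, one per household")
```

Because each elector is assigned independently of ward, every ward ends up with a random mix of treatment and control cases, which is the property the between-ward comparisons rely on.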
At the General Election of 2010, we canvassed the sample again, but with the difference that we randomly allocated a portion of the 2009 control group to a new mail and telephone treatment group. Members of the three 2009 treatment groups were assigned to receive a repeat dose of the same treatment in 2010. A proportion of the sample that was included in 2009 had left the electoral register in 2010 or had changed name/address details and was therefore excluded, leaving a sample of 21,984 in 2010. Further details of the study design are reported in Fieldhouse et al. (2013).
The intervention consisted of a GOTV campaign called 'Your Vote', which encouraged recipients of the treatment to vote for reasons of civic duty and expressive motivation. Telephone recipients received a brief phone call from a team of social science graduate students. Non-respondents were called back on at least five occasions at different times of the day to maximise the overall contact rate. The mail group received a personalised letter printed in colour with an almost identical message (tailored for the written word).
The total number of registered electors in the sample was 25,293 in 2009 and 21,984 in 2010. Of those in the telephone treatment group, 58 % were successfully contacted in 2009 and 78 % in 2010 (Fieldhouse et al. 2013). Official records of voter turnout were collected after both elections to verify the turnout of treatment and control groups. In 2009, 17 % of electors in our sample voted by post, and 20 % did so in 2010. As a result of electoral law, there is no public record that indicates whether, individually, these people cast their vote, and therefore postal voters are treated as missing data and excluded from all analyses. Moreover, applications for postal votes could not be influenced by the treatment, as the closing date for applications (11 days before polling day) had passed when the treatments were applied.

Results
Before examining whether there was any significant variation in treatment effects between areas and across elections, we start by summarising the estimated treatment effects for the GOTV experiment overall. In this paper we focus on the overall intent-to-treat effect (ITT) as defined by the comparison of the sample assigned to any treatment group and the control group, since this provides the largest available sample, and therefore the best test of heterogeneity between geographic areas. The ITT simply compares the treatment and control group on the basis of assignment. It gives a conservative estimate of the average treatment effects, as it does not adjust for non-contact. This approach is preferred here as contact rates were not available for all treatment types. 3 Table 1 shows the estimated ITT for the overall treatment for 2009 and 2010 split by telephone accessibility.
In both elections, the overall treatment effect was positive but statistically insignificant for the telephone inaccessible treatment group. In 2010 this largely reflects the lesser effectiveness of the mail treatment effect compared to the telephone or combination effect, but in 2009 it also reflects a weaker mail effect in the telephone inaccessible group (see Appendix Table 7). Amongst the telephone accessible treatment group, the overall treatment effect was significant at both elections. The largest effects were for those receiving the combined treatment and, in 2010, for the telephone treatment.
Although Table 1 shows a larger effect in 2010 than in 2009, we cannot simply compare the overall treatment effect at the two elections. To make this comparison we must focus on the combination treatment (rather than the overall ITT), because the mail and telephone separate treatments are not strictly comparable between elections, as the 2009 mail and telephone groups were re-contacted in 2010. Table 2 therefore compares the effectiveness of the combination treatment across two different elections. The comparison of 2009 and 2010 gives an excellent test of the relevance of electoral context when comparing experiments, because the combination treatment was identical at both elections and carried out in exactly the same geographic locations.
The 2010 election was a first order election with a high level of salience, and the resultant level of turnout was higher than in 2009 by a factor of almost two (nationally, turnout was 65 % in 2010 compared to 34 % in 2009). Whilst there is reason to suppose the relationship between salience and the efficacy of GOTV treatments will depend on individual propensities to vote (Arceneaux and Nickerson 2009), overall the low level of interest in 2009 and the disillusionment with party politics prevalent at the time appear to have limited the effectiveness of the 2009 treatment relative to 2010. However, the t-statistic for the difference in treatment effects is not significant and therefore we cannot reject H0(5). In other words, there is no firm evidence that the treatment effect varies significantly between elections, although the direction and magnitude of the effects do indicate that the treatment may have been more effective at the 2010 high salience election.
Table notes: ITT is equal to the percentage point difference in turnout between those assigned to any treatment and the control group. Standard errors = √(pq/n). P-values are derived from a standard comparison of proportions z-test. Tests are one-tailed, as effects are hypothesised to be positive. * Significant at 0.05 (one-tailed test).
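For readers who wish to reproduce this kind of calculation, the following is a minimal sketch of an ITT estimate with a one-tailed comparison of proportions z-test. The counts are invented for illustration and are not the study's data. The function name is hypothetical, and the pooled standard error used for the test statistic is one common convention that we assume here; the source specifies only √(pq/n).

```python
from math import sqrt
from statistics import NormalDist


def itt_estimate(voted_t, n_t, voted_c, n_c):
    """ITT as a difference in turnout proportions, with a one-tailed z-test."""
    p_t, p_c = voted_t / n_t, voted_c / n_c
    effect = p_t - p_c
    # Unpooled standard error of the difference, built from sqrt(pq/n) terms.
    se = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    # Pooled standard error under the null of no effect (assumed convention).
    p_pool = (voted_t + voted_c) / (n_t + n_c)
    se_null = sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
    z = effect / se_null
    p_value = 1 - NormalDist().cdf(z)  # one-tailed: effect hypothesised positive
    return effect, se, z, p_value


# Hypothetical counts: 4,200 of 10,000 assigned to treatment voted,
# against 2,000 of 5,000 in the control group.
effect, se, z, p_value = itt_estimate(4200, 10000, 2000, 5000)
print(f"ITT = {100 * effect:.1f} points, SE = {100 * se:.2f}, z = {z:.2f}, p = {p_value:.4f}")
```

Because assignment rather than contact defines the groups, non-contacted electors stay in the treatment group, which is why the ITT is a conservative estimate of the effect of actually receiving the treatment.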

Comparing Between Areas and Within Elections
Figure 1 shows, for each ward in 2010, the relationship between the size of the overall treatment effect for telephone accessible electors and the prevailing level of turnout in the ward, as measured by the turnout of the control group in the ward at the previous election. 4 Although each ward estimate is based on small numbers, there appears to be only a very weak relationship between the underlying level of turnout and the treatment effect. In 2009 this relationship is slightly negative and in 2010 slightly positive, but the R-squared at both elections is less than .01. This provides prima facie evidence that there is no strong or consistent relationship between the prevailing level of turnout and the efficacy of the treatment within a single election.

Modelling Variation in Treatment Effects
Above we showed that there is only a weak relationship between the local treatment effect and the underlying level of turnout. However, although at the aggregate level this was a large-N experiment, when disaggregated to ward level the sampling error around each individual ward estimate is quite large. In order to test the overall significance of variation in the treatment effect between wards we use multilevel (hierarchical) models, where vote is the dependent variable and the independent variable is the treatment assignment (hence we are estimating the ITT). The hierarchical approach allows us to test for variation in the level of turnout (the intercept), the treatment effect (the slope) and, more particularly, the covariance of the two. The covariance tells us whether the size of the treatment effect (the slope) is correlated with the local level of turnout (the intercept). It also allows us to test whether across the overall sample these random effects are statistically significant. The hierarchical logistic models are fitted using MLwiN 3.2, with the estimates for the model derived using a Markov Chain Monte Carlo (MCMC) estimation procedure (Browne et al. 2005). Snijders and Bosker (2011) state that it is common to estimate hierarchical models using estimation methods based on marginal quasi-likelihood (MQL) or penalized (predictive) quasi-likelihood (PQL) procedures. However, when fitting binary response models, both of these quasi-likelihood estimators can lead to an underestimation of the random effects, particularly when they are large and there are small numbers of observations within higher-level units, as is the case with our sample (Browne et al. 2005; Goldstein and Rasbash 1996; Rodriguez and Goldman 1995). Recent evidence also suggests that the Bayesian estimation procedure (MCMC method with diffuse priors) is less biased than either of the quasi-likelihood methods for binary response models (Browne et al. 2005). Moreover, if there is any higher-level variation we want to be sure we find it, so it is imperative to use the MCMC approach.
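The random slopes specification described above can be written out explicitly. The notation below is ours and is offered only as a sketch of a standard two-level logistic model: $y_{ij}$ is the turnout of elector $i$ in ward $j$, $T_{ij}$ is the treatment assignment indicator, and further covariates can be added to the fixed part in the usual way.

```latex
\begin{aligned}
y_{ij} &\sim \mathrm{Bernoulli}(\pi_{ij}), \\
\operatorname{logit}(\pi_{ij}) &= \beta_0 + \beta_1 T_{ij} + u_{0j} + u_{1j} T_{ij}, \\
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix}
&\sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} \sigma^2_{u0} & \sigma_{u01} \\ \sigma_{u01} & \sigma^2_{u1} \end{pmatrix} \right).
\end{aligned}
```

In this notation $\sigma^2_{u0}$ is the between-ward variance in turnout (the intercept variance), $\sigma^2_{u1}$ is the between-ward variance in the treatment effect (the slope variance), and $\sigma_{u01}$ is their covariance. The random intercepts model is the special case with $\sigma^2_{u1} = \sigma_{u01} = 0$.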
Here, we used MLwiN software to estimate the starting values using first-order PQL, then 5,000 runs to derive the desired proposal distribution (discarded after convergence as the "burn in" period), followed by 50,000 simulated random draws to obtain the final estimates. We use the Metropolis-Hastings algorithm and the default diffuse gamma priors for variance parameters. The estimates in Table 3 are based on the mean of the simulated values, and the significance is derived from the standard error, which is the standard deviation of the converged distribution. These estimates correspond to the traditional maximum likelihood estimate and its standard error. Table 3 shows the summaries of model results for the overall treatment effect, comparing any person allocated to any of the three treatment groups with the overall control group, regardless of whether they have telephone information or not. Telephone accessibility is controlled for with a covariate in the model. The overall treatment effects were statistically significant at the 5 % level in both elections, as represented by the overall effect size. Looking at the random effects, turnout varies significantly by ward at both elections, as represented by the intercept variance. This is unsurprising, and simply reflects geographical variation in the underlying level of turnout. What is more important is that there is no significant variance in the slope (the treatment effect) in either 2009 or 2010. There is also no significant covariance between the intercept and the slope, suggesting no systematic relationship between the local treatment effect and the level of turnout. The analyses were repeated for each of the separate experiments at both elections. In no instance across the two elections and across any of the methods of mobilisation, either alone or in combination, was there significant variance in the slope (the treatment effect), or covariance of slope and intercept (the tendency to vary according to the turnout rate). 5 We therefore cannot reject H0(1) or H0(2).
It is possible to compare the relative effectiveness of different models (in our case the baseline random intercepts model against the random slopes model) and evaluate their goodness of fit by using the Deviance Information Criterion (DIC) (Spiegelhalter et al. 2002; van der Linde 2005). The DIC can be calculated from an MCMC run by calculating the value of the deviance at each iteration, and the deviance at the expected value of the unknown parameters. The DIC statistic also accounts for the number of parameters in the model, with a difference of less than 2 between models suggesting no difference, while a difference of 10 or above indicates an improvement in the goodness of fit (Burnham and Anderson 2002). A comparison of the DIC with random slopes and without (random intercept only) suggests there was no difference between the models for any of the treatments at either election (see Appendix Table 10 for further details). In other words, there was no improvement in model fit by relaxing the assumption that treatment effects are equal across geographical areas.
Table notes: Telephone accessibility included as a control. * Significant at P ≤ 0.05.
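The DIC calculation described above can be sketched as follows, using the standard definition in which the effective number of parameters is the mean deviance minus the deviance at the posterior mean. The deviance values are invented for illustration, the function names are hypothetical, and the label for differences between 2 and 10 is our own shorthand rather than a rule taken from the sources cited.

```python
from statistics import fmean


def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, pD) from the deviance saved at each MCMC iteration."""
    d_bar = fmean(deviance_draws)             # posterior mean deviance
    p_d = d_bar - deviance_at_posterior_mean  # effective number of parameters
    return d_bar + p_d, p_d


def compare(dic_a, dic_b):
    """Apply the rule of thumb quoted in the text to two DIC values."""
    diff = abs(dic_a - dic_b)
    if diff < 2:
        return "no difference"
    if diff >= 10:
        return "improvement for the lower-DIC model"
    return "inconclusive"  # our label for the middle range


# Hypothetical deviance traces for two competing models.
dic_intercepts, pd_intercepts = dic([100.0, 102.0, 104.0], 99.0)
dic_slopes, pd_slopes = dic([99.0, 101.0, 103.0], 96.5)
print(dic_intercepts, dic_slopes, compare(dic_intercepts, dic_slopes))
```

Adding random slopes raises the effective number of parameters, so the richer model is preferred only if its mean deviance falls by more than that penalty.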

Sources of Variation
The multilevel models allowed us to test for overall variation in the treatment effects and whether it varies with the overall level of turnout. We found no evidence of either. However, it may be possible that there is some variation along the specific dimensions discussed above (electoral competitiveness and party control). As noted by Druckman and Kam (2011), where there is a theoretical expectation of heterogeneity in treatment effects we need sufficient variation in the key moderators (in our case political context), which is achieved through the sampling of 81 geographical locations. However, for valid causal inference these moderators must be interacted with the treatment. We test this by fitting fixed effect logit models with interactions between treatment effects and indicators for each of the relevant moderators. More specifically, we examine whether the impact of the intervention on turnout varies with electoral competitiveness of the seat (marginality), party control of the seat (Labour incumbency), and prior turnout (high, medium and low). Whilst there are some potential problems in using models containing covariates to adjust for imbalance (Bowers 2011), the model-based approach provides an excellent approximation of randomisation-based differences of means (Green 2009). 6 Moreover, there is no evidence of such imbalance in our sample, and model estimated average treatment effects are almost identical to unadjusted effects (see Fieldhouse et al. 2013). The purpose of the models presented here is not to adjust for covariate imbalance or improve the estimation of the ITT per se, but to estimate the co-variation of the treatment effect and the contextual moderators defined above. 7 As a check on the model based results, we also stratified the sample according to the key contextual variables and calculated simple unadjusted treatment effects for the relevant groups. 8 These results are discussed further below.
6 There has been much scholarly debate about the use of multiple regression to analyse experimental data. The main argument is that the introduction of assumptions associated with multiple regression is not justified by randomization and that the difference in means is the most appropriate estimator (Freedman 2005). Green (2009) provides a robust defence for the use of multiple regression in experimental analysis. Green (2009) uses a number of hypothetical examples and a voter mobilisation mail experiment to show that the discrepancy between the average multiple regression estimate and the true average treatment effect is negligible both in substantive terms and in relation to the standard error. In summary, multiple regression provides accurate estimates and standard errors, and this is the case even when the sample size is relatively small (Green 2009).
7 Green and Kern (2012) do, however, claim that some obstacles exist, including the possibility of specification error, multicollinearity when a large number of interaction terms are used, and data-dredging, where researchers search for treatment-covariate interactions until they discover 'interesting' heterogeneity for some subsets of experimental units. Here we use the multiple regression method (inclusion of covariate and treatment-covariate interaction) as a method for estimating treatment effects and argue, like Green (2009), that it is effectively equivalent to the traditional way of calculating the difference in means (splitting the sample). Our models carefully adhere to the stated assumptions. We explicitly test for specification error (using the linktest command in Stata 12) and find no evidence of this in our models (the _hatsq term is insignificant; for instance, in the incumbency model it has a P value of 0.28). We also find no evidence of serious multicollinearity. Our models contain only one interaction, so the concerns raised by Green and Kern (2012) about multiple interactions in the model do not apply in this case. Finally, the relevance of electoral competitiveness, underlying turnout and party control to 2009/10 turnout is well documented, not just here but in the wider discipline, and these moderators are selected for theoretical reasons.
Table 4 shows the overall treatment effect, the coefficients for two of the key contextual variables (marginality and incumbency) and the interaction between treatment effects and the contextual variable on turnout in the 2010 General Election. 9 Looking first at marginality, the overall treatment effect was statistically significant at the 5 % level. As expected, the 'margin' main effect was significant. Those individuals living in the most competitive seats were more likely to vote than electors living in much safer seats. However, there was no evidence that the treatment effects varied by the marginality of the seat. This was confirmed by splitting the sample into marginal and non-marginal wards and estimating treatment effects for the separate sub-groups (see Appendix, Tables 11, 12, 13). For both telephone accessible and inaccessible electors, although treatment effects were larger (and only significant) for non-marginal seats, the confidence intervals overlap, suggesting the treatment effects do not differ significantly. Similarly, we find no evidence that treatment effects vary by party control. People living in seats where there is a Labour incumbent were less likely to turn out, which is hardly surprising given the socioeconomic characteristics of many of these constituencies and the electoral context (with Labour as the governing party losing support). As a consequence, party supporters in these areas where Labour were strong may have been less inclined to participate. However, this did not have any bearing on the efficacy of the treatment, and there is no significant interaction with the treatment effect. Again, this is confirmed by the split sample analysis.
As with marginality, for the telephone accessible sample the treatment effect was statistically significant in one group (non-Labour incumbent seats) but not the other (Labour seats), yet the two estimates did not differ significantly from each other. Given these findings, we cannot reject H0(3) and H0(4).
8 There are alternative ways of estimating heterogeneity of treatment effects based on Bayesian statistical decision theory (e.g. Imai and Strauss 2011).
9 We also tested the effects of party spending using both a dichotomous variable (high spending versus low spending) and an overall spending measure obtained from the electoral returns of the three main parties during the 2010 official election campaign period. We found that neither measure of spending had a significant effect, reflecting the lack of variation in the spending variable.
Table 5 shows whether the treatment effect is related to prevailing turnout, with the sample divided according to whether the overall level of turnout in the area is high, medium or low (allowing for a curvilinear relationship). We used previous local election turnout for the 2009 model (as defined in Fig. 1) and prior turnout in the 2009 European elections (from our control group sample) in the 2010 model. Because the 2009 election was a second-order low-salience election and the 2010 election was a first-order/high-salience election, the underlying turnout rates were defined in relative terms, with three equal-sized categories at each election. 10 Unsurprisingly, in both 2009 and 2010, individuals living in high and medium turnout areas were significantly more likely to vote than those living in low turnout areas. More important were the findings for the interaction between the treatment intervention and the local prevailing level of turnout. In 2010 (but not 2009) the overall treatment had a significantly greater impact in high turnout areas.
The split sample analysis (for the telephone accessible sample) also shows a larger effect in high turnout areas, though the confidence intervals do overlap (see Appendix Tables 11, 12). 11 The greater efficacy of the intervention in high turnout areas at the high salience general election (where overall turnout was 65%) is consistent with an individual-level phenomenon of maximum treatment effects for high propensity voters (e.g. Green 2004). By contrast, there is little support for the aggregate-level equivalent of the (contingent) curvilinear theory (cf. Arceneaux and Nickerson 2009), which would predict the largest treatment effects in high turnout areas in 2009 (a mid-salience election where the average turnout is around 50% in high turnout wards) or in mid-turnout areas at the high salience 2010 election (again, where average turnout is around 50%). However, it should be remembered that we are testing an aggregate-level theory concerning the underlying level of turnout in the area, so we are not making any claim about the veracity of the individual-level curvilinear theory, only that it does not appear to apply at the aggregate level in the way hypothesized.
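The split sample comparisons described above can be illustrated with a minimal sketch. It computes an unadjusted treatment effect (difference in turnout proportions) with a normal-approximation confidence interval for two subgroups, and then tests the difference between the two subgroup effects directly. All counts are hypothetical and chosen only for illustration; they are not taken from the study, and the function names are assumptions of this sketch. A direct test of the difference is a common complement to inspecting interval overlap, which on its own is a conservative check.

```python
import math


def itt_with_ci(treated_voted, treated_n, control_voted, control_n, z=1.96):
    """Unadjusted treatment effect (difference in turnout proportions),
    its standard error, and a normal-approximation 95% interval."""
    p_t = treated_voted / treated_n
    p_c = control_voted / control_n
    itt = p_t - p_c
    se = math.sqrt(p_t * (1 - p_t) / treated_n + p_c * (1 - p_c) / control_n)
    return itt, se, (itt - z * se, itt + z * se)


def subgroup_difference(itt_a, se_a, itt_b, se_b):
    """z-test for the difference between two independent subgroup effects."""
    diff = itt_a - itt_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    z_stat = diff / se_diff
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))  # two-sided p value
    return diff, z_stat, p_value


# Hypothetical counts (NOT from the study): voters / sample size per arm.
itt_high, se_high, ci_high = itt_with_ci(720, 1000, 670, 1000)  # high turnout areas
itt_low, se_low, ci_low = itt_with_ci(520, 1000, 500, 1000)     # low turnout areas

diff, z_stat, p_value = subgroup_difference(itt_high, se_high, itt_low, se_low)

print(f"high turnout: ITT={itt_high:.3f}, 95% CI=({ci_high[0]:.3f}, {ci_high[1]:.3f})")
print(f"low turnout:  ITT={itt_low:.3f}, 95% CI=({ci_low[0]:.3f}, {ci_low[1]:.3f})")
print(f"difference in effects: {diff:.3f}, z={z_stat:.2f}, p={p_value:.3f}")
```

With these made-up counts the effect is distinguishable from zero in one subgroup but not in the other, while the difference between the two effects is itself not significant, which is the pattern reported for several of the subgroup comparisons here.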
Overall, there was some limited evidence that the treatment effects varied with the prevailing level of turnout in the area, with the treatments being very slightly more effective where turnout was already high in a high salience election.
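The interaction models discussed in this section can be summarised in a single equation. The notation is an illustrative assumption rather than the specification reported in the tables: Y_i is whether elector i voted, T_i indicates assignment to treatment, M_i is a binary contextual moderator for the elector's area (for example a marginal seat or a Labour incumbent), and any further covariates are omitted.

```latex
\begin{align*}
\operatorname{logit}\,\Pr(Y_i = 1) &= \beta_0 + \beta_1 T_i + \beta_2 M_i + \beta_3\,(T_i \times M_i) \\
\text{treatment effect (log-odds) when } M_i = 0 &: \quad \beta_1 \\
\text{treatment effect (log-odds) when } M_i = 1 &: \quad \beta_1 + \beta_3 \\
H_0 &: \quad \beta_3 = 0
\end{align*}
```

Under this sketch, failing to reject the null hypothesis on the interaction coefficient corresponds to finding no evidence that the treatment effect differs by context. For the three-category turnout measure, M_i would be replaced by two indicators (for example medium and high, with low as the reference), each with its own interaction term.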

Conclusions
The nationally representative sample allowed us to explore geographical variations in the effect of the treatment across two very different elections. This multi-factorial design not only allowed us to examine the heterogeneity of treatment effects but also to make comparisons between the treatments as applied to different sections of the population. We examined one theoretically important source of potential variability, namely heterogeneity across space. More specifically, we asked whether the treatment effects were equal across different types of area: those where a particular party was in control, those where the seat was competitive, and those with high prevailing levels of turnout compared with those with lower levels. We proposed a number of null hypotheses to test this explicitly.
The findings were largely consistent. First, there was no conclusive evidence that the treatment effect varied significantly between elections, though there was some indicative evidence that the treatment was more effective in the high salience first-order election of 2010. Second, there was no significant variation in the treatment effect across geographical areas. In both 2009 and 2010, whilst turnout varied by ward (the intercept variance), there was no significant variance in the slope (the treatment effect) in any of the multilevel models. We then tested whether there was any variation in the treatment effects along specific dimensions, including party control, the electoral competitiveness of the seat, and the prevailing level of turnout in the area. There was no evidence that the treatment effects varied significantly by the marginality of the seat or by party control. However, there are two significant caveats to this conclusion. First, whilst overall variation was largely insignificant, and the split sample estimates showed that treatment effects in the subgroups did not differ significantly from each other, there were a number of instances where the ITT was significantly different from zero for some subgroups (non-marginal seats, non-Labour incumbent seats and high turnout seats) but not for others. This suggests that the selection of geographic location can make a difference as to whether significant effects are uncovered or not, especially where effects are close to the threshold of statistical significance. Second, in 2010 the overall treatment had a significantly greater impact on turnout in high turnout areas. As some previous research has shown, treatments may be more effective amongst regular previous voters (Green 2004; Niven 2001). At the aggregate level our GOTV treatments did appear to be more effective in higher turnout areas in the higher salience general election. This is consistent with an individual-level inference that it may be easier to nudge those already likely to vote than it is to change the minds of ardent non-voters. However, our results relate to the characteristics of areas, not voters, so it is more accurate to say that campaigning may be most effective in high turnout locations at higher salience elections.
Notwithstanding this, taking the geography of treatment effects as a whole, it does not seem to matter too much where an experiment is conducted: the treatment effects are to all intents and purposes uniform. This has important implications for the external validity of GOTV field studies more generally, especially where single locations are used (which lack variation on key contextual moderators). These findings suggest that the effects of single-location GOTV experiments can be extended to a wide range of locations (within a single election) without serious threat to causal validity. However, researchers should be warned that experiments carried out in high turnout locations at high salience elections are likely to show larger effects than those carried out in low turnout areas. Similarly, campaigners might be interested to know that an additional leaflet or telephone call in a high turnout area may be more effective than the same leaflet or call in a low turnout area, though of course the additional voters may be less likely to be pivotal in those areas. Whilst these findings are important for researchers and campaigners alike, we should stress that there are unanswered questions, not least whether larger samples or different electoral contexts might throw up more statistically significant patterns of variation. Future work based on meta-data could test whether the heterogeneity in existing studies conforms to the patterns found here. Beyond that, nationally representative samples from other countries, including the US, are the natural next step, allowing findings to be compared with this British study.