Across six experiments in two countries, the results lend support to the argument of aid skeptics that foreign assistance will produce accountability demands from citizens equivalent to those produced by natural-resource wealth. We designed the experiments progressively, improving the design from one experiment to the next and addressing possible concerns in each successive redesign. In all six experiments we find no evidence for our alternative hypothesis, inspired by earlier experimental findings, that aid would heighten citizens’ accountability demands compared to oil revenues (Milner et al. 2016; Findley et al. 2017). Meaningful differences do not arise in measured outcomes for subjects assigned to the foreign aid condition compared to the oil revenue condition. In fact, when considered in percentage terms, the differences between the Aid and Oil conditions are so slight that estimated treatment effects are below 1 percentage point for each of our three summary indices (0.64, 0.87, and 0.94 for the survey, lab, and field experiments, respectively). [Footnote 14] These results are extremely small in substantive terms and consistent in magnitude with those found elsewhere (see e.g. Paler 2013). Effect sizes aside, there may be substantive and theoretical objections to our findings. This section discusses those we view as most important.
First, we have not so far examined whether aid and oil differ in plausible causal mechanisms. For example, perhaps subjects are disinclined to demand accountability differently for aid and oil because they see the money resulting in similar end results for public spending. In addition to the behavioral outcomes in the 2014 and 2015 survey experiments, we asked subjects an array of questions probing their perceptions of how the money might be used by politicians for their own, their families’, or their clients’ gain or, alternatively, for the public good. As additional results in Appendix C demonstrate, we cannot reject the null hypothesis of equal effects between aid and oil on subjects’ perceptions of public benefits or their anticipation of leakage to corruption or clientelism. In line with the criticisms of aid skeptics, these results suggest that foreign aid and oil have similar effects on policy, politics, and accountability pressures (de la Cuesta et al. 2019).
This raises a follow-up question: might a different type of aid, say bypass aid through non-governmental organizations, produce different accountability demands? Might it be better aligned with donor intent such that bypass aid evades corrupt governments and strengthens civil society (Dietrich 2013)? An alternative treatment condition in our first set of experiments explored this mechanism directly. In other published research we show results suggesting meaningful differences in outcomes for subjects assigned to an “NGO aid” condition compared to both on-budget aid and oil revenues (de la Cuesta et al. 2019). The differences are not universal—in particular there were few significant effects in Ghana—but compared to the null results for direct-to-government aid vs. oil revenues, the significant results for NGO aid are notable. These results support key claims of aid optimists that different channels of aid delivery may mitigate some of the anticipated negative effects of aid delivered directly to national accounts (Dietrich 2013; de la Cuesta et al. 2019). These results are also consistent with Heinrich and Loftis (2019) that democracy-focused aid, including aid to NGOs, might directly promote accountability demands. Future work might test the effects of aid specifically targeting democracy-promotion on citizen pressures for accountability. These different results for NGO bypass aid again highlight the lack of differences between budget-support aid and oil revenues in prompting citizen action. It is important to note that aid to governments, which includes budget-support aid and all grants and concessionary loans from the World Bank and the regional development banks, has comprised more than half of all foreign assistance in modern history (Tierney et al. 2011). By contrast, bypass aid through NGOs amounts to a comparatively small fraction of foreign aid.
As important as foreign aid and oil revenues are to many lower-income governments, tax revenues also prove vital to nearly all state budgets. The argument that tax revenues produce different accountability demands compared to windfalls motivated the early revenue-and-accountability literature (Huntington 1991; Tilly 1990; Jensen and Wantchekon 2004; Ross 2004). We explored this possibility with alternative experimental conditions comparing tax revenues to aid and oil windfalls. In the 2014 and 2015 survey experiments, taxes did not produce differential demands for accountability compared to windfalls, calling into question the earlier arguments about the superiority of taxes for democratic accountability (de la Cuesta et al. 2019). However, in the 2016 and 2017 lab experiments, we simulated taxation for subjects in an alternative condition by paying subjects a higher wage of 10 MU and then demanding half of the amount as taxes, which was then doubled and given to the Leader as the group fund (making the condition otherwise identical to the aid and oil conditions). This tax simulation did produce higher punishment thresholds than aid or oil, likely because the confiscation of subjects’ actual money generated both loss aversion and greater psychological ownership (Paler 2013; de la Cuesta et al. Forthcoming). [Footnote 15] The differences between the simulated-income-tax condition and aid and oil windfalls were statistically significant but relatively modest substantively. The difference between the survey and lab results on taxation stems from the tax simulation in the lab setting, and it highlights that aid and oil produce no discernible differences in citizen accountability pressures in either type of experiment.
While the results from our six experiments all point to the same conclusion, that foreign aid to governments and oil revenues produce equal accountability demands, a number of potential criticisms of this study remain. First, we examine the perceptions and behavior of individual citizen subjects in controlled laboratory or survey settings, not actual governance outcomes. We believe, as others do, that such perceptions and micro-level behavior are necessary first steps in producing aggregated, macro patterns in politics and policy. Indeed, this was a key contention of the literature openly worrying that aid would fail to produce accountability pressures (Knack 2001, 2004; Bräutigam and Knack 2004; Djankov et al. 2008). Micro foundations are therefore critical to understand in their own right. Nevertheless, it is important to note that the evidence presented here does not reflect directly on macro outcomes, particularly those that are heavily influenced by institutional differences across the two sources, such as the presence or absence of third-party monitoring and enforcement provisions.
Second, and related, while the dependent variable focuses on citizen behavior, the behavior in the laboratory games may not generalize to the real world and the behaviors in the survey experiments may be unrepresentative because they are subject to researcher demand. While all behavior of subjects consenting to participate in a research study—and thus being aware their actions are observed—faces external validity concerns, the variety of outcomes assessed in six experiments provides reassurance that the results retain consistency across the multiple measures that are all plausibly reflective of actual political behavior. In the 2014 and 2015 survey experiments, subjects were invited to sign petitions, send SMS messages, donate money to good-government NGOs, and express their willingness to engage in other political actions. The studied behaviors for the surveys were inspired by actual civil-society campaigns and plausibly reflect real-world propensities, even if the survey setting made the behaviors more immediate and easier to accomplish.
In the lab games, while the setting was necessarily artificial, training and game-play emphasized that the money used and subjects’ actions in allocation and punishment were supposed to reflect actual public funds and political behavior, respectively. Subjects were directly and expressly placed in a mindset in which they were considering public policy and political action. Moreover, their actions in the lab games had real costs for their personal finances, so subjects were motivated to take them seriously. For the field experiment, we deliberately patterned the activities after citizen-information campaigns undertaken by non-governmental organizations and thus invited behaviors—sending SMS messages to officials, donating to NGOs, and requesting information—common in political activism. While researcher demand may affect base rates of outcome behavior, such rates might arguably be similar to participation generated by activists’ requests in NGO campaigns. And, critically, even if the base rates may not perfectly reflect actual rates of political behavior, the quantity of interest in analysis is the difference between rates of behavior across experimental conditions, so researcher demand, to the degree it is present, should pose little threat to causal inference.
Third, four of our six experiments were conducted in one country, Uganda, which may raise concerns about generalizability. But our two experiments in Ghana, which like Uganda depends on aid and oil as well as taxes for the majority of government revenues, strongly reinforce our findings. Results were very similar even in wealthier, more oil-dependent, and more democratic Ghana. Thus, the Ghana findings lend further credence to the broader claims we make: foreign aid to governments and oil revenues appear indistinguishable to citizens in terms of their political effects, and this holds in multiple, broadly representative countries. Relatedly, in the interest of producing the hardest test for a null hypothesis, except where indicated we did not implement a false discovery rate (FDR) correction. With the large number of tests, particularly in the surveys and survey-based field experiment, any method of controlling the false discovery rate would strengthen the evidence in favor of the null still further.
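The point about FDR corrections can be illustrated with a minimal sketch of the Benjamini-Hochberg procedure, one common FDR method (the text does not name a specific one). The p-values below are hypothetical and are not drawn from any of the experiments. Because adjusted p-values are never smaller than the raw ones, a correction can only reduce the number of rejections.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values from five tests (illustration only).
raw = [0.004, 0.03, 0.04, 0.20, 0.60]
adj = benjamini_hochberg(raw)

rejections_raw = sum(p < 0.05 for p in raw)
rejections_adj = sum(p < 0.05 for p in adj)
```

In this hypothetical example, three tests fall below 0.05 before adjustment and only one does afterward, which is the sense in which leaving p-values uncorrected is the harder test for a null finding.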
Fourth, the treatment in the 2014 Uganda and 2015 Ghana surveys, hinging as it does on a few words, may appear weak. While the treatment was designed to approximate the form in which actual voters would learn about government budgets, such as through a newspaper or radio report, the revenue sources were nonetheless only identified in a short prompt and not elaborated at length. Sensitive to this concern, in our 2018 survey-based field experiment in Uganda, we revised the instrument to address treatment strength. We fortified the treatment substantially through providing detailed village- and household-level information, drawing inspiration from civil-society groups and NGOs’ information campaigns. Even with this stronger treatment we achieve very similar results in the field experiment, suggesting that a weak treatment is unlikely to be the cause of the null results in 2014 and 2015.
While the lab experiments and the 2018 survey-based field experiment have relatively stronger treatments, they too could be relatively weak if, for example, subject comprehension was low. Subjects’ pass rates in correctly identifying the revenue source in each study exceed 95% in Uganda and 80% in Ghana in both treatment conditions, suggesting this is unlikely to be a concern (see Appendix C). Because of the size of our pooled samples, all three of our designs are well-powered, with minimum detectable effects (Cohen’s d) of 0.085, 0.171, and 0.138 for the survey, lab, and survey-based field experiments, respectively. These are well below the threshold of 0.2 Cohen’s d taken to be the upper bound for a substantively small effect. The use of covariate adjustment, particularly enumerator fixed effects, also substantially increased the precision of our estimates, making our effective power substantially higher than under a naive difference-in-means test.
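To make the link between sample size and minimum detectable effect concrete, the following is a minimal sketch of the textbook calculation for a two-arm difference in means. It assumes equal-sized arms, a two-sided test at α = 0.05, 80% power, and no covariate adjustment; the text does not state the power level or formula actually used, and the sample sizes below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_d(n_per_arm, alpha=0.05, power=0.80):
    """Minimum detectable Cohen's d for a two-arm difference in means.

    Assumes equal arm sizes, a two-sided test, and a normal approximation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # critical value of the test
    z_power = z.inv_cdf(power)              # quantile for the target power
    # The standard error of a standardized difference is sqrt(2 / n_per_arm).
    return (z_alpha + z_power) * sqrt(2.0 / n_per_arm)


# Hypothetical arm sizes (not the sample sizes of the experiments).
mde_small = minimum_detectable_d(1000)
mde_large = minimum_detectable_d(4000)
```

Under these assumptions, quadrupling the per-arm sample halves the minimum detectable effect. Covariate adjustment that reduces residual variance would lower it further, which is the sense in which effective power exceeds that of a naive difference-in-means test.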
Finally, as with the overwhelming majority of experimental work, all of our hypothesis tests adopt the conventional form in which the null is that the difference between the two conditions is zero and rejection implies a significant difference (in this case, that the accountability pressures generated by aid and oil revenues are not the same). As a robustness check, we also conducted an equivalency analysis (Hartman and Hidalgo 2018), an increasingly popular approach that inverts the conventional null and alternative hypotheses. In equivalency analysis, the null is that there exists a substantively meaningful difference and the alternative is that there is no such difference. This constitutes a markedly harder test: because it begins by supposing that a treatment effect does exist, the burden of proof is on the researcher to reject the null, that is, to present affirmative evidence that there is no meaningful treatment effect. [Footnote 16]
The major degree of freedom for researchers in equivalency analysis is defining the range that is considered a substantively meaningful effect. Because the outcome scales were different across each experiment, we took special care to define the equivalency ranges in the same units as the behavioral index of each experiment so that they would have a natural substantive interpretation. In each case, we erred on the side of choosing the range conservatively, such that rejection of the null hypothesis would only occur for effects that were clearly small in substantive terms.
For the survey experiment, we chose an equivalence range of [-0.1, 0.1]; because the behavioral index was in standard deviation units, this corresponds to a treatment effect of at most 0.1 standard deviations in either direction. This is one-half the size of the 0.2 threshold that is often taken as the minimum substantively meaningful effect. For the lab games, we set a difference of +/- 5% of the mean subject threshold across all conditions as the equivalency range. For context, treatment arms that simulate direct taxation result in a 10% increase in accountability pressures relative to a baseline condition in which the group budget is derived from windfall revenues (see de la Cuesta et al., Forthcoming). We set the same equivalence range of +/- 5% for the survey-based field experiment, though its interpretation is slightly different. Because the behavioral index in this case is the average of four binary measures, this range corresponds to a 5 percentage point average change across all four measures.
At the 95% confidence level (α = 0.05), we reject the null of meaningful difference in the pooled sample for all three of our behavioral indices (p = 0.025, 0.011, and 0.019, respectively). Even with the considerably higher burden of proof required in an equivalence test, we thus find no evidence that the two revenue sources generate differential accountability pressures.
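The logic of an equivalence test can be sketched with the two one-sided tests (TOST) procedure under a normal approximation. This is a generic illustration and not necessarily the exact procedure of Hartman and Hidalgo (2018) or the one used here. The estimates and standard errors below are hypothetical; only the [-0.1, 0.1] range is taken from the survey experiment.

```python
from statistics import NormalDist


def tost_p_value(estimate, std_error, lower, upper):
    """Two one-sided tests (TOST) under a normal approximation.

    Null: the true effect lies outside [lower, upper].
    Alternative: the true effect lies inside the range.
    """
    z = NormalDist()
    # Test 1: evidence that the effect is above the lower bound.
    p_lower = 1.0 - z.cdf((estimate - lower) / std_error)
    # Test 2: evidence that the effect is below the upper bound.
    p_upper = z.cdf((estimate - upper) / std_error)
    # Equivalence requires rejecting both one-sided nulls,
    # so the overall p-value is the larger of the two.
    return max(p_lower, p_upper)


# Hypothetical inputs (not estimates from the experiments).
p_precise = tost_p_value(0.01, 0.04, -0.1, 0.1)  # small standard error
p_noisy = tost_p_value(0.01, 0.10, -0.1, 0.1)    # large standard error
```

With the same near-zero point estimate, the precise estimate rejects the null of a meaningful difference at α = 0.05 while the noisy one does not. This is why equivalence testing is the harder test: an imprecise null result is not enough to establish equivalence.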