1 Introduction

The Social Heuristics Hypothesis (SHH) stipulates that intuitive decisions drive cooperative behavior and that reflective control overrides a cooperative ‘default’ behavior to produce selfish decisions (Bear and Rand 2016; Rand et al. 2014). According to the SHH, intuitive decisions tend to rely on experience from games encountered in everyday life, where interactions typically are repeated and involve opportunities for sanctions; deliberation adjusts behavior to the optimal self-interested response in the situation at hand.

The SHH, however, conflicts with suggestions elsewhere in the literature that deliberative processing supports pro-social decision making (e.g., Achtziger et al. 2015; Martinsson et al. 2012; Stevens and Hauser 2004). Moreover, several studies have failed to find a relationship between pro-social behavior and canonical manipulations of cognitive processes (e.g., Hauge et al. 2016; Tinghög et al. 2013, 2016; Verkoeijen and Bouwmeester 2014). This includes a recent registered replication report by Bouwmeester et al. (2017), which sought to replicate the keystone time-pressure study in Rand et al. (2012) but did not find an effect of time pressure on cooperation. Yet, recent meta-analyses present results consistent with an overall positive effect of intuitive decision processes on cooperation (Rand 2016, 2017a, b). In sum, the literature on intuitive cooperation has grown sharply since the publication of the original time-pressure study by Rand et al. (2012)—but without reaching a resolution.

This paper presents an updated meta-analysis to add clarity to the literature. While we obtain an overall meta-analytic effect of the intuition manipulations on cooperation, we can attribute this effect to a specific class of induction manipulations. These manipulations ask participants to rely on emotion over reason in determining their resource allocation (Gärtner et al. 2018; Levine et al. 2018). Thus, we identify a single source of variation in the effect size that may account for inconsistent conclusions in the literature; when we exclude the six experiments that feature this specific manipulation—comprising just 7% of our total data set—we obtain no effect of intuition on cooperation, and the exclusion also yields a substantial reduction in systematic between-study variation. These results are problematic for the SHH as emotion-induction manipulations are vulnerable to alternative interpretations—and the SHH gives no reason for favoring this class of manipulations over others. Moreover, the dramatic dissipation of systematic heterogeneity, following removal of emotion-induction manipulations, runs counter to the idea that the intuitive cooperation effect, if present, is highly heterogeneous (Rand 2016). We also note that our results cannot be explained by between-study variation in participant compliance rates; we find no evidence that studies with higher compliance rates yield systematically higher effect sizes, speaking against the claim in Rand (2017a, 2019) that non-compliance explains why many studies find no effect of intuition manipulations on cooperation.

Our paper proceeds as follows. First, we present our data set and methods, then the analysis, after which we offer concluding remarks on the cognitive foundations of cooperation and the state of the literature.

2 Data and methods

Our inclusion criteria largely follow those in Rand (2016), who presented a meta-analysis to examine the effect of intuitive decision making on cooperation. The inclusion criteria define relevant experimental games and intuition manipulations. To be included in our meta-analysis, a study has to feature a controlled experiment—with monetary incentives and no deception—that used time pressure, cognitive load, ego depletion, or induction to manipulate intuitive decision making. The set of eligible intuition manipulations follows Rand (2016) exactly.

As for relevant experimental games, we depart slightly from Rand (2016) by focusing on games that capture cooperation in strategic interactions uncontaminated by past or future choices, to ensure a clear interpretation of the dependent variable. Therefore, we include only one-shot, simultaneous-move public goods games and prisoner’s dilemmas. This differs from Rand (2016), who, in addition to simultaneous-move public goods games and prisoner’s dilemmas, also included second-player moves in sequential trust games and decisions from the last round of finitely repeated games. Nevertheless, to gauge how the inclusion criteria affect our results, we perform robustness checks that also include sequential-game decisions. Our final data set comprises 44 of the 51 experiments included in the prior meta-analysis by Rand (2016), as most of his studies fit our inclusion criteria. In addition, we include 36 new experiments featuring 13,189 participants, an increase of 56.9% in the number of studies and of 83.5% in the number of participants. Table A.2 in Supplemental Online Material (SOM) A provides a full overview of the experiments comprising our data set, including the number of participants and details about the game type and manipulation used.

Our inclusion decisions depart from Rand (2016) in two additional respects. First, our main analysis includes studies that informed participants about the time pressure in the experimental instructions. Rand (2016) argues that this introduces a potential comprehension confound; however, such challenges are inherent to these kinds of experiments regardless of when information about time pressure is introduced. Moreover, most of the data using this variant of the time-pressure manipulation originate from Tinghög et al. (2013), who successfully solved the compliance issue plaguing other studies (e.g., Bouwmeester et al. 2017; Rand et al. 2012). For these reasons, we do not see adequate justification for excluding studies that inform participants about time pressure in the experimental instructions.

Second, all of our analyses include participants who did not comply with the experimental treatment, as excluding them would lead to selection bias. What ‘compliance’ means depends on the specific manipulation, and the compliance rate varies by type. Compliance is mostly an issue for the time-pressure manipulation (where non-compliance means failing to respond within the time constraint) and induction manipulations (where non-compliance means failing to follow the instruction to write something in an open text field). Table A.5 in SOM A displays compliance rates by manipulation type.

In his discussion of time-pressure experiments, Rand (2017a) argues that excluding non-compliers provides an improved picture of the effect and that such exclusion is justified by the absence of correlation between observable factors and compliance with the time constraint. However, a re-analysis of Rand et al. (2014), reported in Table A.1 in SOM A, shows that compliant participants are a selected subgroup—consistent with the argument that compliant-only analyses suffer from selection bias (Bouwmeester et al. 2017; Tinghög et al. 2013). Moreover, regardless of the outcome of balance tests, participants could self-select on factors unobservable to the researcher. For this reason, we include non-compliers, and all results must therefore be interpreted as ‘intention-to-treat’ effects. Still, the number of studies and participants featured in our meta-analysis provides high statistical power to detect even very small hypothesized population effect sizes (see SOM B for a detailed power analysis).

We subject our data set to a random-effects meta-analysis, which allows for systematic variation between studies by assuming that each true effect is drawn from a normal population distribution with a common mean and between-study variance (Higgins et al. 2009). This modeling assumption seems reasonable a priori, as several papers argue that the effect is heterogeneous (Mischkowski and Glöckner 2016; Rand 2018; Rand et al. 2014; Strømland et al. 2016). In line with Rand (2016), we use the percentage of the total endowment contributed as our dependent variable, ensuring that our results are directly comparable to those in the previous meta-analysis. For decision problems with a binary choice, such as the conventional prisoner’s dilemma, the dependent variable takes the value 100 if the participant cooperates, and 0 otherwise.
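To make this assumption explicit, the random-effects model can be written in its standard form (our notation here is generic, not taken from the cited papers): for study \(i\) with observed effect \(\hat{\theta}_i\),

$$\hat{\theta}_i \sim N(\theta_i, \sigma_i^2), \qquad \theta_i \sim N(\mu, \tau^2),$$

where \(\sigma_i^2\) is the sampling variance of study \(i\), \(\mu\) is the mean of the population of true effects, and \(\tau^2\) is the between-study variance; a fixed-effect model corresponds to the special case \(\tau^2 = 0\).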

Analytically, our study differs from Rand (2016) in that we pay particular attention to sources of heterogeneity, that is, systematic inconsistency across experiments. When such inconsistency is large, the weighted summary effect produced by a meta-analysis is hard to interpret.

In our meta-analysis, each effect size is computed as the percentage-point difference between the treatment group (intuition condition) and the control group (deliberation condition). This means that the effect-size measure is bounded between − 100 and 100. For studies retrieved from Rand (2016), we use the reported effect sizes and standard errors directly. For Bouwmeester et al. (2017), we follow the same procedure and retrieve the standard errors from the reported data. For other studies not included in either of the aforementioned data sets, we retrieved the data from regression tables where the percentage-point difference between the treatment group and the control group was reported, and we normalized the effect size to correspond to a contribution scale ranging from 0 to 100. For studies where this was not possible (e.g., if the main analysis conditioned on participants’ compliance status and the intention-to-treat effect was not reported), we downloaded the data and ran linear regressions of the normalized contribution rate on a dummy indicator for the intuition condition, using the estimated coefficient as the measure of the treatment effect (this estimator is equivalent to a simple mean difference between the intuition and deliberation conditions). We use robust standard errors in the regressions and construct 95% confidence intervals (effect size ± 1.96 SE, where SE is the standard error of the regression coefficient).
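As an illustration, the following is a minimal sketch of this computation for a single study; the function name, variable names, and simulated data are ours for illustration, not taken from our analysis code. The Welch standard error used here closely matches the heteroskedasticity-robust regression standard error for a binary treatment indicator.

```python
import numpy as np

def effect_size_itt(contrib_intuition, contrib_deliberation):
    """Intention-to-treat effect for one study.

    Inputs are arrays of contributions normalized to a 0-100 scale
    (in a binary game, cooperation is coded as 100 and defection as 0).
    The mean difference equals the OLS coefficient on a treatment dummy.
    """
    t = np.asarray(contrib_intuition, dtype=float)
    c = np.asarray(contrib_deliberation, dtype=float)
    est = t.mean() - c.mean()                        # percentage-point difference
    se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
    ci = (est - 1.96 * se, est + 1.96 * se)          # 95% confidence interval
    return est, se, ci

# Illustrative binary prisoner's dilemma data: cooperation rates of 55% vs 52%.
rng = np.random.default_rng(0)
treat = 100 * rng.binomial(1, 0.55, size=200)
ctrl = 100 * rng.binomial(1, 0.52, size=200)
print(effect_size_itt(treat, ctrl))
```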

3 Results

We start by considering all experiments that meet our inclusion criteria. Figure 1 displays a forest plot of all experiments, including the overall effect with a corresponding 95% confidence interval. To the right of each estimate, we provide design details for the associated experiment.

Fig. 1 Forest plot, all experiments

As Fig. 1 shows, the overall effect of intuition manipulations on cooperation is 2.19 percentage points, and this effect is statistically significant (p = 0.005, Z test). However, the magnitude of the overall effect is only 35.7% of the main effect reported in the prior meta-analysis that excludes non-compliers (Rand 2016) and only 52.1% of the size of the intention-to-treat effect reported in that meta-analysis. This reduction in effect size may reflect the addition of the individual lab estimates from the large registered replication study by Bouwmeester et al. (2017), which finds no effect of time constraints on cooperation. This pattern, in turn, is consistent with the ‘decline effect’ (e.g., Fanelli et al. 2017), whereby the influence of publication bias in an initial study on the meta-analytic estimate dissipates as null-result replications are added.

The overall effect may nevertheless not capture a psychologically relevant parameter; we can attribute 62% of the variation in the above forest plot to systematic differences between experiments (I² = 61.9%, χ²(81) = 212.75, p < 0.001). Moreover, the estimated between-study variance is large (\(\hat{\tau}^{2} = 27.08\)). As an illustration, note that the effect sizes range from − 9 to 32 percentage points. In summary, the analysis suggests an overall positive effect, but the included experiments exhibit very large variation in effect sizes, and that variation may, to a large degree, be attributed to factors other than chance. As an overall effect size provided by a random-effects analysis is insufficient to summarize a heterogeneous set of studies (Raudenbush and Bryk 1985), our summary estimate should be interpreted with caution.
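For reference, the I² statistic follows directly from Cochran’s Q statistic and its degrees of freedom; plugging in the values above recovers the reported figure:

$$I^2 = \frac{Q - df}{Q} = \frac{212.75 - 81}{212.75} \approx 61.9\%.$$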

When a meta-analysis suggests large between-study variation, it is common practice to search for the sources of that variation (Higgins et al. 2003a, b). In our case, the observed heterogeneity may have several explanations. One possibility is that the intuitive cooperation effect is contingent on various background factors, as suggested in several papers (Capraro and Cococcioni 2015; Mischkowski and Glöckner 2016; Rand et al. 2014; Strømland et al. 2016), including Rand’s (2016) meta-analysis. Another possibility is that various manipulations, here grouped together as ‘intuition manipulations’, may work in different ways or even capture distinct psychological processes. That is, one may ask whether the observed inconsistency across studies is attributable to genuine and perhaps unpredictable variation in the underlying effect across study populations, or whether it is a by-product of the inclusion criteria. To distinguish between these possibilities, we turn to an analysis that separates manipulation types.

3.1 Comparing manipulations: meta-regressions

We use meta-regressions (see, e.g., Thompson and Higgins 2002) to compare the intuitive cooperation effect across manipulation types. We take experiments with time pressure as the baseline, since time pressure is the most frequently applied intuition manipulation. In SOM A (see Figures A.1–A.7), we provide meta-analyses specific to each manipulation type. In all individual meta-analyses but one, there is substantially less systematic between-study variation than in the overall analysis. The exception is the analysis for induction manipulations (see Figure A.4), where the estimated heterogeneity is 83.1%, which is very high (Higgins et al. 2003a, b); this indicates that most of the observed variation is attributable to genuine differences in the underlying effect across studies of this type. For this reason, we split induction manipulations into the following subcategories: (i) ‘emotion-induction’ manipulations instructing participants to rely on emotion over reason when making their choices, (ii) ‘recall induction’, and (iii) ‘other induction’ manipulations. The meta-regression results are displayed in Table 1. It is important to note that these regressions capture correlations, as we only have within-study randomization and no exogenous between-study variation.

Table 1 Meta-regressions of effect size (intuitive cooperation effect) on manipulation type
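The estimation behind Table 1 can be sketched as follows. This is a simplified illustration under stated assumptions: the data frame is invented, time pressure serves as the omitted baseline category, and we plug in a fixed between-study variance, whereas full meta-regression software would re-estimate the residual between-study variance (e.g., by REML).

```python
import pandas as pd
import statsmodels.api as sm

# Invented study-level data for illustration only.
studies = pd.DataFrame({
    "effect": [2.1, -0.5, 15.0, 3.0, 1.2],  # percentage-point effect sizes
    "se":     [1.5,  2.0,  3.0, 2.5, 1.8],  # their standard errors
    "manip":  ["time", "time", "emotion", "other_induction", "load"],
})

tau2 = 27.08  # between-study variance; here taken from the overall analysis

# Dummy-code manipulation type, with time pressure as the omitted baseline.
dummies = pd.get_dummies(studies["manip"]).drop(columns="time").astype(float)
X = sm.add_constant(dummies)

# Random-effects meta-regression: weight each study by the inverse of its
# total variance (within-study sampling variance + between-study variance).
fit = sm.WLS(studies["effect"], X, weights=1.0 / (studies["se"] ** 2 + tau2)).fit()
print(fit.params)  # constant = baseline (time-pressure) effect;
                   # dummy coefficients = differences from that baseline
```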

The meta-regressions yield several noteworthy results. First, Column (1) shows that only experiments using emotion-induction manipulations are significantly more effective in promoting cooperation than time-pressure studies (coefficient = 14.88 percentage points, t(76) = 7.01, p < 0.001); the other manipulations do not differ significantly from the small and non-significant effect estimated for the time-pressure studies (coefficient = 0.619, t(76) = 0.80). It is also noteworthy that ‘other induction’ manipulations yield an estimated effect very close to that of time-pressure studies, a difference of a mere 2.57 percentage points (t(76) = 0.81). Column (2) takes emotion-induction manipulations as the baseline and shows that all other manipulations are significantly less effective in promoting cooperation. Consistent with this, in Column (4), both time-pressure (t(79) = − 7.12, p < 0.001) and ‘pooled’ manipulations (t(79) = − 6.01, p < 0.001) are estimated to reduce the effect size by about 14 percentage points relative to emotion-induction manipulations. Together, these results justify our subdivision of the wider class of induction manipulations.

A funnel plot of all studies in the main analysis (see SOM A, Fig. A.11) illustrates the relative effectiveness of the manipulation types; five of the six experiments using emotion-induction manipulations appear as outliers, falling to the right of the 95% confidence bounds.

While Rand (2016) suggests that time-pressure manipulations are less effective than other manipulations, our results indicate that only emotion-induction manipulations differ in their effect from the rest. We therefore proceed to test whether our overall meta-analytic effect depends specifically on the emotion-induction manipulations; we conduct an alternative meta-analysis that includes all studies other than the six experiments using emotion-induction manipulations. This meta-analysis (see Fig. A.10) reveals no discernible overall effect on cooperation; the estimated meta-analytic effect is 1 percentage point (p = 0.076, Z test), and, judged by conventional classifications (Higgins et al. 2003a, b), heterogeneity is also quite low (I² = 19.8%, χ²(75) = 93.50, p = 0.073, \(\hat{\tau}^{2} = 4.43\)). Because time-pressure studies have been called into question, both for the size of their effect (Rand 2016) and for their validity (Myrseth and Wollbrant 2017), we also run a meta-analysis that excludes all emotion-induction and time-pressure manipulations, evaluating all other manipulations in the same test (Fig. A.9). In this meta-analysis, the estimated effect of the intuition manipulations is 1.62 percentage points—only 26.4% of the main effect reported in Rand (2016) and only 38.6% of that study’s intention-to-treat estimate—and not significantly different from zero (p = 0.177, Z test).

To ensure that our conclusions are not sensitive to the inclusion criteria, we undertake additional robustness checks, using various combinations of Rand’s (2016) inclusion criteria while excluding the emotion-induction studies. In all tests, we follow Rand and include data on second movers and last-round moves in finitely repeated games. We also undertake robustness tests that include data on trust game decisions, and tests that include second-mover decisions only where the first mover contributed the maximum amount possible (as in Rand 2016). We carry out these robustness checks both for the specification excluding emotion-induction and time-pressure studies (Fig. A.9) and for the specification excluding only the emotion-induction studies (Fig. A.10). None of these robustness checks reveals a statistically significant overall effect; the estimated effect is consistently very small and insensitive to the inclusion criteria (see Table A.3 for details). Finally, it is worth noting that a separate meta-analysis of pre-registered studies only (Bouwmeester et al. 2017; Camerer et al. 2018; Everett et al. 2017) leads to a similar conclusion; the effect size in this meta-analysis is just 0.79 percentage points and not statistically significant, and the estimated heterogeneity is low (see Fig. A.12).

A possible interpretation of our null result is that the ‘true’ effect size is very small, and that our result, when excluding emotion-induction manipulations, is a false negative. However, this interpretation would prove equally challenging for existing studies that report evidence of intuitive cooperation. Suppose that our upper bound on the effect size—1.8 percentage points across these eight specifications—represents the true effect size. Then, for a single study to have 80% power to detect the underlying effect, one would need a sample size of at least 15,486 participants (assuming a common standard deviation of 40 across treatment groups). Should the effect size instead be 1 percentage point, as in Fig. A.10—which also corresponds closely to the effect size obtained using only pre-registered studies (see Fig. A.12)—one would need a sample size of at least 50,176 participants for a single study to achieve 80% power. Thus, even if our main finding were a false negative, the mean effect size in this literature is so small that meaningfully studying it would require sample sizes an order of magnitude larger than those typically used in experimental studies. Any statistically ‘positive’ finding in this literature, obtained with typical sample sizes, would therefore likely represent a major overestimate (Gelman and Carlin 2014).
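These sample-size figures can be reproduced with the standard two-sample formula for a difference in means (two-sided α = 0.05, 80% power); the sketch below is ours, and the small discrepancies from the figures in the text reflect rounding of the z-values.

```python
from scipy.stats import norm

def required_total_n(delta, sd=40.0, alpha=0.05, power=0.80):
    """Total sample size (both groups combined) for a two-sample z-test
    with equal group sizes and a common standard deviation `sd`,
    to detect a mean difference `delta` in percentage points."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)  # about 1.96 + 0.84
    n_per_group = 2 * (z * sd / delta) ** 2
    return 2 * n_per_group

print(required_total_n(1.8))  # about 15,500 (15,486 in the text, with z rounded)
print(required_total_n(1.0))  # about 50,200 (50,176 in the text, with z rounded)
```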

3.2 Alternative explanations

Rand (2019) responds to a pre-print version of our analysis by undertaking his own updated meta-analysis, using a combination of the data from Rand (2016) and the data from our paper. His main argument is that our choice to exclude sequential games from the main analysis is responsible for the null effect obtained when we exclude emotion-induction manipulations. However, this cannot be the reason for the discrepancies between his new findings and ours: Table A.3 in our supplementary materials shows that our results are insensitive to the differences in inclusion criteria between Rand (2016) and our study.

Rand (2019) further argues that poor experimental designs may account for the many null findings in the literature. He suggests that future studies should move towards experimental designs that increase the compliance rate and comprehension of the game, and he expects these design features to be associated with substantially larger treatment effects. On the latter point, we note that the registered replication report by Bouwmeester et al. (2017) undertook a high-powered test of the hypothesis that comprehension moderates the time-pressure effect; it found no time-pressure effect in the comprehending subgroup. As for the hypothesis that greater compliance is associated with a greater effect size, we are not aware of prior tests in the literature, so we undertake one here. Because compliance varies between manipulation types, we also undertake a separate test for studies using time-pressure manipulations. Figure 2 presents a scatter plot of compliance rate against effect size for all studies included in our meta-analysis.

Fig. 2 Study-level compliance rate against observed effect size

As Fig. 2 shows, there is no obvious relationship between a study’s compliance rate and its observed effect size, either across all manipulations or specifically for the time-pressure manipulations. In a meta-regression, the estimated correlation is small and positive but not statistically significant, for both the full sample and the sample of time-pressure studies (regression results in Table A.4). Based on the available evidence, therefore, it seems unlikely that a movement towards studies with higher compliance rates would have a major impact on effect sizes in this literature.

An alternative way of addressing the role of compliance is to run the meta-analysis for compliant participants only (so that each study’s effect size is computed only for participants who complied with the treatment), as in the main analysis of Rand (2016). We report such an analysis in Fig. A.13, where we run the meta-analysis for all studies, including compliant participants only. This analysis yields a positive and statistically significant association between the intuition manipulations and cooperation, even when excluding emotion-induction manipulations. However, conditioning on compliance status amounts to a ‘bad control problem’: a treatment effect conditional on potentially endogenous variables warrants a causal interpretation only under quite restrictive assumptions (Montgomery et al. 2018). Specifically, the analysis assumes that compliance, which occurs after randomization, does not systematically affect the relative composition of the treatment and control groups. This assumption is unwarranted, however, as conditioning on compliance may plausibly change the composition of the treatment and control groups differentially, such that the groups are no longer directly comparable. And, as seen in Table A.1, there is empirical evidence for the selection-bias argument: data sets in this literature indicate self-selection into who complies or fails to comply with the assigned treatment. Finally, we note that an absence of imbalance would not in itself amount to evidence against the selection-bias argument, as balance tests do not have 100% statistical power—and not all factors imbalanced between treatments are measured. In including non-compliant participants in our main analysis, we also follow recent meta-analyses in this literature (Fromell et al. 2018; Köbis et al. 2019; Rand 2019).

4 Conclusion

We present an updated meta-analysis of experiments that attempt to manipulate intuitive decision-making processes in games of cooperation. Our analysis tests the Social Heuristics Hypothesis (SHH), which stipulates that intuitive decision-making processes facilitate cooperative behavior. In examining both the overall meta-analytic effect and the origin of the between-study heterogeneity, we fail to obtain robust evidence for the SHH. Although we find evidence in favor of an overall positive effect of intuitive decision processes on cooperation, we can attribute this effect to a particular class of emotion-induction manipulations—those asking participants to rely on emotion over reason when determining their allocation. Other manipulation types fail to yield a statistically discernible effect on cooperation. When we exclude the six studies with this manipulation type and conduct a meta-analysis on the remaining 76 studies, which comprise 93% of the observations in our full data set, we find no effect of intuition manipulations on cooperation.

The consistency in findings across all manipulation types, save the emotion-induction manipulations, suggests that the latter produce a distinct effect. One possibility is that the transparency of the researchers’ intention in this setting—asking people to rely on emotion over reason—is understood as a request that participants cooperate, akin to an experimenter demand effect. A request to use your ‘heart’ could be seen as encouragement to be ‘nice’, whereas a request to use your ‘brain’ may indicate that you should try to calculate personal consequences (and not be gullible). The demand effect is less likely to apply to the other intuition manipulations (e.g., time pressure), as in those cases the link between the researcher’s hypothesis of interest and the treatment is less transparent. While a laboratory participant asked to decide within 10 s might suspect that the study is about the relationship between cooperation and making decisions fast or slow, the direction of the research hypothesis is not evident. Notably, direct requests that strongly signal potential underlying research objectives have been shown to strengthen experimenter demand effects (de Quidt et al. 2018).

An alternative, but perhaps less plausible, possibility is that emotion induction is the only class of manipulations that successfully influences intuitive decision making. However, even if this alternative interpretation were true, it is worth noting that the SHH (Rand et al. 2014; Bear and Rand 2016) gave no a priori reason why this manipulation should work while the others should not. Relatedly, one might wonder whether failure to comply with experimental instructions could account for our results, as compliance varies with study type. However, we find no evidence for the hypothesis, put forward by Rand (2019), that studies with higher compliance exhibit higher effect sizes.

We also fail to find support for the idea that the underlying effect is highly heterogeneous (Rand 2016), as the removal of emotion-induction experiments from the meta-analysis reduces estimated between-study heterogeneity dramatically. This finding is consistent with the low between-study variation observed in the meta-analysis by Fromell et al. (2018), who study the effect of intuition manipulations on dictator game giving. We cannot rule out the possibility that we are underpowered to detect study-level heterogeneity, but it does appear that the meta-study by Rand (2016) overstates the importance of study-level heterogeneity for the effect of intuition manipulations. Nevertheless, tests for heterogeneity between studies will not necessarily pick up genuine individual-level heterogeneity, if such individual characteristics tend to be similar across study populations, and some studies argue that such individual-level heterogeneity is important for the link between intuition and cooperation (e.g., Alós-Ferrer and Garagnani 2018). One recent study on time-pressure effects in the dictator game tests more directly for such individual-level heterogeneity (across a large set of potentially relevant variables) and finds little evidence for it (Strømland and Torsvik 2019).

As our study focuses on cooperation, we cannot rule out that intuition influences other forms of pro-social behavior. According to Rand et al. (2016), the SHH also predicts intuitive altruism in women, but not in men. While their meta-analysis finds support for this prediction, a more recent meta-analysis by Fromell et al. (2018) finds a negative effect of intuitive decision processes on altruism for men and no effect for women.

At a more general level, our findings also speak to the current discussion on heterogeneity in effect sizes in psychology and economics (DellaVigna and Pope 2018; Klein et al. 2014; McShane and Böckenholt 2014; van Aert et al. 2016). Meta-analyses in psychology typically suggest substantial systematic heterogeneity in effect sizes (Stanley et al. 2018), but the recent ‘Many Labs’ projects find relatively low systematic variation in effect sizes across contexts and cultures (Klein et al. 2014, 2018). Consistent with this, DellaVigna and Pope (2018) find that effect sizes tend to be more stable across settings than expert forecasts predict. Our meta-analysis is consistent with these findings, and it shows that estimated treatment-effect heterogeneity in meta-analyses can be surprisingly sensitive to inclusion criteria; when we include the emotion-induction manipulations, heterogeneity is high, but when we exclude them, heterogeneity is low. Our evidence thus highlights the possibility that some of the heterogeneity reported in meta-analyses arises from researchers’ inclusion decisions rather than from genuine variation in the effects under scrutiny.