We first provide descriptive evidence that children successfully understood and played a strategic interaction game in the form of an experimental BCG. We then turn to the results regarding important determinants of performance in the BCG and discuss our findings. Finally, we report the results from our replication study with an adult sample.
Descriptive results
As noted, we had 114 participants in our sample, 63 of whom were female. Distributions over the classes and grades as well as summary statistics for fluid IQ and understanding of the game can be found in Table A1 in the Appendix.
To assess how well children understood the game, each child had to explain the five steps of the game to an experimenter. For each step, a child’s level of understanding was rated by the experimenter with a maximum of four points, i.e., a child could achieve at most 20 points (see Sect. 2.3). Overall, 90 children (80%) were rated with 19 or 20 points, and only five children (4%) received fewer than 17 points (see Table A1 in the Appendix).Footnote 9 Hence, experimenter ratings indicate a very high level of understanding of the game.
To analyze children’s choices in the BCG, we compare the results from the present study with other data from experimental BCG studies. First, in Fig. 2 we plot mean choices (for median choices, see Tables 1 and A2 in the Appendix) over the 10 rounds and compare them with the choices from Nagel’s seminal paper (1995, using sessions with \(p=2/3\) and sessions with \(p=1/2\)) as well as with our own replication with an adult sample (for details on the replication with adults, see Sect. 3.4).Footnote 10 Generally, we find that the average number chosen by children decreases over the 10 rounds, with the bulk of the decrease occurring in rounds 1–6. Thus, children’s choices seem to converge toward the game-theoretic equilibrium over time, a finding that is well established in other studies with adult samples (e.g., Nagel, 1995; Duffy & Nagel, 1997; Ho et al., 1998). In addition, children’s choices mimic both the level and the rate of decrease found in Nagel (1995); more precisely, the children sample is very close to the sessions with \(p=2/3\) in the Nagel study, indicating that children’s choices start at a higher level (but decline at a similar rate). In other words, children seem to play a BCG with \(p=1/2\) using the new design proposed in our study in a manner very similar to the way that adults play the classical \(p=2/3\) experimental BCG.Footnote 11 Table 1 reports the detailed values comparing the first four rounds of the game across the four samples. Mean and median numbers for the children are largely comparable with the numbers from Nagel (1995) in the sessions using \(p=2/3\) (testing children’s choices in rounds 1–4 against choices in the corresponding rounds in the Nagel \(p=2/3\) sample reveals no significant difference, Mann–Whitney U tests for each round, all \(p > .164\)).Footnote 12
Table 1 Comparison of Mean and Median Numbers with Nagel (1995)

Table 2 Comparison of depth of reasoning with Duffy and Nagel (1997)

Second, to further support the notion that children played the new design of the experimental BCG in a way that is largely comparable with adults, we benchmark our data against results on the average “depth of reasoning” from Duffy and Nagel (1997). We apply Duffy and Nagel’s definition of the depth of reasoning by calculating the depth of reasoning d for individual i in group j in round t by solving:
$$\begin{aligned} number_{i,t} = median_{j,t-1}\cdot p^{d_{i,t}} \end{aligned}$$
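Rearranging gives \(d_{i,t} = \ln (number_{i,t}/median_{j,t-1}) / \ln p\). The computation can be sketched as follows (a minimal illustration; the function name and example values are ours, not taken from the study):

```python
import math

def depth_of_reasoning(number, prev_median, p=0.5):
    """Solve number = prev_median * p**d for d (cf. Duffy & Nagel, 1997).

    With p < 1, choosing the previous median itself gives d = 0, and
    choices below the previous median imply d > 0.
    """
    return math.log(number / prev_median) / math.log(p)

# A child choosing exactly half of the previous median (with p = 1/2)
# displays a depth of reasoning of d = 1:
print(depth_of_reasoning(20, 40))  # 1.0
```

Non-integer values of d are then assigned to discrete levels via the interval classification used by Duffy and Nagel.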
Table 2 reports the values for our children sample as well as our replication with an adult sample and the comparable numbers from Duffy and Nagel (1997). Note that for this comparison, the game parameters of both designs match exactly, i.e., the data from Duffy and Nagel (1997) are also based on \(p=1/2\) and the median (subjects in their study were undergraduate university students). Table 2 clearly shows that, in all three samples, the majority of players chose numbers in the range of \(d=1\) (except for round 4 in the Duffy and Nagel sample), but in Duffy and Nagel slightly more mass lies on higher values of d than in our design. Testing whether the distributions of depth-of-reasoning levels are equal across samples reveals that children’s choices differ significantly from the Duffy and Nagel sample in round 1 (\(\chi ^2 = 11.66, p = .040\)) and in the subsequent rounds 2–4 (all \(\chi ^2 > 24.51\), all \(p < .001\)), with more probability mass at lower levels of depth of reasoning. Overall, adult university students in the Duffy and Nagel sample played with a slightly higher depth of reasoning than the children in our sample; however, considering that we played the BCG with children aged 9–11 years, the distributions of d are surprisingly comparable.Footnote 13
Combining the similarity of choices over the rounds with the fact that these choices were taken based on a correct understanding of the rules of the BCG, we derive our first main result:
Result 1
Using a new design of the experimental BCG, children aged 9–11 years are able to understand and play a strategic interaction game like the experimental BCG used in many other studies with adults. While children start with slightly higher numbers than adults, the rate of decrease and depth of reasoning are, by and large, comparable with values for adults.
Determinants of successful performance
Now, we turn to the question of whether fluid IQ, an important part of cognitive skills, is associated with high strategic interaction skills, i.e., successful performance in a BCG.Footnote 14 Importantly, all analyses presented are of a correlational nature; based on our study design we cannot make any causal claim. Nevertheless, we believe that the analyses provide interesting insights into the importance of fluid IQ for choices in an experimental BCG, once the abstract version of the BCG is transformed into an easy-to-understand board game.
We present results from ordinary least squares (OLS) regressions; all models control for group fixed effects (i.e., we only look at differences within a group of five children playing the BCG) as well as gender and age. All standard errors are clustered at the group level. Gender is a binary variable; age is measured in years (but varies with a day-to-day precision between children). Fluid IQ is standardized to mean = 0 and SD = 1 to make interpretation easier (details on how we formed our outcome variables can be found in Sect. 2.4).
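The standardization of fluid IQ can be sketched as follows (a minimal illustration with invented raw scores; the study’s actual variable construction is described in its Sect. 2.4):

```python
from statistics import mean, stdev

def standardize(scores):
    """z-standardize raw scores to mean 0 and SD 1 (using the sample SD).

    A one-unit change in the standardized score then corresponds to a
    one-SD change in raw fluid IQ, which eases interpretation of
    regression coefficients.
    """
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]

# Invented raw Raven's scores, not the study's data:
z = standardize([5, 12, 14, 15, 16, 18])
print([round(v, 2) for v in z])
```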
Fluid IQ was measured using a Raven’s CPM test. We collected fluid IQ scores for all 114 children in our sample. Raw scores range from 5 to 18 and show large variation. The average score is 14.7, median score is 15, and the standard deviation is 2.8 (the full distribution of scores can be found in Figure A1 in the Appendix). Therefore, we believe that we have a meaningful measure of fluid IQ to proxy an important part of general cognitive skills.
In a first step, we analyze the link between fluid IQ and the probability of choosing a weakly dominated number, i.e., a number larger than 50.Footnote 15 We estimate the likelihood of choosing a weakly dominated number in the first round (in which no other behavior has been observed prior to the choice) and over all 10 rounds of the game, using both linear probability and probit models. As documented in Table A3 in the Appendix, there is a negative correlation between fluid IQ and the likelihood of choosing a weakly dominated strategy (statistically significant in three out of four specifications). The effect of fluid IQ is also large in size: moving from an average fluid IQ to a fluid IQ one SD above the average substantially decreases the likelihood of choosing a weakly dominated strategy in the first round (from 27.1% to 16.7%, see column (2) in Table A3).Footnote 16
Next, we investigate the relationship between the number chosen in the first round and fluid IQ. Previous studies show a negative relationship between cognitive skills and first-round choices in BCGs, i.e., individuals with higher cognitive skills choose closer to the game-theoretic equilibrium of zero (Burnham et al., 2009; Carpenter et al., 2013). However, in our setting there is no significant relationship between fluid IQ and the number chosen in the first round (for more detailed findings and a short discussion of first-round choices, see Sect. B and Table A4 in the Appendix).
Table 3 Determinants of Successful Performance in the Game

To analyze the predictive power of fluid IQ for strong strategic interaction skills, we first examine the number of coins a child won, i.e., how many rounds the child won during the experimental BCG. In all subsequent analyses, we exclude results from the first round because no prior interaction has taken place there and the children cannot condition their choices on the observed behavior of their peers. Table 3, column (1), presents our findings from regressing the number of coins won on the dispositional characteristics of the children. It shows that neither gender, age, nor fluid IQ significantly explains variation. Hence, in this setting, cognitive skills, measured as fluid IQ, are not related to successful performance in the experimental BCG.
Using the number of coins (or rounds) won is simple and straightforward, but there is a caveat: this measure disregards any difference in children’s performance apart from being “the best” in a given round. For example, in these analyses, a child who fails to win a coin by only one step is treated the same as a child who misses half the median by 30 or more steps, even though the latter child clearly performed much worse. Thus, we also present an analysis accounting for the variation in performance among all five children in a group, not only between the winning child and the non-winning children. To do this, we calculate the “distance to the best response”, that is, how far a child is from the choice that would have made him or her win the round, given the other children’s choices (see Sect. 2.4 for details). Because this measure is very heterogeneous both across rounds and across groups, we rank children within groups and rounds based on their distance to the best response (i.e., as noted in Sect. 2.4, the child with the shortest distance—the winning child—receives rank 1, the child with the second-shortest distance receives rank 2, and so on). We can then calculate an average rank for each child over rounds 2–10, with a “good performance” corresponding to a low average rank. We report the results from regressing average rank on the personal characteristics of the child in Table 3, column (2). Our findings confirm the results from column (1): neither gender, age, nor fluid IQ is related to average rank over the rounds. In column (3), we analyze the average distance instead of the rank—results are very comparable.Footnote 17
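The within-group ranking can be sketched as follows. As a simplification, we take the target to be p times the realized median; the paper’s distance-to-best-response measure (its Sect. 2.4) additionally accounts for each child’s own influence on the median. All numbers are invented for illustration:

```python
import statistics

def rank_by_distance(choices, p=0.5):
    """Rank children within one round by distance to the target number.

    Returns ranks aligned with `choices` (rank 1 = closest to the
    target, i.e., the winning child in this simplified version).
    """
    target = p * statistics.median(choices)
    # Indices sorted by absolute distance to the target:
    order = sorted(range(len(choices)), key=lambda i: abs(choices[i] - target))
    ranks = [0] * len(choices)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

# Hypothetical round with five children (median 40, target 20):
print(rank_by_distance([10, 25, 40, 60, 80]))  # [2, 1, 3, 4, 5]
```

Averaging these per-round ranks over rounds 2–10 yields the average-rank outcome used in Table 3, column (2).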
Next, we analyze children’s behavior over the rounds to understand how children adapt their choices over time. To do so, we use the panel structure of our data, setting the child as the panel variable and rounds 1–10 as the time variable. We cannot use conventional linear panel models in our setting because the unobserved panel-level effects are very likely correlated with the lag of the dependent variable, in our case the number chosen in the previous round. We therefore use the Arellano–Bond estimator (Arellano & Bond, 1991), which instruments the first difference of interest (here, the difference between the choices in rounds t and \(t-1\)) with the second (and higher-order) lags (e.g., choices in round 5 are instrumented by choices in rounds 3, 2, and 1).Footnote 18
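The instrumenting logic can be illustrated with a toy series for one child (numbers invented; an actual estimation would use a dedicated dynamic-panel routine rather than this sketch):

```python
# One child's choices over five rounds (invented numbers, for illustration):
numbers = [80, 55, 40, 28, 20]          # rounds 1..5

# First-differencing removes the child-level fixed effect:
diffs = [b - a for a, b in zip(numbers, numbers[1:])]   # Δy_2 .. Δy_5

# In Δy_r = β·Δy_{r-1} + Δε_r, the regressor Δy_{r-1} is correlated with
# Δε_r; Arellano-Bond therefore instruments it with levels dated r-2 or
# earlier, which are uncorrelated with Δε_r absent serial correlation.
rows = []
for r in range(3, len(numbers) + 1):    # usable rounds 3..5
    rows.append({
        "dep": diffs[r - 2],            # Δy_r
        "endog": diffs[r - 3],          # Δy_{r-1}, to be instrumented
        "instruments": numbers[:r - 2], # levels y_1 .. y_{r-2}
    })
print(rows[-1]["instruments"])  # [80, 55, 40]: rounds 1-3 instrument round 5
```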
Table 4 presents results for the relationship between the number in round t and the following variables: number chosen in round \(t-1\), winning number in round \(t-1\), and position of the goblin in \(t-1\). We see that (1) children on average choose lower numbers across rounds (the coefficient of the number in round \(t-1\) is < 1 in all models, confirming the results from our previous analyses), (2) children do—to some degree—“stick” to the number they chose in the previous round (the influence of the choice in round \(t-1\) is significant in all specifications), but (3) are more strongly influenced by the position of the goblin in the previous round (column (3) clearly shows that the position of the goblin in round \(t-1\) has the strongest influence on the number chosen in round t). Because Arellano–Bond estimators do not allow for time-invariant controls, we conduct a median sample split in order to at least qualitatively analyze differences in learning behavior with respect to fluid IQ. Column (4) is based on the \(n = 48\) children in our sample with a below-median score in the fluid IQ test, and column (5) uses the remaining \(n = 66\) children (at or above the median IQ score). Results from these two models point to two potential mechanisms through which fluid IQ might affect learning behavior. First, children with lower fluid IQ seem to “stick” more strongly to their choices from previous rounds (comparing the coefficients of Number Chosen in \(t-1\)). Second, the findings suggest that both groups pay attention to the position of the goblin in round \(t-1\) when choosing a number in round t, but that only children with higher fluid IQ may also account for the position of the winning child (although this effect is not statistically significant).
The latter strategy, however, seems relevant for successful game performance: when forming beliefs about the other players’ next-round choices, it potentially matters whether the winning child was above or below the goblin’s position (i.e., 50% of the median number). Of course, given the sample size and power, these findings have to be interpreted cautiously and can only point to interesting mechanisms for further research.
Table 4 Prior choices and learning over rounds

Finally, we also estimate how a child’s depth of reasoning is related to individual characteristics. Note that a high level of depth of reasoning is not in itself a predictor of good performance in the game—children could display an excessively high depth of reasoning and “outsmart themselves” by choosing numbers that are too low (cf. Kocher & Sutter, 2005). Table A5 in the Appendix reports that there is no relationship between gender, age, or fluid IQ and a child’s average depth of reasoning across rounds. Thus, we can state our second main result:
Result 2
In our new design of the experimental BCG, cognitive skills—measured as fluid IQ—are related neither to first-round choices nor to successful performance; they only predict the choice of weakly dominated strategies. Similarly, fluid IQ is not associated with a child’s average depth of reasoning.
Taken together, our findings indicate that strategic interaction skills in the new design of the experimental BCG are not linked to gender or age (within our age range of 9–11 years). Fluid IQ is only relevant in predicting whether children choose weakly dominated strategies but is not associated with first-round choices, successful performance, or higher depth of reasoning in the game. To support the stability of our findings, we conducted several robustness checks (excluding children with low understanding, excluding weakly dominated choices, and estimations without group fixed effects). In all versions, our findings remain stable (for details, see Section C in the Appendix).
Discussion
Previous studies have generally demonstrated positive links between cognitive skills and lower entries in the experimental BCG (Burnham et al., 2009; Brañas-Garza et al., 2012; Carpenter et al., 2013). We also find a link between cognitive skills, measured as fluid IQ, and choosing weakly dominated strategies, replicating findings from Burnham et al. (2009, p. 172). Yet, Burnham et al. (2009) also document a relationship between higher cognitive skills and lower numbers chosen in a one-shot BCG, which we cannot replicate (first-round choices are not linked to fluid IQ, see Table A4 in the Appendix). We believe that the specific measure of cognitive skills may help explain this: Brañas-Garza et al. (2012) use a Raven’s IQ test, as we do in our study, and also find no relationship between fluid IQ and choices in an experimental BCG (yet, they do report a significant link to the Cognitive Reflection Test). Moreover, our sample consists of children aged 9–11 years, and it is possible that, for this age group, other abilities simply matter more than IQ; however, this explanation is made less likely by the fact that in our replication with an adult sample, there is no significant relationship between fluid IQ and successful performance (see Sect. 3.4). This leads us to the explanation that we consider most plausible: the difference between our results and previous findings could be driven by the fact that in all these studies, the instructions for the BCG were abstract and, therefore, cognitive skills were more important (or even a prerequisite) for understanding the mechanisms of the game. In our setting, the instructions are far more concrete and the game itself has a visual and spatial representation. In other words, choices and their consequences are mapped onto concrete and observable operations.
Thus, our speculative hypothesis is that the new design of the experimental BCG lowers the importance of cognitive skills for successful performance in the game. Note that by removing the requirement to translate abstract instructions into concrete operations, we can study actual behavior in strategic interaction settings in a much more focused way. Indeed, real-world strategic interaction is often characterized by repetition, observable behavior, and concrete outcomes, as well as possibilities to learn from one’s choices. Hence, removing (or lowering the demand for) this abstract component from the experimental BCG might actually increase external validity for real-world strategic interaction.
Two methodological challenges arise with respect to the lack of significant relationships between fluid IQ and successful performance. First, it could be due to restricted or limited variance. If, for example, classes or groups were very homogeneous with respect to fluid IQ levels, this could (partially) explain why there is no significant link between fluid IQ and performance. To address this concern, we checked the variance of the results from the Raven’s Matrices task within our sample. For the whole sample, the variance in raw scores for fluid IQ amounts to 8.0 points. Calculating the within-class variance and then averaging over these within-class variances for all classes (weighted by class size) results in an average variance at the class level of 6.4 points.Footnote 19 Because we assigned children randomly to groups (within a class), we expect the average variance at the group level not to differ from that at the class level. Indeed, the average within-group variance amounts to 6.6 points. Finally, we can compare the variance in our study with figures from a different study using the same test for fluid IQ (Berger et al., 2020). In four testing waves with a sample of more than 500 German primary schoolchildren aged 7–9 years, that study finds variances of 8.3, 7.1, 8.1, and 6.3 points, respectively. Thus, the variance of fluid IQ in our sample (and within our groups) is substantial and in line with other, much larger samples of schoolchildren. An alternative way of testing for within-group-restricted variance as a potential explanatory factor is to estimate the OLS models from Table 3 without group fixed effects. In doing so, we exploit the full distribution of IQ scores within our sample (but also lose control over other factors varying between groups). We report these estimations in Table A9 in the Appendix; there is no significant link between fluid IQ and any of our measures of successful performance when excluding group fixed effects.
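The size-weighted averaging of within-group variances can be sketched as follows (invented scores; the study’s exact variance estimator is not reported, so we use the sample variance here):

```python
from statistics import variance

def avg_within_variance(groups):
    """Average of within-group sample variances, weighted by group size.

    A value close to the full-sample variance indicates that variation
    is not 'used up' between groups, i.e., within-group heterogeneity
    remains substantial.
    """
    n = sum(len(g) for g in groups)
    return sum(len(g) * variance(g) for g in groups) / n

# Two invented classes of Raven's raw scores:
classes = [[12, 15, 17, 14], [10, 16, 15, 18, 13]]
print(round(avg_within_variance(classes), 2))  # 7.09
```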
Second, the lack of a significant relationship between fluid IQ and successful strategic interaction could be due to the limited statistical power of our study. However, comparing our results for first-round choices with findings from Burnham et al. (2009; they use a BCG with \(p = 1/2\) and choices between 0 and 100), we see that they report an effect of \(-9.67\) on first-round choices for a one standard deviation increase in cognitive skills (p. 173). In contrast, if we regress the number chosen in round 1 on gender, age, and fluid IQ (clustering standard errors at the group level), the 95% confidence interval for the effect of a one standard deviation increase in fluid IQ on the number chosen in round 1 ranges from \(-7.26\) to \(4.29\); this suggests that we can essentially rule out effects of fluid IQ of the size found by Burnham et al. (2009). In addition, our exploratory analysis of the link between perspective-taking abilities and successful performance in the BCG (see Sect. D in the Appendix) indicates that for other individual characteristics (even a binary one), our study seems to be sufficiently powered to identify statistically significant relationships.
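This benchmark check amounts to asking whether the competing effect size falls inside the estimated 95% confidence interval. A minimal sketch (the coefficient and standard error below are approximate values implied by the reported interval, not figures taken from the paper’s tables):

```python
def ci95(beta, se):
    """Two-sided 95% confidence interval from a point estimate and its SE."""
    return (beta - 1.96 * se, beta + 1.96 * se)

# Implied point estimate and (clustered) SE reproducing roughly the
# reported interval [-7.26, 4.29]:
lo, hi = ci95(-1.485, 2.946)
# Is the Burnham et al. (2009) effect of -9.67 inside the interval?
print(lo <= -9.67 <= hi)  # False: an effect of that size is ruled out
```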
Finally, we do not identify any significant relationship between age and successful strategic interaction. In principle, an increase in strategic interaction skills with age could be anticipated (e.g., Brosig-Koch et al. (2015) show that children’s ability to reason backward clearly improves from the age of 6 years onward, and Charness et al. (2019) show with a sample of children between 3 and 11 years old that Theory of Mind increases considerably with age). On the other hand, Czermak et al. (2016) find no substantial effects of age in strategy games and conclude that their results suggest that “strategic decision-making is fairly well developed at an age of 10 years and hardly changes in subsequent years” (p. 270). Potentially, this conclusion could already apply at the age of 9 years, the lower bound of the age range in our sample of children. However, our results regarding age must be interpreted with caution because (1) we study a rather small age range of only two years, (2) we only compare children within groups, who are even more homogeneous in age than the full sample (because groups were randomly drawn from the same class), and (3) our distribution of age is not continuous because we study a cohort of third and fifth graders (i.e., there are no fourth graders in the sample). Hence, identifying age effects in such a setting is challenging, and the absence of significant age differences in successful strategic interaction in our study should not be interpreted as evidence of the absence of development in strategic interaction skills within this age range.
Replication with an adult sample
When we tested the new design of the experimental BCG with our sample of children and compared it with the previous findings in the literature, we changed two factors simultaneously: the design of the game and the sample of participants. To provide the “missing piece”, we replicated our study using the new design of the experimental BCG but with an adult sample of university students.
Experimental Design. We recruited \(n=120\) participants, 60% of whom were female, with a mean age of 22.6 years (see Table A12 in the Appendix for details). The experiment was conducted in the MABELLA (Mainz Behavioral and Experimental Laboratory). The sessions were combined with another experiment but mirrored the basic structure of the study with children: first, all participants within a session (\(n=10\)) were seated at separate tables in a large room and filled out questionnaires and tests (including a short version of Raven’s Matrices for adults). Subsequently, two randomly assigned groups of five adults each went to a separate room with an experimenter to play the new design of the experimental BCG. The only difference from the study with children was that adults did not receive one-to-one instructions but were instructed as a group (also, they did not have to explain the game back to the experimenter). After the BCG, participants were paid anonymously in a separate room. We conducted 12 sessions with two groups each; sessions lasted 70–90 min in total. The average payoff was EUR 15.45, including a show-up fee of EUR 5. The experimental BCG was incentivized, with the winner of a randomly drawn round receiving EUR 20. The fluid IQ test was not incentivized.
Results. Figure 2 and Table 1 show that choices by adults start at a lower level than those by children and remain consistently lower in subsequent rounds (Mann–Whitney U tests for each of rounds 1–4, all \(p < .0001\)). More detailed information on adults’ choices can be found in Table A13 and Figure A3. Benchmarking adults’ choices against choices in the study by Nagel (1995) suggests that our adult sample playing the BCG in the new design behaves very similarly to the adult sample in Nagel’s study playing the classical BCG with \(p=1/2\). Comparing the numbers chosen in the first round for our adult sample and for the Nagel \(p=1/2\) sample reveals no significant difference (Mann–Whitney U test, \(p=.899\)). However, in rounds 2–4 the Mann–Whitney U tests are significant, suggesting that participants in the Nagel \(p=1/2\) sample choose lower numbers than those in our adult sample (Mann–Whitney U tests for each round, all \(p<.032\)), which is in line with the notion that median choices decrease somewhat faster in the classical BCG (see Table 1).Footnote 20
When comparing the distributions of depth of reasoning in rounds 1–4 in Table 2, the majority of adults display \(d=1\). If we compare the distributions of levels of d, we see that our adult sample differs significantly from the children sample in round 1 (\(\chi ^2 = 23.10, p < .001\)), with adults having more probability mass at higher levels of reasoning, as one would expect. In later rounds, however, the picture is mixed.Footnote 21 Comparing the depth of reasoning in our adult sample with the adult sample in Duffy and Nagel (1997), we find that for round 1 the distributions of d are not significantly different (\(\chi ^2 = 4.95, p = .422\)). In subsequent rounds, adults in our sample show lower levels of depth of reasoning than the adults in Duffy and Nagel (for rounds 2–4, all \(\chi ^2 > 9.82\), all \(p < .080\)). Overall, levels of d in our adult sample are slightly higher than in our children sample and slightly lower than in the adult sample of Duffy and Nagel (1997); the distribution for our adult sample playing the new design of the BCG thus lies between the children sample playing the new design and the adult sample of Duffy and Nagel (1997) playing the classical design.
Overall, the replication of the new design of the experimental BCG with an adult sample shows that this design can be used successfully to study strategic interaction with adults:
Result 3
Adults playing the new design of the experimental BCG behave in a way that is largely comparable with adult behavior in the classical BCG. The average numbers chosen and the rate of decrease are very similar to those chosen by other adult samples. The average depth of reasoning is higher than that in our children sample and slightly lower than that in the classical BCG.
Although this was not the focus of our replication study, we also conducted a parallel analysis of the link between cognitive skills, measured as fluid IQ, and successful performance for adults. Results can be found in Table A14. As in the children sample, successful performance is not related to fluid IQ. This indicates that the new design might indeed place lower demands on cognitive skills by making the game less abstract and easier to understand (see Sect. 3.3). In addition, we document a substantial gender difference in the adult sample: women perform significantly worse than men in terms of both the number of coins won and the average rank.Footnote 22 Taken together, the replication study using an adult student sample generally confirms that the new design of the experimental BCG can also be used with adults. Further investigation of the determinants of successful performance and gender differences in strategic interactions for adults, as well as of skills that have a causal effect on successful performance, appears to be a promising avenue for further research.