General anchoring effect
We asked 42 questions on six different topics over a month. Each question was answered by an average of 219 users, with an average of 58 participants in the treatment condition (i.e. Groups A and B) for each question. The full list of questions and hints is provided in Supplementary Table 1. The actual results (i.e. the correct answers) do not affect the experiment or analysis, as they became known and were revealed only after the experiment was over. The answer distributions for the questions shown in Table 1 are presented in Fig. 2. Yuen’s test for independent sample means with 15% trimming was applied to all questions to test whether the predictions of the two treatment groups differ from each other. The results indicate that the answer distributions are indeed significantly different (p < 0.01) for all questions except one, which is marginally significant (p = 0.05).
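For reference, a comparison of this kind can be sketched with SciPy, whose trimmed t-test option implements Yuen's test; the answer arrays below are hypothetical placeholders rather than the experimental data.

```python
# Minimal sketch of Yuen's test with 15% trimming for one question.
# Assumes SciPy >= 1.7, where the `trim` argument of ttest_ind performs
# a trimmed-means (Yuen's) t-test. The answers below are made up.
import numpy as np
from scipy import stats

answers_group_a = np.array([12.0, 14.5, 13.2, 15.1, 11.8, 14.0, 13.7])
answers_group_b = np.array([18.3, 17.1, 19.0, 16.8, 18.9, 17.5, 18.1])

t_stat, p_value = stats.ttest_ind(answers_group_a, answers_group_b,
                                  equal_var=False, trim=0.15)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```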
Eight questions providing irrelevant information in the first hint were included in the experiment as a check to determine whether control group participants blindly follow the values provided in the first hint without questioning their actual relevance for the target outcome (see the questions in Supplementary Table 1). If this were the case, it would be inferred that the cognitive effort invested by participants in this game is minimal and that the observed responses are not due to anchoring effects but to users’ laziness. The detailed results of this test are provided in Supplementary Figs. 1–2, but in short, it was confirmed that the players of the game generally pay attention to the relevance of the hint to the question and do not blindly follow the values in the hint.
For the 42 standard questions, the anchoring stimulus (Eq. (1)) and the anchoring response (Eq. (2)) were calculated; the results are shown in Fig. 3.
$$\text{Stimulus} = \frac{\left|\text{Hint Group A} - \text{Hint Group B}\right|}{\sigma_{\text{Control Group}}}$$
(1)
$$\text{Response} = \frac{\left|\text{Median answer Group A} - \text{Median answer Group B}\right|}{\sigma_{\text{Control Group}}},$$
(2)
where \(\text{Hint Group } X\) is the numerical value of the hint provided to group X and \(\sigma\) is the standard deviation of all the answers in the indicated group.
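As a minimal illustration, the two quantities can be computed as follows; the hint values and answer arrays are hypothetical, and the use of the sample standard deviation for σ is an assumption here.

```python
# Illustrative computation of Eqs. (1) and (2) for a single question,
# using made-up hints and answers (not the experimental data).
import numpy as np

hint_a, hint_b = 50.0, 80.0                       # anchors shown to Groups A and B
answers_a = np.array([52.0, 55.0, 49.0, 58.0])    # Group A predictions
answers_b = np.array([74.0, 78.0, 81.0, 70.0])    # Group B predictions
answers_control = np.array([60.0, 63.0, 55.0, 68.0, 59.0])

sigma_control = answers_control.std(ddof=1)       # sample SD (an assumption)

stimulus = abs(hint_a - hint_b) / sigma_control                               # Eq. (1)
response = abs(np.median(answers_a) - np.median(answers_b)) / sigma_control  # Eq. (2)
print(stimulus, response)
```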
As expected, a larger stimulus leads to a larger anchoring response; however, after an initial increase in the size of the induced anchoring effect, saturation appears.
A closer look at the examples shown in Fig. 2, however, reveals that in some cases the diversity of answers in the treatment groups is very small (left panel), whereas in other cases the answers are widely dispersed (right panel). This observation suggests that for some questions the provision of two anchors led to higher collective prediction certainty compared with the control group, while for other questions it introduced more doubt about the true value of the likely outcome among the group members.
This observation warrants a more systematic analysis of the ratio of the treatment groups’ standard deviation to the control group’s standard deviation as a function of the size of the provided stimulus. Figure 4 illustrates how the relative group diversity, that is, the diversity of answers within each treatment group divided by the diversity of predictions in the control group (Eq. (3)), changes with the group stimulus (Eq. (4)).
$$\text{Relative group diversity } X = \frac{\sigma_{\text{Group } X}}{\sigma_{\text{Control Group}}},\quad \text{where } X = \text{A, B}$$
(3)
$$\text{Group stimulus} = \frac{\left|\text{Hint Group } X - \text{Median answer Control Group}\right|}{\sigma_{\text{Control Group}}},\quad \text{where } X = \text{A, B}$$
(4)
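Continuing the hypothetical example above, the per-group quantities of Eqs. (3) and (4) would be computed as:

```python
# Sketch of Eqs. (3) and (4) for one treatment group X (hypothetical data).
import numpy as np

hint_x = 50.0
answers_x = np.array([52.0, 55.0, 49.0, 58.0])
answers_control = np.array([60.0, 63.0, 55.0, 68.0, 59.0])

sigma_control = answers_control.std(ddof=1)
relative_diversity_x = answers_x.std(ddof=1) / sigma_control                 # Eq. (3)
group_stimulus_x = abs(hint_x - np.median(answers_control)) / sigma_control  # Eq. (4)
```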
Smaller anchoring stimuli lead to smaller relative group diversity, which means that for these questions treatment group users were collectively more certain of their answers than the control group. As the size of the anchoring stimulus increases, the relative group diversity also increases.
Based on this observation, we define a new measure of the anchoring bias by normalising the difference between the median answers of the two treatment groups by the average of the standard deviations of the two treatment groups instead of the standard deviation of the control group (Eq. (5)).
$$\text{Modified response} = \frac{\left|\text{Median answer Group A} - \text{Median answer Group B}\right|}{\left( \frac{\sigma_{\text{Group A}} + \sigma_{\text{Group B}}}{2} \right)}$$
(5)
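A corresponding sketch of Eq. (5), again with hypothetical answer arrays, replaces the control-group normalisation with the treatment groups' average dispersion:

```python
# Sketch of the modified response, Eq. (5): the gap between the two treatment
# medians normalised by the treatment groups' own average spread (made-up data).
import numpy as np

answers_a = np.array([52.0, 55.0, 49.0, 58.0])
answers_b = np.array([74.0, 78.0, 81.0, 70.0])

mean_sigma = (answers_a.std(ddof=1) + answers_b.std(ddof=1)) / 2
modified_response = abs(np.median(answers_a) - np.median(answers_b)) / mean_sigma
```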
The resulting modified response accounts for the participants’ collective confidence in their predictions. A small average standard deviation of the two treatment groups is assumed to be indicative of higher certainty of answers among participants since users appear to be in collective agreement regarding the true value of the target. The modified response function is shown in Fig. 5.
Medium-sized stimuli (2 < x < 5) seem to have caused the majority of participants to believe that the anchors might be plausible, resulting in a larger modified response, i.e. a larger difference between the answers of the two groups combined with higher collective confidence within each group. High anchors induce more uncertainty among participants: not all users follow high anchoring stimuli; instead, a considerable proportion of participants starts adjusting their predictions towards less extreme values, thus increasing the diversity of answers.
“Expert-opinion” anchors
By replacing the factual information in the second hints with fictitious “Play The Future-prediction” values (which we carefully selected to resemble realistic predictions) in 12 additional questions, we examined how strongly these values impact participants’ predictions. The hints of these questions were presented as the prediction of a hypothetical member of the Play The Future team. An example is given in Table 2 and the remaining questions are listed in Supplementary Table 1.
Table 2 Example of an experiment question with fictitious hints

The upper panel of Fig. 6 shows the size of the anchoring effect versus the anchoring stimulus for the 42 standard and 12 PTF-prediction style questions (note that the data points belonging to the standard questions in Fig. 6 are identical to those in Figs. 3, 4 and 5 and are repeated here for comparison). It is immediately apparent that the questions containing PTF-prediction values in the treatment hints result in consistently larger responses than the standard questions with factual information. The middle panel of Fig. 6 shows that not only are the medians of the treatment groups’ distributions shifted further by the fictitious hints, but the relative group diversity also tends to be lower for PTF-prediction questions, reflecting less variation in the treatment groups’ answers compared to the control group’s predictions. Hence, in the modified response curve shown in the lower panel of Fig. 6, we see an even larger amplification of the anchoring effect emerging from the fictitious hints.
Individual analysis
To analyse the effect of the anchoring stimulus provided in the experiment on each participant’s predictions, the median bias for each user during the experiment was calculated. Firstly, the normalised difference between the user’s prediction and the control group’s median prediction was computed for each question (Eq. (6)).
$$\text{Bias per participant } i \text{ per question} = \frac{\left|\text{Prediction}_{i} - \text{Median answer Control Group}\right|}{\sigma_{\text{Control Group}}}$$
(6)
Next, the median bias was calculated per individual user for all questions answered in the control condition and for all questions answered in the treatment condition. A higher individual bias indicates that a certain user’s prediction values are further away from the median of the control group’s predictions, which may be the result of the influence of the anchoring stimuli.
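A possible implementation of this two-step procedure is sketched below; the DataFrame layout, column names and values are assumptions made purely for illustration.

```python
# Sketch of Eq. (6) and the per-user median bias, using a hypothetical layout:
# one row per (user, question) prediction plus per-question control statistics.
import pandas as pd

predictions = pd.DataFrame({
    "user":       ["u1", "u1", "u2", "u2"],
    "question":   ["q1", "q2", "q1", "q2"],
    "condition":  ["control", "treatment", "treatment", "control"],
    "prediction": [61.0, 75.0, 54.0, 64.0],
})
control_stats = pd.DataFrame({
    "question":       ["q1", "q2"],
    "control_median": [60.0, 70.0],
    "control_sigma":  [5.0, 8.0],
})

df = predictions.merge(control_stats, on="question")
df["bias"] = (df["prediction"] - df["control_median"]).abs() / df["control_sigma"]  # Eq. (6)

# Median bias per user, separately for control- and treatment-condition answers.
median_bias = df.groupby(["user", "condition"])["bias"].median()
print(median_bias)
```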
User engagement
It could be hypothesised that the anchoring effect is stronger among less experienced users. We therefore tested for a difference between participants who answered fewer than half of all experimental questions (“casual users”) and those who made predictions for more than half of the questions (“loyal users”).
Results are shown in Supplementary Fig. 3. Focussing on the control groups only, we observe that loyal users in the control group seem to make predictions that resemble the control group’s median predictions more closely than casual users in the control group do (Welch’s t-test: t = 6.986, p < 0.001). The highly engaged users may have concluded that making moderate rather than extreme predictions (if no further information in the form of a second hint is provided) constitutes a relatively successful strategy in this game.
However, among the treatment groups, barely any difference between casual and loyal participants can be detected (Welch’s unequal variances t-test: t = 1.448, p = 0.149). This implies that all users, regardless of their level of engagement with the app, are roughly equally susceptible to the provided anchoring stimuli. Thus, it is concluded that even among high-frequency players no ‘learning effect’ regarding the true purpose of this experiment occurred.
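The casual/loyal comparison could be reproduced along the following lines; the user-level arrays here are simulated placeholders, with the total question count following the 42 standard plus 12 PTF-prediction questions described above.

```python
# Sketch of the engagement split and Welch's unequal-variances t-test on
# users' median biases. All input data here are simulated placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_questions = 54                                   # 42 standard + 12 PTF-prediction
questions_answered = rng.integers(1, n_questions + 1, size=200)
median_bias = rng.gamma(shape=2.0, scale=0.5, size=200)   # placeholder biases

loyal = questions_answered > n_questions / 2       # "loyal" vs "casual" users
t_stat, p_value = stats.ttest_ind(median_bias[~loyal], median_bias[loyal],
                                  equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```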
User prior accuracy
Many of the experiment participants had already played the game before, and the records of their predictions were available to us. However, there were not enough prior data for each category of questions, so we had to calculate users’ prior accuracy from their average performance at the aggregate level rather than at a question-specific level; this “coarse-graining” could be rectified in future work. We divided the players into high-accuracy and low-accuracy groups based on their accuracy score in all the games they had played before our experiment and compared the induced bias for the two groups (see “Methods” section for details).
Considering only the control groups, participants who made less accurate predictions before the start of the experiment provided answers that were relatively close to the overall median answer during the experiment (Supplementary Fig. 4). Previously better-performing users made more distinct predictions, potentially because they put less trust in the information provided in the first hint. The results of Welch’s t-test indicate that the individual biases of the two groups in the control condition are slightly different from each other (t(139.11) = 1.933, p = 0.055); however, this result is statistically significant only at the 0.1 level.
For previously low-performing users in the treatment group, the individual bias seems to be slightly larger compared to the bias observed for previously well-performing users. This would imply that their answers were slightly more influenced by the anchoring stimuli compared to better-performing users. However, Welch’s t-test reveals that the visually observed difference is not statistically significant (t(97.99) = − 1.443, p = 0.15).
Gender
Finally, we analysed the users based on their gender and compared their cross-group errors (see Supplementary Fig. 5). In the control condition, Welch’s unequal variances t-test confirms that the difference between male and female users is indeed significant (t(349.86) = −2.412, p = 0.016): female users tend to make predictions that are closer to the overall median answer than male users. In the treatment condition, however, both the visual analysis and Welch’s t-test show no difference between male and female individual biases (t(194.92) = −0.929, p = 0.354). Thus, both sexes appear to be equally susceptible to the anchoring stimuli provided in the treatment condition.