Subjects Increase Exploration when Information can Subsequently be Exploited
In this exploration task we manipulated the number of apples to be picked on each trial (the decision horizon) to encourage exploration (Dubois et al., 2021). In the short horizon only one apple could be picked, whereas in the long horizon six different apples could be picked in sequence, which promotes initial exploration because gaining new information can improve later choices.
To assess whether a longer decision horizon promoted exploration in our task, we compared which bandit subjects chose in their first draw in the short and in the long horizon condition. For each trial we computed the familiarity (the mean number of initial samples shown) and the expected value (the mean value of initial samples shown) of each bandit. In the long horizon condition, subjects preferred less familiar bandits (horizon main effect: F(1, 94) = 5.824, p = 0.018, η2 = 0.058; age main effect: F(2, 94) = 0.306, p = 0.737, η2 = 0.006; age-by-horizon interaction: F(2, 94) = 0.836, p = 0.436, η2 = 0.017; Fig. 2a), even at the expense of a lower expected value (horizon main effect: F(1, 94) = 11.857, p = 0.001, η2 = 0.112; age main effect: F(2, 94) = 2.389, p = 0.097, η2 = 0.048; age-by-horizon interaction: F(2, 94) = 0.031, p = 0.969, η2 = 0.001; Fig. 2b). This was mainly driven by the fact that subjects selected the high-value bandit (i.e., the bandit with the highest expected reward based on the initial samples) less often in the long horizon (horizon main effect: F(1, 94) = 24.315, p < 0.001, η2 = 0.206; age main effect: F(2, 94) = 1.627, pcor = 0.808, punc = 0.202, η2 = 0.033; age-by-horizon interaction: F(2, 94) = 2.413, p = 0.095, η2 = 0.049; Fig. 4a; when adding IQ as a covariate: horizon main effect: F(1,94) = 24.017, p < 0.001, η2 = 0.204; age main effect: F(1,94) = 2.183, pcor = 0.429, punc = 0.143, η2 = 0.023; age-by-horizon interaction: F(1,94) = 2.462, p = 0.12, η2 = 0.026), demonstrating a reduction in exploitation when information can subsequently be used. This behaviour resulted in a lower initial reward (on the 1st sample) in the long compared with the short horizon (1st sample: horizon main effect: F(1, 94) = 13.874, p < 0.001, η2 = 0.129; age main effect: F(2, 94) = 1.752, p = 0.179, η2 = 0.036; age-by-horizon interaction: F(2, 94) = 1.167, p = 0.316, η2 = 0.024; Fig. 2c).
To evaluate whether subjects used the additional information in the long horizon condition beneficially, we compared the average reward obtained in the long horizon (across six draws) with that obtained in the short horizon (one draw). The average reward was higher in the long horizon (horizon main effect: F(1, 94) = 17.757, p < 0.001, η2 = 0.159; age main effect: F(2, 94) = 2.945, p = 0.057, η2 = 0.059; age-by-horizon interaction: F(2, 94) = 0.555, p = 0.576, η2 = 0.012; Fig. 2c), indicating that subjects tended to choose less optimal bandits at first but subsequently made use of the harvested information to guide a choice of better bandits in the long run. This was also the case when we considered the long horizon exclusively and compared the increase in reward (difference between the obtained reward and the highest shown reward) between trials where subjects started with an exploitative choice (chose the bandit with the highest expected value) and trials where they started with an exploratory one. Exploration decreased their reward at first (long horizon 1st choice: exploration main effect: F(1, 94) = 39.386, p < 0.001, η2 = 0.295; age main effect: F(2, 94) = 0.443, p = 0.643, η2 = 0.009; age-by-exploration interaction: F(2, 94) = 0.433, p = 0.650, η2 = 0.009; Fig. 2d), but eventually increased it (long horizon 6th choice: exploration main effect: F(1, 94) = 63.830, p < 0.001, η2 = 0.404; age main effect: F(2, 94) = 1.820, p = 0.168, η2 = 0.037; age-by-exploration interaction: F(2, 94) = 0.753, p = 0.474, η2 = 0.016; Fig. 2d), indicating that they were able to take advantage of the information gained through exploration.
Subjects Explore Using Computationally Expensive Strategies and Simple Heuristics
To determine which exploration strategies subjects use, we compared 12 models (cf. Supplementary Materials) using K-fold cross-validation. Essentially, each subject's data are partitioned into K folds (i.e., subsamples). Each model is fitted to K-1 folds and validated on the remaining fold (i.e., held-out data). This process is repeated K times so that each of the K folds is used as a validation set once. The model with the highest average likelihood of held-out data is then selected as the winning model. During model selection, we compared a UCB model (directed exploration and value-based random exploration), a Thompson model (uncertainty-driven value-based exploration), a hybrid of both, and combinations of those with an ϵ-greedy component (value-free random exploration) and/or a novelty bonus (novelty exploration). These models make different predictions about how an agent explores and makes the first draw in each trial. Using Thompson sampling (Gershman, 2018; Thompson, 1933; captured by the Thompson model), the agent takes both expected value and uncertainty into account, with higher uncertainty leading to more exploration (uncertainty-driven value-based exploration). Using the UCB algorithm (Auer, 2003; Gershman, 2018; part of the UCB model), the agent also takes both into account but chooses the bandit with the highest (additive) combination of expected information gain and reward value (directed exploration). This computation is then passed through a softmax decision function, inducing so-called value-based random exploration. The novelty bonus is a simplified version of the information bonus in UCB, which only applies to entirely novel options (novelty exploration). Using ϵ-greedy, a bandit is chosen entirely at random, irrespective of expected values and uncertainties (i.e., value-free random exploration).
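As an illustration of how these components combine at the first draw, the following sketch implements a hypothetical UCB + ϵ-greedy + novelty-bonus choice rule. It is a minimal illustration, not the fitted model: the function name and all parameter values (beta, gamma, eta, eps) are assumptions chosen for demonstration.

```python
import math

def first_draw_probs(means, stds, novel, beta=0.5, gamma=2.0, eta=1.0, eps=0.1):
    # UCB score: expected value plus an uncertainty bonus (directed
    # exploration), plus a novelty bonus for entirely novel bandits
    ucb = [m + gamma * s + eta * n for m, s, n in zip(means, stds, novel)]
    # Softmax over UCB scores induces value-based random exploration
    mx = max(ucb)
    weights = [math.exp(beta * (u - mx)) for u in ucb]
    total = sum(weights)
    soft = [w / total for w in weights]
    # epsilon-greedy mixture: with probability eps the choice is uniform,
    # irrespective of value (value-free random exploration)
    k = len(means)
    return [(1 - eps) * p + eps / k for p in soft]

# Three bandits: bandit 0 is novel and uncertain, bandit 1 is familiar
# and high-value, bandit 2 is intermediate
probs = first_draw_probs(means=[4.0, 6.0, 5.0],
                         stds=[2.0, 0.5, 1.5],
                         novel=[1, 0, 0])
```

Note how the uncertainty and novelty bonuses can make a lower-valued but less familiar bandit the most attractive first choice, while the ϵ term guarantees every bandit a minimum choice probability of ϵ/K.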
Similarly to previous studies in adults (Dubois et al., 2021; Dubois & Hauser, 2021), we found that subjects used a mixture of computationally demanding strategies (i.e., Thompson sampling or UCB) and two heuristic exploration strategies (i.e., ϵ-greedy and the novelty bonus), as captured by the model comparison (paired-samples t-test: 1st model: Thompson+ϵ+η vs. 2nd model: UCB+ϵ+η: t(96) = 1.804, p = 0.074, d = 0.183; 1st model: Thompson+ϵ+η vs. 3rd model: Thompson+ϵ: t(96) = 2.52, p = 0.013, d = 0.256; Thompson+ϵ+η vs. Thompson: t(96) = 6.687, p < 0.01, d = 0.679; Fig. 3a). The winning model was confirmed by Bayesian Model Selection (Fig. 3b; cf. Supplementary Materials for more details). Simulations revealed that the winning model’s parameter estimates could be accurately recovered (Fig. 3c).
Value-Free Random Exploration Decreases in Late Adolescents
Value-free random exploration (captured by ϵ-greedy) predicts that, a proportion ϵ of the time, each option has an equal probability of being chosen. Under this regime, in contrast to other exploration strategies, even bandits with a known low value are regularly chosen. To assess the deployment of this exploration form across horizons, we investigated its behavioural signature, the frequency of selecting the low-value bandit, and found that it was higher in the long compared with the short horizon condition (horizon main effect: F(1, 94) = 8.837, p = 0.004, η2 = 0.086; Fig. 4b). This was also captured more formally by analysing the fitted ϵ parameter, which was larger in the long compared to the short horizon (horizon main effect: F(1, 94) = 20.63, p < 0.001, η2 = 0.180; Fig. 5a). These results indicate that subjects made use of value-free random exploration in a goal-directed way, deploying it more when it was beneficial.
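This behavioural signature is easy to see in a toy simulation: under ϵ-greedy, even a bandit with a known low value is chosen a fixed fraction of the time, roughly ϵ/K for K bandits. The bandit values and ϵ below are illustrative assumptions, not fitted estimates.

```python
import random

random.seed(1)  # for a reproducible simulation

def epsilon_greedy_choice(values, eps):
    # With probability eps, pick uniformly at random (value-free random
    # exploration); otherwise exploit the highest-value bandit
    if random.random() < eps:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])

values = [2.0, 5.0, 8.0]  # bandit 0 has a known low value
picks = [epsilon_greedy_choice(values, eps=0.3) for _ in range(10_000)]
low_value_freq = picks.count(0) / len(picks)  # close to eps / 3, i.e. ~0.10
```

A purely value-based strategy (softmax or UCB) would select the low-value bandit far less often, which is why its pick frequency serves as a model-free marker of ϵ-greedy.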
Next, we investigated our hypothesis that the age groups differed in their use of value-free random exploration. We thus looked at two measures of value-free random exploration: the frequency of selecting the low-value bandit and, more formally, the ϵ-greedy parameter. We found that age groups differed in the frequency of selecting the low-value bandit (age main effect: F(2, 94) = 4.927, p = 0.009, η2 = 0.095; age-by-horizon interaction: F(2, 94) = 0.236, p = 0.790, η2 = 0.005; Fig. 4b). This was also the case when controlling for IQ (adding IQ as a covariate: age main effect: F(1,94) = 4.467, p = 0.037, η2 = 0.045; age-by-horizon interaction: F(1,94) = 0.019, p = 0.89, η2 < 0.001). Interestingly, we found that the effect was primarily driven by a reduction of selecting the low-value bandit in late adolescents, compared with early adolescents and children (children vs. late adolescents: t(52) = 2.842, pcor = 0.015, punc = 0.005, d = 0.54; early vs. late adolescents: t(76) = 3.842, pcor = 0.001, punc < 0.001, d = 0.634), whilst children and early adolescents did not differ (t(52) = −0.648, pcor = 1, punc = 0.518, d = 0.115). This suggests that a reduction in use of the value-free random exploration heuristic occurs only later in adolescent development.
The same effect was observed when analysing the fitted ϵ parameter from the winning computational model (age main effect: F(2, 94) = 3.702, p = 0.028, η2 = 0.073; age-by-horizon interaction: F(2, 94) = 0.807, p = 0.449, η2 = 0.017; Fig. 5a). This was also the case when controlling for IQ (adding IQ as a covariate: F(1,94) = 5.583, p = 0.02, η2 = 0.056; age-by-horizon interaction: F(1,94) = 0.119, p = 0.73, η2 = 0.001). Again, this was driven by a reduced ϵ in the late adolescents compared to the younger groups (children vs. late adolescents: t(52) = 3.229, pcor = 0.006, punc = 0.002, d = 0.622; early vs. late adolescents: t(76) = 2.982, pcor = 0.009, punc = 0.003, d = 0.491; children vs. early adolescents: t(52) = 0.581, pcor = 1, punc = 0.562, d = 0.105). Our findings thus suggest that, compared with late adolescents, children and early adolescents rely more strongly on the computationally simple value-free random exploration strategy.
No Observed Age Effect on Other Exploration Strategies
Next, we investigated whether the other exploration strategies also showed age differences, or whether value-free random exploration was the primary driver. When looking at the novelty heuristic (i.e., the tendency to select novel options), we did not observe any difference, neither in the frequency of selecting the novel bandit (age main effect: F(2, 94) = 0.341, pcor = 1, punc = 0.712, η2 = 0.007; horizon main effect: F(1, 94) = 1.534, p = 0.219, η2 = 0.016; age-by-horizon interaction: F(2, 94) = 1.522, p = 0.224, η2 = 0.031; adding IQ as a covariate: age main effect: F(1,94) = 0.014, pcor = 1, punc = 0.905, η2 < 0.001; age-by-horizon interaction: F(1,94) = 2.227, p = 0.139, η2 = 0.023; Fig. 4c), nor, more formally, in the fitted novelty bonus η (age main effect: F(2, 94) = 0.341, pcor = 1, punc = 0.712, η2 = 0.007; age-by-horizon interaction: F(2, 94) = 2.119, p = 0.126, η2 = 0.043; horizon main effect: F(1, 94) = 1.892, p = 0.172, η2 = 0.020; adding IQ as a covariate: age main effect: F(1,94) = 0.406, pcor = 1, punc = 0.526, η2 = 0.004; age-by-horizon interaction: F(1,94) = 3.372, p = 0.069, η2 = 0.035; Fig. 5b).
Next, we assessed whether there were age differences in the indicator of complex exploration strategies. We thus compared the model-derived prior variance (or uncertainty) σ0, which is used for the computation of the uncertainty about the expected value of each bandit (Dubois et al., 2021; Gershman, 2018). Essentially, σ0 is the uncertainty about the reward that subjects expect to get from a bandit before integrating its initial samples. We did not observe any difference in the prior variance σ0 (i.e., uncertainty; age main effect: F(2, 94) = 3.241, pcor = 0.132, punc = 0.044, η2 = 0.065; age-by-horizon interaction: F(2, 94) = 0.866, p = 0.424, η2 = 0.018; horizon main effect: F(1, 94) = 1.576, p = 0.212, η2 = 0.016; adding IQ as a covariate: age main effect: F(1,94) = 0.014, pcor = 1, punc = 0.905, η2 < 0.001; age-by-horizon interaction: F(1,94) = 2.227, p = 0.139, η2 = 0.023; Fig. 5c). We were thus not able to reliably identify any other exploration strategy that changed over these developmental stages.
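As a sketch of how σ0 enters such models: each bandit's expected reward starts from a prior whose spread is σ0, which the initial samples then update via a conjugate normal (Kalman-style) rule. For simplicity the sketch treats σ0 as a standard deviation; the prior mean and observation noise are illustrative assumptions, not quantities reported here.

```python
def posterior(prior_mean, sigma0, samples, sigma_obs=1.0):
    # Conjugate normal update: the precisions (inverse variances) of the
    # prior and of the observed samples add up
    post_prec = 1.0 / sigma0 ** 2 + len(samples) / sigma_obs ** 2
    post_var = 1.0 / post_prec
    post_mean = post_var * (prior_mean / sigma0 ** 2
                            + sum(samples) / sigma_obs ** 2)
    return post_mean, post_var ** 0.5

# With a wide prior (large sigma0), the initial samples dominate the
# posterior mean, and uncertainty shrinks as samples accumulate
mean, sd = posterior(prior_mean=5.0, sigma0=10.0, samples=[4.0, 6.0])
```

A larger fitted σ0 thus implies that a subject treats bandits as more uncertain before sampling them, which in turn inflates the uncertainty term used by Thompson sampling and UCB.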
Value-Free Random Exploration is Linked to ADHD Symptoms
Developmental effects on exploration strategies are also important for understanding the neurocognitive processes underlying developmental psychiatric disorders, such as ADHD, which has been suggested to be linked to excessive exploratory behaviour (Hauser et al., 2014; Hauser et al., 2016). Previous work has shown that value-free random exploration is a “cheap” exploration strategy modulated by noradrenaline (Dubois et al., 2021), a neurotransmitter known to be critically involved in the pathogenesis and treatment of ADHD (Arnsten & Pliszka, 2011; Berridge & Devilbiss, 2011; Del Campo et al., 2011; Frank et al., 2007; Hauser et al., 2016; Luman et al., 2010). Given that value-free random exploration stands out by its low computational demand, we hypothesized that ADHD symptoms in our population sample would be primarily linked to an over-reliance on this exploration heuristic.
We thus tested whether the amount of value-free random exploration was linked to ADHD scores as measured using Conners 3 self-reports (Conners, 2008). We found that ADHD symptoms were significantly associated with value-free random exploration, both as captured by the model parameter ϵ (bivariate Pearson correlation: r = 0.259, p = 0.011; Fig. 6a) and as indicated by the low-value bandit picking frequency (r = 0.259, p = 0.01). The effect remained significant when additionally controlling for age and IQ (partial correlation with ϵ: r = 0.212, p = 0.039; with low-value bandit picking: r = 0.214, p = 0.037).
To further investigate this and to assess potential clinical implications, we split the sample, comparing subjects who scored above the clinical cutoff of T ≥ 70 (Conners, 2008) (N = 15) with those scoring below (N = 82). In line with the above correlation, we found that subjects with a highly elevated ADHD score relied more heavily on value-free random exploration (model parameter ϵ: main effect of ADHD score: F(1,95) = 7.243, p = 0.008, η2 = 0.071).
We next investigated whether this greater reliance on value-free random exploration was deployed in a goal-directed manner, i.e., when exploration was useful in the long horizon. Interestingly, the high ADHD group indeed deployed this exploration heuristic primarily when it was useful, i.e., in the long horizon (score-by-horizon interaction: F(1,95) = 4.643, p = 0.034, η2 = 0.047; pairwise comparisons: long horizon: t(82) = −3.655, pcor = 0.002, punc = 0.001; short horizon: t(82) = −1.355, pcor = 0.386, punc = 0.193; main effect of horizon: F(1,95) = 22.926, p < 0.001, η2 = 0.194; Fig. 6b).
We then assessed whether this increase in exploration was beneficial or detrimental to performance by comparing the points earned by the two groups. The high ADHD group scored fewer points than the low ADHD group in the long horizon (total score in the long horizon: t(82) = 2.221, p = 0.040), but not in the short horizon (total score in the short horizon: t(82) = 1.569, p = 0.136), where both groups deployed the exploration heuristic to a similar degree. This suggests that subjects scoring high on ADHD “overshot” in deploying value-free random exploration, leading to worse performance precisely in the condition where greater exploration generally yields better performance.
Lastly, to test the specificity of this association, we examined whether other model parameters were correlated with ADHD symptoms. We did not find any association between ADHD symptoms and any of the other exploration strategies (with novelty bonus η: r = −0.113, pcor = 1, punc = 0.269; with prior variance σ0: r = 0.01, pcor = 1, punc = 0.923), suggesting that value-free random exploration is the most relevant exploration factor for ADHD symptoms.