Pure correlates of exploration and exploitation in the human brain
Balancing exploration and exploitation is a fundamental problem in reinforcement learning. Previous neuroimaging studies of the exploration–exploitation dilemma could not completely disentangle these two processes, making it difficult to unambiguously identify their neural signatures. We overcome this problem using a task in which subjects can either observe (pure exploration) or bet (pure exploitation). Insula and dorsal anterior cingulate cortex showed significantly greater activity on observe trials compared to bet trials, suggesting that these regions play a role in driving exploration. A model-based analysis of task performance suggested that subjects chose to observe until a critical evidence threshold was reached. We observed a neural signature of this evidence accumulation process in the ventromedial prefrontal cortex. These findings support theories positing an important role for anterior cingulate cortex in exploration, while also providing a new perspective on the roles of insula and ventromedial prefrontal cortex.
Keywords: reinforcement learning, fMRI, decision making
Many decision problems pose a fundamental dilemma between exploration and exploitation: An agent can exploit the option that has yielded the greatest reward in the past or explore other options that may yield greater reward, at the risk of foregoing some reward during exploration. The optimal solution to the exploration–exploitation dilemma is generally intractable, and hence resource-bounded agents must apply heuristic strategies (Cohen, McClure & Yu, 2007). The specific strategy used by humans is an open question.
Some evidence suggests that humans adopt exploration strategies that sample options with probability proportional to their estimated expected values (Daw, O’Doherty, Dayan, Seymour, & Dolan, 2006) or their posterior probability of having the maximum value (Speekenbrink & Konstantinidis, 2015). Other studies suggest that humans employ an uncertainty-driven exploration strategy based on an explicit exploration bonus (Badre, Doll, Long, & Frank, 2012; Frank, Doll, Oas-Terpstra, & Moreno, 2009). Humans also sometimes employ more sophisticated exploration strategies using model-based reasoning (Knox, Otto, Stone, & Love, 2012; Otto, Knox, Markman, & Love, 2014; Wilson, Geana, White, Ludvig, & Cohen, 2014; Gershman & Niv, 2015).
Neural data can potentially constrain the theories of exploration by identifying dissociable correlates of different strategies. For example, Daw et al. (2006) identified a region of frontopolar cortex that was significantly more active for putative exploratory choices compared to putative exploitative choices during a multiarmed bandit task (see also Boorman, Behrens, Woolrich, & Rushworth, 2009). Suppression of activity in this region, using transcranial direct current stimulation, reduces exploration, whereas amplifying activity increases exploration (Beharelle, Polania, Hare, & Ruff, 2015). These findings suggest that there may exist a dedicated neural mechanism for driving exploratory choice, analogous to regions in other species that have been found to inject stochasticity into songbird learning (Olveczky, Andalman, & Fee, 2005; Woolley, Rajan, Joshua, & Doupe, 2014) and rodent motor control (Santos, Oliveira, Jin, & Costa, 2015).
The main challenge in interpreting these studies is that exploratory and exploitative choices cannot be identified unambiguously in standard reinforcement learning tasks, such as multiarmed bandits. When participants fail to choose the value-maximizing option, it is impossible to know whether this choice is due to exploration or to random error (i.e., unexplained variance in choice behavior not captured by the model). The same ambiguity muddies the interpretation of individual differences in parameters governing exploration strategies (e.g., the temperature parameter in the softmax policy). Furthermore, exploitative choices yield information, whereas exploratory choices yield reward, obscuring the conceptual difference between these trial types. Finally, identifying deviations from value maximization depends on inferences about subjective value estimates, which in turn depend on assumptions about the exploration strategy. Thus, there is no theory-neutral way to contrast neural activity underlying exploration and exploitation in most reinforcement learning tasks.
We resolve this problem by using an “observe or bet” task that unambiguously separates exploratory and exploitative choice (Navarro, Newell, & Schulze, 2016; Tversky & Edwards, 1966). On each trial, the subject chooses either to observe the reward outcome of each option (without receiving any of the gains or losses) or to bet on one option, in which case she receives the gain or loss associated with the option at the end of the task. By comparing neural activity on observe and bet trials, we obtain pure correlates of exploration and exploitation, respectively. This also allows us to look at neural responses to the receipt of information without it being confounded with the receipt of reward. To gain further insight into the underlying mechanisms, we use the computational model recently developed by Navarro et al. (2016) to generate model-based regressors. In particular, we identify regions tracking the subject’s change in belief about the hidden state of the world, which in turn governs the subject’s exploration strategy.
It is important to clarify at the outset that the correlates we identify are “pure” only in the sense that exploratory observe trials do not involve value-based choice or reward receipt, while exploitative bet trials do not involve information acquisition. This is not, of course, a complete catalogue of cognitive processes involved in task performance, and both trial types surely involve a number of common processes (e.g., visual perception, memory retrieval, motor control). Our goal in this study is to isolate a subset of these processes that are central to theories of reinforcement learning.
Materials and method
We recruited 18 members of the Harvard community, through the Harvard Psychology Study Pool, to participate in the study. Eleven of the 18 subjects were female. Ages ranged from 21 to 36 years, with a median age of 26 years. All subjects were right-handed, native English speakers, and had no history of neurological or psychiatric disease. Informed consent was obtained from all subjects.
Subjects performed the task in two sessions. In the first session, subjects were familiarized with the task and performed five blocks outside of the fMRI scanner. In the second session, subjects performed two blocks of the task out of the scanner, and an additional four to five (depending on time constraints) in the scanner. Subjects were paid $10 for the first session and $35 for the second. They also received a bonus in the form of an Amazon gift card, at an amount of $0.10 per point earned in the task.
Subjects performed a dynamic version of the “observe or bet” task (Tversky & Edwards, 1966; Navarro et al., 2016). In this task, subjects were asked to predict which of two lights (red or blue) will light up on a machine. On each trial, a single light is activated. The machine always has a bias—on a particular block, it either will tend to light up the blue or the red light. On each trial, subjects could take one of three actions: bet blue, bet red, or observe. If the subjects bet blue or red, they gained a point if they correctly predicted which light would light up but lost one if they were incorrect. Importantly, they were not told if they gained or lost a point, and they also did not see what light actually lit up. Instead, subjects could only see which light was activated by taking the observe action. Observing did not cost any points, but subjects relinquished their opportunity to place a bet on that trial. Thus, subjects were compelled to choose between gaining information about the current bias (by observing) or using the information they had gathered up to that point to obtain points (by betting).
Each block consisted of 50 trials. On each block, the machine was randomly set to have a blue or a red bias. The biased color caused the corresponding light to be active on 80% of the trials. There was also a 5% chance that the bias would change during the block. This change was not signaled to the subject in any way and could only be detected through taking “observe” actions.
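The block structure described above can be sketched as a short simulation. This is a minimal illustration rather than the task code used in the study: the function name `simulate_block` is hypothetical, and the 5% chance of a bias change is treated here as a per-trial hazard, which is only one possible reading of the description.

```python
import random

def simulate_block(n_trials=50, p_bias=0.8, p_change=0.05, seed=None):
    """Return (lights, biases): the light shown and the hidden bias on each trial.

    Assumption: the bias may flip with probability p_change before every
    trial after the first; the text leaves the exact schedule open.
    """
    rng = random.Random(seed)
    bias = rng.choice(["blue", "red"])  # bias is set at random for the block
    lights, biases = [], []
    for t in range(n_trials):
        if t > 0 and rng.random() < p_change:
            bias = "red" if bias == "blue" else "blue"  # unsignaled change
        other = "red" if bias == "blue" else "blue"
        # The biased color lights up on a fraction p_bias of trials.
        light = bias if rng.random() < p_bias else other
        lights.append(light)
        biases.append(bias)
    return lights, biases
```

A subject only ever sees an element of `lights` on trials where the observe action is taken; on bet trials both lists stay hidden.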
To understand performance in our task mechanistically, we fit a computational model to the choice behavior, created to qualitatively match the features of the optimal decision strategy and shown to best fit subject behavior out of four candidate process models (Navarro et al., 2016). Central to the model is an evidence tally that starts with a value of zero. Positive evidence reflects evidence that the bias is blue, and negative reflects evidence that the bias is red. Thus, low absolute numbers reflect a state of uncertainty about the bias. Each time an observation is made, the evidence value changes by +1 if blue is observed and −1 if red is observed.
The relevance of old observations diminishes over time, modeled using an evidence decay parameter, α. The evidence decay parameter dictates what proportion of evidentiary value is lost on each trial. Thus, the evidence tally value is calculated as follows:
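A form of the update that is consistent with the verbal description can be written as follows. This is a reconstruction offered under stated assumptions: the symbols e_t (the tally after trial t) and x_t (the observation on trial t) are chosen here for illustration, and x_t is taken to be zero on bet trials, when nothing is observed.

```latex
e_t = (1 - \alpha)\, e_{t-1} + x_t, \qquad e_0 = 0, \qquad
x_t =
\begin{cases}
+1 & \text{if blue is observed} \\
-1 & \text{if red is observed} \\
\phantom{+}0 & \text{if the subject bets (no observation)}
\end{cases}
```

Under this form, a proportion α of the existing tally is lost on every trial, so on a bet trial the absolute value of the tally can only shrink.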
The other main component of the model is a decision threshold. The threshold is a value at which the learner will switch from observing to betting. In the model used here (the best fitting model reported in Navarro et al., 2016), the decision threshold follows a piece-wise linear structure across trials: it remains constant until a specific trial, at which point it changes at a constant rate until the final trial. The initial threshold, the trial at which the threshold begins changing (the change point), and the terminal value of the threshold are all parameters fit to the data.
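The piecewise-linear threshold described above can be sketched as a small function. The name `decision_threshold`, the argument names, and the convention of numbering trials from 1 to `n_trials` are assumptions made for this illustration; they are not taken from the fitted Stan model.

```python
def decision_threshold(trial, initial, change_point, terminal, n_trials=50):
    """Piecewise-linear threshold: flat until change_point, then linear to terminal.

    Trials are numbered 1..n_trials (an indexing choice made for this sketch).
    """
    if trial <= change_point or change_point >= n_trials:
        return float(initial)
    # Fraction of the way from the change point to the final trial.
    frac = (trial - change_point) / (n_trials - change_point)
    return initial + frac * (terminal - initial)
```

For example, with an initial threshold of 3, a change point at trial 20, and a terminal value of 1, the threshold stays at 3 through trial 20 and falls linearly to 1 by trial 50.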
Finally, because decision-makers are noisy, we also include a response stochasticity parameter, σ. Assuming a normally distributed noise term for each trial, nt, with a zero mean and a standard deviation of σ, the probability of betting blue is then:
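One way to make the noisy threshold rule concrete is the sketch below. It assumes that the subject bets blue when the noisy evidence exceeds the threshold, bets red when it falls below the negative of the threshold, and observes otherwise; this decision rule, the function names, and the requirement that the threshold be nonnegative are assumptions for illustration, not a transcription of the original equation.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def choice_probabilities(evidence, threshold, sigma):
    """Choice probabilities under Gaussian response noise with s.d. sigma.

    Assumed rule: bet blue if evidence + noise > threshold,
    bet red if evidence + noise < -threshold, otherwise observe.
    Requires threshold >= 0 and sigma > 0.
    """
    p_blue = normal_cdf((evidence - threshold) / sigma)
    p_red = normal_cdf((-evidence - threshold) / sigma)
    p_observe = 1.0 - p_blue - p_red
    return {"bet_blue": p_blue, "bet_red": p_red, "observe": p_observe}
```

With zero evidence the two bets are equally likely, and as the threshold declines toward zero the probability of observing shrinks, which matches the qualitative pattern of betting late in a block.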
For the evidence decay parameter:
We implemented the model in Stan (Stan Development Team, 2016) and used Markov chain Monte Carlo sampling to approximate the posterior distribution over parameters. For the fMRI analysis, we used the posterior median parameter values for each subject to generate model-based regressors.
Neuroimaging data were collected using a 3 Tesla Siemens Magnetom Prisma MRI scanner (Siemens Healthcare, Erlangen, Germany) with the vendor’s 32-channel head coil. Anatomical images were collected with a T1-weighted magnetization-prepared rapid gradient multiecho sequence (MEMPRAGE, 176 sagittal slices, TR = 2530 ms, TEs = 1.64, 3.50, 5.36, and 7.22 ms, flip angle = 7°, 1 mm³ voxels, FOV = 256 mm). All blood-oxygen-level-dependent (BOLD) data were collected via a T2*-weighted echo-planar imaging (EPI) pulse sequence that employed multiband RF pulses and Simultaneous Multi-Slice (SMS) acquisition (Feinberg et al., 2010; Moeller et al., 2010; Xu et al., 2013). For the task runs, the EPI parameters were 69 interleaved axial-oblique slices (25 degrees toward coronal from ACPC alignment, TR = 2000 ms, TE = 35 ms, flip angle = 80°, 2.2 mm³ voxels, FOV = 207 mm, SMS = 3). The SMS-EPI acquisitions used the CMRR-MB pulse sequence from the University of Minnesota.
fMRI preprocessing and analysis
Data preprocessing and statistical analyses were performed using SPM12 (Wellcome Department of Imaging Neuroscience, London, UK). Functional (EPI) image volumes were realigned to correct for small movements occurring between scans. This process generated an aligned set of images and a mean image per subject. Each participant’s T1-weighted structural MRI was then coregistered to the mean of the realigned images and segmented to separate out the gray matter, which was normalized to the gray matter in a template image based on the Montreal Neurological Institute (MNI) reference brain. Using the parameters from this normalization process, the functional images were normalized to the MNI template (resampled voxel size: 2-mm isotropic) and smoothed with an 8-mm full-width at half-maximum Gaussian kernel. A high-pass filter of 1/128 Hz was used to remove low-frequency noise, and an AR(1) (first-order autoregressive) model was used to correct for temporal autocorrelations.
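For readers who want to relate the smoothing kernel to the standard deviation that most Gaussian filter routines expect, the standard relation is sigma = FWHM / (2 sqrt(2 ln 2)). The helper below is an illustrative sketch; `fwhm_to_sigma` is a hypothetical name and is not part of SPM12.

```python
import math

def fwhm_to_sigma(fwhm):
    """Convert a Gaussian full width at half maximum to a standard deviation."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))

sigma_mm = fwhm_to_sigma(8.0)  # 8-mm kernel from the text
sigma_vox = sigma_mm / 2.0     # the same width in 2-mm resampled voxels
```

The 8-mm kernel therefore corresponds to a standard deviation of roughly 3.4 mm, or about 1.7 voxels at the 2-mm resampled resolution.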
We designed a general linear model to analyze BOLD responses. This model included an event for observe decisions and another for bet decisions, time locked to the beginning of the decision period. We also included an event for the onset of feedback (either the observation of which light turned on, or just a visual of the machine with the bet that was made). For the onset of feedback, we included a parametric modulator that was the change in the absolute value of the evidence tally resulting from the observed outcome. Thus, this value would be negative and due entirely to evidence decay on a bet trial, and could be positive or negative on an observation trial depending on whether the observation provided more evidence in favor of betting or observing. Events were modeled with a 1-s duration.
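The parametric modulator can be illustrated with a short sketch that computes the trial-by-trial change in the absolute value of the evidence tally. It assumes a decaying update in which the tally is multiplied by (1 − α) and then incremented by the observation, with observations coded as +1 (blue), −1 (red), and 0 (bet trial); that coding and the name `update_regressor` are assumptions for illustration, not the authors' analysis code.

```python
def update_regressor(observations, alpha):
    """Per-trial change in the absolute value of the evidence tally.

    observations: +1 (blue observed), -1 (red observed), 0 (bet trial).
    alpha: proportion of the tally lost to decay on each trial.
    """
    tally, changes = 0.0, []
    for x in observations:
        new_tally = (1.0 - alpha) * tally + x  # assumed decaying update
        changes.append(abs(new_tally) - abs(tally))
        tally = new_tally
    return changes

# Two blue observations, a bet, then a red observation, with alpha = 0.5.
example = update_regressor([1, 1, 0, -1], alpha=0.5)
```

In this example the value on the bet trial is negative (pure decay), and the red observation that contradicts the accumulated blue evidence also lowers the absolute tally.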
Regions of interest
Regions of interest (ROIs) were constructed by combining structural ROIs with previously defined functional ROIs. Specifically, to define anatomically constrained value-based ROIs, we found the overlap between the structural ROIs from Tzourio-Mazoyer et al. (2002) and the value-sensitive functional ROIs from Bartra, McGuire, and Kable (2013). We also took the specific vmPFC and striatum ROIs from Bartra et al. (2013). For frontopolar cortex, we constructed a spherical ROI with a radius of 10 voxels, centered at the peak of activation reported by Daw et al. (2006). Similarly, for rostrolateral prefrontal cortex, the spherical ROI (10-voxel radius) was constructed using the coordinates given in Badre et al. (2012).
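The construction of a spherical ROI can be sketched in voxel-index space as below. This is a minimal illustration: the name `sphere_voxels` is hypothetical, the center is assumed to be already expressed as voxel indices (converting from MNI millimeter coordinates requires the image affine and is not shown), and clipping to the image bounds is left out.

```python
def sphere_voxels(center, radius):
    """Voxel indices (i, j, k) within `radius` voxels of `center` (Euclidean)."""
    ci, cj, ck = center
    r = int(radius)
    voxels = []
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            for dk in range(-r, r + 1):
                # Keep offsets whose squared distance is within the sphere.
                if di * di + dj * dj + dk * dk <= radius * radius:
                    voxels.append((ci + di, cj + dj, ck + dk))
    return voxels
```

The ROI signal for a subject would then be the average of the contrast values over the returned voxel indices.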
Code and data availability
Code and behavioral data are available on GitHub (https://github.com/TommyBlanchard/ObserveBet). The brain imaging data are available upon request.
Subjects should also gradually reduce the probability of observing over the course of a block. This is because they start with no information about the outcome probability and thus must start by accumulating some information, but this tendency to explore will eventually yield to betting (exploitation) when the evidence becomes sufficiently strong. Again, subjects follow this pattern, observing 85.3% of the time on the first trial in a block and betting 98.4% on the final trial (see Fig. 2b).
Next, we implemented a previously developed computational model and fit it to subjects’ choice data (Navarro et al., 2016). This model consists of an evidence tally that tracks how much evidence the learner currently has about the outcome probability, and a decision threshold that captures when the subject switches between observe and bet behaviors (see Fig. 2c). We fit this model to each subject’s behavior from the prescanning blocks and used the fitted model to construct regressors for our fMRI analysis (see Method section). Behavior was stable across prescanning and scanning blocks (see Fig. 2d–e).
In a follow-up session, our 18 subjects returned and performed the “observe or bet” task in an fMRI scanner. Our model contained regressors for the appearance of stimuli, when a subject observed, when a subject bet, and the change in the absolute value of the evidence tally (see Materials and Method section).
We first attempted to identify regions associated with the decision to explore versus exploit (i.e., observe vs. bet). We chose to specifically investigate brain regions previously associated with value-based decision-making or exploration. Specifically, we examined the frontal pole and rostrolateral prefrontal cortex, which have both previously been implicated in balancing exploration and exploitation (Badre et al., 2012; Boorman et al., 2009; Daw et al., 2006; Donoso, Collins, & Koechlin, 2014). We also investigated the striatum, ventromedial prefrontal cortex (vmPFC), insula, and dorsal anterior cingulate cortex (dACC), all of which play a role in value-based decision-making (Bartra et al., 2013). We analyzed the signal in each of these ROIs, averaged across voxels (see Materials and Method section for details of ROI construction).
Table. Values for the ROI analyses for the group-level observe–bet contrast (coordinates; cluster size in voxels)
- 32, 22, −8 (right)
- −30, 16, −8 (left)
- Dorsal anterior cingulate: 8, 16, 46
One potential concern with this analysis is that if people tend to switch from observing to betting more frequently than vice versa, any contrast between observe and bet trials would be confounded with task-switching effects. Indeed, subjects were significantly more likely to switch following an observe trial (p < .001, signed rank test). If this led to differential switch costs, then we would expect responses to be slower on bet trials than on observe trials, consistent with the empirical data, t(17) = 2.2, p < .05 (mean difference: 48 ms). Thus, our data do not allow us to completely rule out a task-switching confound.
Table. Values for the ROI analyses for the “update” contrast (coordinates; cluster size in voxels)
- −4, 36, −16
Using a reinforcement learning task that cleanly decouples exploration and exploitation, our study provides the first pure neural correlates of these processes. Insula and dorsal anterior cingulate cortex showed greater activation for observe (exploration) trials compared to bet (exploitation) trials. Ventromedial prefrontal cortex showed greater activation for bet compared to observe trials, although this result did not survive correction for multiple comparisons across the regions of interest that we examined. We also found behavioral evidence favoring a heuristic approximation of the Bayes-optimal exploration strategy (Navarro et al., 2016): The probability of exploration changed dynamically as evidence was accumulated. These dynamics were accompanied by a neural correlate in the vmPFC that negatively correlated with the size of the belief update, suggesting that this region may encode the degree to which outcomes match prior expectations.
The anterior cingulate cortex has figured prominently in past research on the exploration–exploitation dilemma, though its computational role is still unclear. Consistent with our findings, the anterior cingulate shows increased activity during exploration in multiarmed bandit (Amiez, Sallet, Procyk, & Petrides, 2012; Daw et al., 2006; Karlsson, Tervo, & Karpova, 2012; Quilodran, Rothe, & Procyk, 2008), foraging (Hayden, Pearson, & Platt, 2011; Kolling, Behrens, Mars, & Rushworth, 2012), and sequential problem-solving tasks (Procyk, Tanaka, & Joseph, 2000). Some evidence suggests that the anterior cingulate reports the value of alternative options (Blanchard & Hayden, 2014; Boorman, Rushworth, & Behrens, 2013; Hayden et al., 2011; Kolling et al., 2012); when this value exceeds the value of the current option, the optimal policy is to explore. Shenhav, Botvinick, and Cohen (2013) have argued that exploration is a control-demanding behavior, requiring an override of the currently dominant behavior in order to pursue greater long-term rewards. In this framework, anterior cingulate reports the expected long-term value of invoking cognitive control.
The insula has also been implicated in several studies of the exploration–exploitation dilemma. Li, McClure, King-Casas, and Montague (2006) found insula activation in response to changes in reward structure during a dynamic economic game. These changes were accompanied by rapid alterations in the behavioral strategy. In a study of adolescents, Kayser, Op de Macks, Dahl, & Frank (2016) found that resting-state connectivity between rostrolateral prefrontal cortex and insula distinguished “explorers” from “nonexplorers” on a temporal decision-making task. Finally, using positron emission tomography while subjects performed a bandit task, Ohira et al. (2013) reported that insula activity was correlated both with peripheral catecholamine concentration and response stochasticity. These results are consistent with our finding that insula was positively associated with exploration, though they do not provide insight into the region’s specific contribution.
Surprisingly, we did not find statistically significant effects of exploration in either frontopolar cortex or rostrolateral prefrontal cortex. Several influential studies have identified these regions as playing an important role in regulating exploration and exploitation (Badre et al., 2012; Beharelle et al., 2015; Boorman et al., 2009; Daw et al., 2006). It is not clear why we did not find effects in these regions; it is possible that our ROI selection procedure failed to identify the relevant voxels, or that these regions are primarily involved in other kinds of tasks (e.g., standard bandit or temporal decision-making tasks). One approach to this issue would be to define subject-specific functional ROIs using these other tasks and then interrogate regional responses using the observe or bet task. Another possibility is that substantive differences in task design and analysis account for the lack of activation. For example, Daw et al. (2006) defined exploratory versus exploitative trials based on whether subjects chose the option with highest expected value, whereas in our study subjects might choose options with either high or low expected value on exploratory trials.
Our model-based analysis posits that an important computation governing exploration is the updating of the belief state. We found a negative effect of updating in the vmPFC, indicating that this region was more active when expectations were confirmed. One way to interpret this finding is that the ventromedial prefrontal cortex signals a match between outcomes and expectations (i.e., a kind of “confirmation” or “match” signal). An analogous match signal has been observed in a visual same/different judgment task (Summerfield & Koechlin, 2008). In a related vein, Stern, Gonzalez, Welsh, and Taylor (2010) reported that signals in vmPFC correlated with “underconfidence” (the degree to which self-reported posterior probabilities underestimate objective posterior probabilities), consistent with the hypothesis that reduced updating will elicit greater vmPFC activity.
In the context of reinforcement learning and decision-making tasks, the ventromedial prefrontal cortex has more commonly been associated with reward expectation (Bartra et al., 2013) rather than outcome-expectation comparisons. Nonetheless, a number of studies have reported evidence accumulation correlates in this region or nearby regions (d’Acremont, Fornari, & Bossaerts, 2013; Chan, Niv, & Norman, 2016). More research is needed to pinpoint the relationship between these findings and exploration during reinforcement learning.
One limitation of our approach is that exploration is confounded with time: subjects are less likely to observe on later trials. A promising approach to dealing with this issue would be to use a yoked control condition in which subjects see the same sequence of trials without the trial types being contingent on their own actions (cf. Wang & Voss, 2014). However, this yoked control is imperfect insofar as it essentially eliminates the exploration–exploitation trade-off.
Another limitation of our approach is that we only considered a single model in detail, one developed specifically to approximate the Bayes-optimal strategy on the observe-or-bet task (Navarro et al., 2016). Navarro and colleagues compared this model to several variants, which differed in terms of their assumptions about evidence decay and decision thresholds. They concluded, on the basis of qualitative and quantitative measures of model fit, that both decaying evidence and declining thresholds were necessary to account for the choice data. Although this is still a fairly restricted space of models, it is worth pointing out that most conventional reinforcement learning models cannot address the task at all: Because the observe action does not accrue any points, it will always be assigned a value of zero by model-free algorithms like Q-learning. Nonetheless, the model developed by Navarro and colleagues invokes cognitive mechanisms that are shared across many other models, such as incremental adjustment of expectations (as in Q-learning) and decisions based on a stochastic threshold-crossing (as in sequential sampling models). The interface of these mechanisms has recently become an important focus of research in reinforcement learning (Frank et al., 2015; Pedersen, Frank, & Biele, 2017).
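The point about model-free learning can be illustrated with a toy example. The sketch below uses a standard delta-rule Q-value update with a hypothetical learning rate of 0.1; it is not a model fitted in this study, and it simply shows that an action whose reward is always zero keeps a value of zero.

```python
def q_update(q, action, reward, learning_rate=0.1):
    """Delta-rule update of the value of the chosen action."""
    q[action] += learning_rate * (reward - q[action])
    return q

q = {"observe": 0.0, "bet_blue": 0.0, "bet_red": 0.0}

# Observing never yields points, so its value never moves from zero.
for _ in range(100):
    q_update(q, "observe", reward=0.0)

# A single rewarded bet is enough to make betting look better than observing.
q_update(q, "bet_blue", reward=1.0)
```

Because the informational benefit of observing is not a reward, a learner of this kind has no reason ever to prefer the observe action.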
Finally, we must keep in mind that while the observe-or-bet task provides “pure” correlates by decoupling information acquisition and action selection, there are many other cognitive processes involved in exploration and exploitation, which may be shared across observe and bet trials. Thus, we cannot decisively conclude that this contrast has perfectly isolated the critical computations underlying exploration and exploitation. It is unlikely that any single task will be able to achieve complete purity in this sense, so our findings should be understood as complementing, rather than superseding, previous studies of exploration and exploitation, all of which have their strengths and weaknesses.
In summary, the main contribution of our study is the isolation of neural correlates specific to exploration. The major open question is computational: What exactly do the insula and anterior cingulate contribute to exploration? As discussed in the preceding paragraphs, the literature is well-supplied with hypotheses, but our study was not designed to discriminate between them. Thus, an important task for future research will be to use tasks like “observe or bet” in combination with experimental manipulations (e.g., volatility or the distribution of rewards) that are diagnostic of underlying mechanisms.
We are grateful to Joel Voss for helpful comments on an earlier draft. This research was carried out at the Harvard Center for Brain Science with the support of the Pershing Square Fund for Research on the Foundations of Human Behavior. This work involved the use of instrumentation supported by the NIH Shared Instrumentation Grant Program, Grant No. S10OD020039. We acknowledge the University of Minnesota Center for Magnetic Resonance Research for use of the multiband-EPI pulse sequences.
Compliance with ethical standards
Conflict of interest
The authors declare no competing financial interests.
- Boorman, E. D., Rushworth, M. F., & Behrens, T. E. (2013). Ventromedial prefrontal and anterior cingulate cortex adopt choice and default reference frames during sequential multi-alternative choice. The Journal of Neuroscience, 33, 2242–2253.
- Cohen, J. D., McClure, S. M., & Yu, A. J. (2007). Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 362, 933–942.
- d’Acremont, M., Fornari, E., & Bossaerts, P. (2013). Activity in inferior parietal and medial prefrontal cortex signals the accumulation of evidence in a probability learning task. PLOS Computational Biology, 9, e1002895.
- Erev, I., & Roth, A. E. (1998). Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria. American Economic Review, 88, 848–881.
- Feinberg, D. A., Moeller, S., Smith, S. M., Auerbach, E., Ramanna, S., Glasser, M. F., … Yacoub, E. (2010). Multiplexed echo planar imaging for sub-second whole brain fMRI and fast diffusion imaging. PLOS ONE, 5, e15710.
- Moeller, S., Yacoub, E., Olman, C. A., Auerbach, E., Strupp, J., Harel, N., & Uğurbil, K. (2010). Multiband multislice GE-EPI at 7 Tesla with 16-fold acceleration using partial parallel imaging with application to high spatial and temporal whole-brain fMRI. Magnetic Resonance in Medicine, 63, 1144–1153.
- Quilodran, R., Rothe, M., & Procyk, E. (2008). Behavioral shifts and action valuation in the anterior cingulate cortex. Neuron, 57, 314–325.
- Speekenbrink, M., & Konstantinidis, E. (2015). Uncertainty and exploration in a restless bandit problem. Topics in Cognitive Science, 7, 351–367.
- Spunt, B. (2016). spunt/bspmview: BSPMVIEW v.20161108. Zenodo. Retrieved from https://zenodo.org/record/168074
- Stan Development Team (2016). RStan: The R interface to Stan (R Package Version 2.14.1) [Computer software]. Retrieved from http://mc-stan.org
- Xu, J., Moeller, S., Auerbach, E. J., Strupp, J., Smith, S. M., Feinberg, D. A., … Uğurbil, K. (2013). Evaluation of slice accelerations using multiband echo planar imaging at 3 T. NeuroImage, 83, 991–1001.