Abstract
Humans explore to learn the structure of their environment. However, it remains unclear how consistent humans are in the exploration strategies they use, and in how often they explore, across environments that vary in volatility. In a within-subjects design, participants (n = 30) completed (1) a non-stationary bandit task in which the reward values changed throughout, and (2) a stationary bandit task in which one option always gave a better reward. We used a series of reinforcement learning models to understand the exploration strategies humans adopted in the two tasks. We found that most participants adopted a behavioural heuristic strategy (Win-Stay, Lose-Shift) in the non-stationary bandit task. In contrast, most participants adopted a probabilistic, random exploration strategy (Softmax) in the stationary bandit task. We also compared the results of fitting models within each task individually with the results of fitting models across both tasks, that is, focusing on long-term predictions. When fitting across both tasks, we found that most participants solely adopted a probabilistic, random exploration strategy. In addition, we found a moderate, positive relationship between the exploration rates in the two bandit tasks. Our findings show that humans can flexibly adopt different exploration strategies depending on task demands, which we suggest is because the two bandit tasks assessed different aspects of learning and required different levels of cognitive flexibility. We further speculate that the relationship between exploration rates could reflect a personality trait such as risk-taking. In sum, we found evidence for the flexible use of exploration strategies, while also observing evidence of the generalization of exploration across tasks.
Data Availability
The datasets generated are available from the corresponding author upon reasonable request. The datasets are not publicly available because participants did not consent to their data being shared in a public repository. Modeling and analysis code will be published at: https://github.com/tomferg/BanditComp
Code Availability
Code generated during modeling and analysis will be published at the following GitHub link: https://github.com/tomferg/BanditComp
Notes
In the non-stationary task, we always divided the point values obtained by 100 for all models where reward estimates were required (ε-Greedy, Softmax, Sliding Window Upper Confidence Bound, Gradient, Kalman Filter with Thompson Sampling).
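The rescaling described in this note can be illustrated with a minimal sketch. It is not the authors' code: the delta-rule update, the function names, the learning rate, the temperature, and the point values are all assumptions chosen for illustration. Only the division by 100 and the Softmax choice rule are taken from the text.

```python
import math


def softmax_probs(values, temperature):
    """Softmax choice probabilities over estimated option values."""
    scaled = [v / temperature for v in values]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def update_value(value, reward_points, learning_rate, scale=100.0):
    """Delta-rule update (assumed) on a reward rescaled by `scale`."""
    reward = reward_points / scale  # the division by 100 from the note
    return value + learning_rate * (reward - value)


# Hypothetical example: one sample from each of two options (60 and 40 points).
q = [0.0, 0.0]
q[0] = update_value(q[0], 60, learning_rate=0.5)
q[1] = update_value(q[1], 40, learning_rate=0.5)
probs = softmax_probs(q, temperature=0.1)
print(q, probs)
```

Under these assumptions, rescaling keeps the value estimates on a roughly unit scale, so a temperature parameter stays in a comparable numerical range whether rewards are points or binary outcomes.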
Funding
Thomas D. Ferguson would like to acknowledge support from the Dr. Roland and Muriel Haryett Neuroscience Fellowship and the Natural Sciences and Engineering Research Council of Canada. Alona Fyshe would like to acknowledge support from the Canadian Institute for Advanced Research (CIFAR) Canadian AI Chairs program. Adam White would like to acknowledge support from the CIFAR Canadian AI Chairs program. Olave E. Krigolson would like to acknowledge support from the Natural Sciences and Engineering Research Council of Canada (RGPIN 2016–0943). The authors declare that none of the funding sources mentioned above had any involvement in the design of the experiment or the preparation and submission of the manuscript.
Author information
Contributions
Thomas Ferguson and Olave Krigolson contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Thomas Ferguson. Thomas Ferguson, Alona Fyshe, and Adam White contributed to the computational models used. The first draft of the manuscript was written by Thomas Ferguson and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics Approval
The Human Research Ethics Board at the University of Victoria approved all experimental procedures (Date: 25-Sep-2019; 19–0230), and all research was performed in line with the principles of the Declaration of Helsinki.
Consent to Participate
Participants provided written informed consent prior to the completion of the experimental session.
Consent to Publish
Participants provided consent to have aggregated data (averages) published in a research journal.
Competing Interests
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Correspondence should be directed to: Thomas D. Ferguson, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, T6G 2R3.
About this article
Cite this article
Ferguson, T., Fyshe, A., White, A. et al. Humans Adopt Different Exploration Strategies Depending on the Environment. Comput Brain Behav 6, 671–696 (2023). https://doi.org/10.1007/s42113-023-00178-1
DOI: https://doi.org/10.1007/s42113-023-00178-1