Abstract
Artificial behavioral agents are often evaluated on how consistently and how well they take sequential actions in an environment to maximize some notion of cumulative reward. However, human decision making in real life usually involves different strategies and behavioral trajectories that lead to the same empirical outcome. Motivated by the clinical literature on a wide range of neurological and psychiatric disorders, we propose a more general and flexible parametric framework for sequential decision making that involves a two-stream reward processing mechanism. We demonstrate that this framework is flexible and unified enough to incorporate a family of problems spanning multi-armed bandits (MAB), contextual bandits (CB) and reinforcement learning (RL), which decompose the sequential decision-making process at different levels. Inspired by the known reward processing abnormalities of many mental disorders, our clinically inspired agents demonstrated interesting behavioral trajectories and comparable performance on simulated tasks with particular reward distributions, a real-world dataset capturing human decision making in gambling tasks, and the PacMan game across different reward stationarities in a lifelong learning setting. The code to reproduce all the experimental results can be accessed at https://github.com/doerlbh/mentalRL.
References
Agrawal, S., Goyal, N.: Analysis of Thompson Sampling for the multi-armed bandit problem. In: COLT 2012 - The 25th Annual Conference on Learning Theory, Edinburgh, Scotland, 25–27 June 2012, pp. 39.1–39.26 (2012). http://www.jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf
Agrawal, S., Goyal, N.: Thompson sampling for contextual bandits with linear payoffs. In: ICML, no. 3, pp. 127–135 (2013)
Auer, P., Cesa-Bianchi, N.: On-line learning with malicious noise and the closure algorithm. Ann. Math. Artif. Intell. 23(1–2), 83–99 (1998)
Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47(2–3), 235–256 (2002)
Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1), 48–77 (2002)
Bayer, H.M., Glimcher, P.W.: Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron 47(1), 129–141 (2005). https://doi.org/10.1016/j.neuron.2005.05.020
Bechara, A., Damasio, A.R., Damasio, H., Anderson, S.W.: Insensitivity to future consequences following damage to human prefrontal cortex. Cognition 50(1–3), 7–15 (1994)
Beygelzimer, A., Langford, J., Li, L., Reyzin, L., Schapire, R.: Contextual bandit algorithms with supervised learning guarantees. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 19–26 (2011)
Bouneffouf, D., Féraud, R.: Multi-armed bandit problem with known trend. Neurocomputing 205, 16–21 (2016). https://doi.org/10.1016/j.neucom.2016.02.052
Bouneffouf, D., Rish, I., Cecchi, G.A.: Bandit models of human behavior: reward processing in mental disorders. In: Everitt, T., Goertzel, B., Potapov, A. (eds.) AGI 2017. LNCS (LNAI), vol. 10414, pp. 237–248. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-63703-7_22
Bouneffouf, D., Rish, I., Cecchi, G.A., Féraud, R.: Context attentive bandits: contextual bandit with restricted context. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 1468–1475 (2017)
Chapelle, O., Li, L.: An empirical evaluation of Thompson sampling. In: Advances in Neural Information Processing Systems, pp. 2249–2257 (2011)
Dayan, P., Niv, Y.: Reinforcement learning: the good, the bad and the ugly. Curr. Opin. Neurobiol. 18(2), 185–196 (2008)
Elfwing, S., Seymour, B.: Parallel reward and punishment control in humans and robots: Safe reinforcement learning using the MaxPain algorithm. In: 2017 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pp. 140–147. IEEE (2017)
Even-Dar, E., Mansour, Y.: Learning rates for Q-learning. J. Mach. Learn. Res. 5, 1–25 (2003)
Frank, M.J., O’Reilly, R.C.: A mechanistic account of striatal dopamine function in human cognition: psychopharmacological studies with cabergoline and haloperidol. Behav. Neurosci. 120(3), 497–517 (2006). https://doi.org/10.1037/0735-7044.120.3.497
Frank, M.J., Seeberger, L.C., O’Reilly, R.C.: By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science 306(5703), 1940–1943 (2004)
Fridberg, D.J., et al.: Cognitive mechanisms underlying risky decision-making in chronic cannabis users. J. Math. Psychol. 54(1), 28–38 (2010)
Hart, A.S., Rutledge, R.B., Glimcher, P.W., Phillips, P.E.M.: Phasic dopamine release in the rat nucleus accumbens symmetrically encodes a reward prediction error term. J. Neurosci. 34(3), 698–704 (2014). https://doi.org/10.1523/JNEUROSCI.2489-13.2014
Hasselt, H.V.: Double Q-learning. In: Advances in Neural Information Processing Systems, pp. 2613–2621 (2010)
Holmes, A.J., Patrick, L.M.: The myth of optimality in clinical neuroscience. Trends Cogn. Sci. 22(3), 241–257 (2018). https://doi.org/10.1016/j.tics.2017.12.006
Horstmann, A., Villringer, A., Neumann, J.: Iowa gambling task: there is more to consider than long-term outcome. Using a linear equation model to disentangle the impact of outcome and frequency of gains and losses. Front. Neurosci. 6, 61 (2012)
Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1), 4–22 (1985)
Langford, J., Zhang, T.: The Epoch-Greedy algorithm for contextual multi-armed bandits (2007)
Langford, J., Zhang, T.: The Epoch-Greedy algorithm for multi-armed bandits with side information. In: Advances in Neural Information Processing Systems, pp. 817–824 (2008)
Li, L., Chu, W., Langford, J., Wang, X.: Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In: King, I., Nejdl, W., Li, H. (eds.) WSDM, pp. 297–306. ACM (2011). http://dblp.uni-trier.de/db/conf/wsdm/wsdm2011.html#LiCLW11
Lin, B.: Diabolical games: reinforcement learning environments for lifelong learning (2020)
Lin, B.: Online semi-supervised learning in contextual bandits with episodic reward. arXiv preprint arXiv:2009.08457 (2020)
Lin, B., Bouneffouf, D., Cecchi, G.: Split q learning: reinforcement learning with two-stream rewards. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 6448–6449. AAAI Press (2019)
Lin, B., Bouneffouf, D., Cecchi, G.: Online learning in iterated prisoner’s dilemma to mimic human behavior. arXiv preprint arXiv:2006.06580 (2020)
Lin, B., Bouneffouf, D., Cecchi, G.A., Rish, I.: Contextual bandit with adaptive feature extraction. In: 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 937–944. IEEE (2018)
Lin, B., Bouneffouf, D., Reinen, J., Rish, I., Cecchi, G.: A story of two streams: reinforcement learning models from human behavior and neuropsychiatry. In: Proceedings of the Nineteenth International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS 2020, pp. 744–752. International Foundation for Autonomous Agents and Multiagent Systems, May 2020
Lin, B., Zhang, X.: Speaker diarization as a fully online learning problem in MiniVox. arXiv preprint arXiv:2006.04376 (2020)
Lin, B., Zhang, X.: VoiceID on the fly: a speaker recognition system that learns from scratch. In: INTERSPEECH (2020)
Maia, T.V., Frank, M.J.: From reinforcement learning models to psychiatric and neurological disorders. Nat. Neurosci. 14(2), 154–162 (2011). https://doi.org/10.1038/nn.2723
O’Doherty, J., Dayan, P., Schultz, J., Deichmann, R., Friston, K., Dolan, R.J.: Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science 304(5669), 452–454 (2004). https://doi.org/10.1126/science.1094285
Perry, D.C., Kramer, J.H.: Reward processing in neurodegenerative disease. Neurocase 21(1), 120–133 (2015)
Rummery, G.A., Niranjan, M.: On-line Q-learning using connectionist systems, vol. 37. University of Cambridge, Department of Engineering Cambridge, England (1994)
Schultz, W., Dayan, P., Montague, P.R.: A neural substrate of prediction and reward. Science 275(5306), 1593–1599 (1997). https://doi.org/10.1126/science.275.5306.1593
Seymour, B., Singer, T., Dolan, R.: The neurobiology of punishment. Nat. Rev. Neurosci. 8(4), 300–311 (2007). https://doi.org/10.1038/nrn2119
Steingroever, H., et al.: Data from 617 healthy participants performing the Iowa gambling task: a “Many Labs” collaboration. J. Open Psychol. Data 3(1), 340–353 (2015)
Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning, 1st edn. MIT Press, Cambridge (1998)
Sutton, R.S., Barto, A.G., et al.: Introduction to Reinforcement Learning, vol. 135. MIT Press, Cambridge (1998)
Thompson, W.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 285–294 (1933)
Tversky, A., Kahneman, D.: The framing of decisions and the psychology of choice. Science 211(4481), 453–458 (1981)
A Further Motivation from Neuroscience
In this section, we provide further discussion, with a review of neuroscience and clinical studies related to reward processing systems.
Cellular Computation of Reward and Reward Violation. Decades of evidence have linked dopamine function to reinforcement learning via neurons in the midbrain and their connections in the basal ganglia, limbic regions, and cortex. Firing rates of dopamine neurons computationally represent reward magnitude, expectancy, violations (prediction error), and other value-based signals [39]. This allows an animal to update and maintain value expectations associated with particular states and actions. When functioning properly, this helps an animal develop a policy that maximizes outcomes by approaching or choosing cues with higher expected value and avoiding cues associated with loss or punishment. The mechanism is conceptually similar to the reinforcement learning methods widely used in computing and robotics [43], suggesting mechanistic overlap between humans and AI. Evidence for Q-learning and actor-critic models has been observed in the spiking activity of midbrain dopamine neurons in primates [6] and in the human striatum using the BOLD signal [36].
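As a concrete illustration of the prediction-error idea described above, the following is a minimal sketch of one tabular Q-learning step. It is not the framework proposed in this paper; the function name, the state and action labels, and the values of the learning rate alpha and discount gamma are hypothetical choices made only for illustration.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning step and return the prediction error.

    alpha (learning rate) and gamma (discount) are illustrative values.
    """
    # Best value currently expected from the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Prediction error: outcome received versus outcome expected.
    delta = reward + gamma * best_next - q[(state, action)]
    # Move the stored expectation a fraction alpha toward the outcome.
    q[(state, action)] += alpha * delta
    return delta


# Hypothetical two-action task: an unexpected reward of 1.0 follows "left".
q = defaultdict(float)
actions = ["left", "right"]
first_delta = q_learning_update(q, "cue", "left", 1.0, "end", actions)
second_delta = q_learning_update(q, "cue", "left", 1.0, "end", actions)
```

In this sketch the second prediction error is smaller than the first, because the reward is partly expected after one update. This mirrors the qualitative pattern in which a fully predicted reward elicits a weaker error signal.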
Positive vs. Negative Learning Signals. Phasic dopamine signaling represents bidirectional (positive and negative) coding for prediction error signals [19], but the underlying mechanisms differ for reward learning relative to punishment learning [40]. Though the representation of cellular-level aversive error signaling has been debated [13], it is widely thought that rewarding, salient information is represented by phasic dopamine signals, whereas reward omission or punishment signals are represented by dips or pauses in baseline dopamine firing [39]. These mechanisms have downstream effects on motivation, approach behavior, and action selection. Reward signaling in a direct pathway links striatum to cortex via dopamine neurons that disinhibit the thalamus via the internal segment of the globus pallidus and facilitate action and approach behavior. Conversely, aversive signals may have the opposite effect in the indirect pathway, mediated by D2 neurons that inhibit thalamic function and, ultimately, action [16]. Manipulating these circuits through pharmacological measures or disease has demonstrated computationally predictable effects that bias learning from positive or negative prediction error in humans [17], and contribute to our understanding of perceptible differences in human decision making when differentially motivated by loss or gain [45].
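The bias in learning from positive versus negative prediction errors can be sketched with a value update that uses separate learning rates for the two signs of the error. This is an illustrative assumption rather than the parametric framework of this paper; the function names and the rates alpha_pos and alpha_neg are hypothetical.

```python
def asymmetric_update(value, reward, alpha_pos=0.2, alpha_neg=0.05):
    """Update a value estimate with sign-dependent learning rates.

    alpha_pos scales learning from positive prediction errors and
    alpha_neg from negative ones; both rates are illustrative.
    """
    delta = reward - value  # prediction error
    alpha = alpha_pos if delta >= 0 else alpha_neg
    return value + alpha * delta


def run(rewards, alpha_pos, alpha_neg):
    """Feed a reward sequence through the update, starting from zero."""
    value = 0.0
    for r in rewards:
        value = asymmetric_update(value, r, alpha_pos, alpha_neg)
    return value


# Identical outcomes (alternating gain and loss, zero on average).
rewards = [1.0, -1.0] * 50
gain_biased = run(rewards, alpha_pos=0.2, alpha_neg=0.05)
loss_biased = run(rewards, alpha_pos=0.05, alpha_neg=0.2)
```

Under these assumed rates, the two learners see the same outcomes yet settle on value estimates of opposite sign: the gain-biased learner ends with a positive estimate and the loss-biased learner with a negative one.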
Clinical Implications. Highlighting the importance of using computational models to understand and predict disease outcomes, many symptoms of neurological and psychiatric disease are related to biases in learning from positive and negative feedback [35]. Studies in humans have shown that when reward signaling in the direct pathway is over-expressed, this may enhance the value associated with a state and lead to pathological reward-seeking behavior, such as gambling or substance use. Conversely, when aversive error signals are enhanced, this results in a dampening of reward experience and increased motor inhibition, causing symptoms that decrease motivation, such as apathy, social withdrawal, fatigue, and depression. Further, it has been proposed that exposure to a particular distribution of experiences during critical periods of development can biologically predispose an individual to learn from positive or negative outcomes, making them more or less susceptible to brain-based illnesses [21]. These points highlight the need for a greater understanding of how intelligent systems differentially learn from rewards or punishments, and how experience sampling may impact reinforcement learning during influential training periods.
© 2021 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Lin, B., Cecchi, G., Bouneffouf, D., Reinen, J., Rish, I. (2021). Models of Human Behavioral Agents in Bandits, Contextual Bandits and RL. In: Wang, Y. (eds) Human Brain and Artificial Intelligence. HBAI 2021. Communications in Computer and Information Science, vol 1369. Springer, Singapore. https://doi.org/10.1007/978-981-16-1288-6_2
DOI: https://doi.org/10.1007/978-981-16-1288-6_2
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-1287-9
Online ISBN: 978-981-16-1288-6
eBook Packages: Computer Science (R0)