Models of Human Behavioral Agents in Bandits, Contextual Bandits and RL

Part of the Communications in Computer and Information Science book series (CCIS, volume 1369)


Artificial behavioral agents are often evaluated on how consistently and how well they take sequential actions in an environment to maximize some notion of cumulative reward. However, human decision making in real life usually involves different strategies and behavioral trajectories that lead to the same empirical outcome. Motivated by the clinical literature on a wide range of neurological and psychiatric disorders, we propose a more general and flexible parametric framework for sequential decision making that involves a two-stream reward processing mechanism. We demonstrate that this framework is flexible and unified enough to incorporate a family of problems spanning multi-armed bandits (MAB), contextual bandits (CB) and reinforcement learning (RL), which decompose the sequential decision making process at different levels. Inspired by the known reward processing abnormalities of many mental disorders, our clinically inspired agents demonstrated interesting behavioral trajectories and comparable performance on simulated tasks with particular reward distributions, a real-world dataset capturing human decision making in gambling tasks, and the PacMan game across different reward stationarities in a lifelong learning setting (the code to reproduce all the experimental results can be accessed at …).


  • Reinforcement learning
  • Contextual bandit
  • Neuroscience

DOI: 10.1007/978-981-16-1288-6_2
Chapter length: 20 pages

References

  1. Agrawal, S., Goyal, N.: Analysis of Thompson Sampling for the multi-armed bandit problem. In: COLT 2012 - The 25th Annual Conference on Learning Theory, Edinburgh, Scotland, 25–27 June 2012, pp. 39.1–39.26 (2012)

  2. Agrawal, S., Goyal, N.: Thompson sampling for contextual bandits with linear payoffs. In: ICML, no. 3, pp. 127–135 (2013)

  3. Auer, P., Cesa-Bianchi, N.: On-line learning with malicious noise and the closure algorithm. Ann. Math. Artif. Intell. 23(1–2), 83–99 (1998)

  4. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47(2–3), 235–256 (2002)

  5. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1), 48–77 (2002)

  6. Bayer, H.M., Glimcher, P.W.: Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron 47(1), 129–141 (2005)

  7. Bechara, A., Damasio, A.R., Damasio, H., Anderson, S.W.: Insensitivity to future consequences following damage to human prefrontal cortex. Cognition 50(1–3), 7–15 (1994)

  8. Beygelzimer, A., Langford, J., Li, L., Reyzin, L., Schapire, R.: Contextual bandit algorithms with supervised learning guarantees. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 19–26 (2011)

  9. Bouneffouf, D., Féraud, R.: Multi-armed bandit problem with known trend. Neurocomputing 205, 16–21 (2016)

  10. Bouneffouf, D., Rish, I., Cecchi, G.A.: Bandit models of human behavior: reward processing in mental disorders. In: Everitt, T., Goertzel, B., Potapov, A. (eds.) AGI 2017. LNCS (LNAI), vol. 10414, pp. 237–248. Springer, Cham (2017)

  11. Bouneffouf, D., Rish, I., Cecchi, G.A., Féraud, R.: Context attentive bandits: contextual bandit with restricted context. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 1468–1475 (2017)

  12. Chapelle, O., Li, L.: An empirical evaluation of Thompson sampling. In: Advances in Neural Information Processing Systems, pp. 2249–2257 (2011)

  13. Dayan, P., Niv, Y.: Reinforcement learning: the good, the bad and the ugly. Curr. Opin. Neurobiol. 18(2), 185–196 (2008)

  14. Elfwing, S., Seymour, B.: Parallel reward and punishment control in humans and robots: Safe reinforcement learning using the MaxPain algorithm. In: 2017 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pp. 140–147. IEEE (2017)

  15. Even-Dar, E., Mansour, Y.: Learning rates for Q-learning. J. Mach. Learn. Res. 5, 1–25 (2003)

  16. Frank, M.J., O’Reilly, R.C.: A mechanistic account of striatal dopamine function in human cognition: psychopharmacological studies with cabergoline and haloperidol. Behav. Neurosci. 120(3), 497–517 (2006)

  17. Frank, M.J., Seeberger, L.C., O’Reilly, R.C.: By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science 306(5703), 1940–1943 (2004)

  18. Fridberg, D.J., et al.: Cognitive mechanisms underlying risky decision-making in chronic cannabis users. J. Math. Psychol. 54(1), 28–38 (2010)

  19. Hart, A.S., Rutledge, R.B., Glimcher, P.W., Phillips, P.E.M.: Phasic dopamine release in the rat nucleus accumbens symmetrically encodes a reward prediction error term. J. Neurosci. 34(3), 698–704 (2014)

  20. Hasselt, H.V.: Double Q-learning. In: Advances in Neural Information Processing Systems, pp. 2613–2621 (2010)

  21. Holmes, A.J., Patrick, L.M.: The myth of optimality in clinical neuroscience. Trends Cogn. Sci. 22(3), 241–257 (2018)

  22. Horstmann, A., Villringer, A., Neumann, J.: Iowa gambling task: there is more to consider than long-term outcome. Using a linear equation model to disentangle the impact of outcome and frequency of gains and losses. Front. Neurosci. 6, 61 (2012)

  23. Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1), 4–22 (1985)

  24. Langford, J., Zhang, T.: The Epoch-Greedy algorithm for contextual multi-armed bandits (2007)

  25. Langford, J., Zhang, T.: The Epoch-Greedy algorithm for multi-armed bandits with side information. In: Advances in Neural Information Processing Systems, pp. 817–824 (2008)

  26. Li, L., Chu, W., Langford, J., Wang, X.: Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In: King, I., Nejdl, W., Li, H. (eds.) WSDM, pp. 297–306. ACM (2011)

  27. Lin, B.: Diabolical games: reinforcement learning environments for lifelong learning (2020)

  28. Lin, B.: Online semi-supervised learning in contextual bandits with episodic reward. arXiv preprint arXiv:2009.08457 (2020)

  29. Lin, B., Bouneffouf, D., Cecchi, G.: Split q learning: reinforcement learning with two-stream rewards. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 6448–6449. AAAI Press (2019)

  30. Lin, B., Bouneffouf, D., Cecchi, G.: Online learning in iterated prisoner’s dilemma to mimic human behavior. arXiv preprint arXiv:2006.06580 (2020)

  31. Lin, B., Bouneffouf, D., Cecchi, G.A., Rish, I.: Contextual bandit with adaptive feature extraction. In: 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 937–944. IEEE (2018)

  32. Lin, B., Bouneffouf, D., Reinen, J., Rish, I., Cecchi, G.: A story of two streams: reinforcement learning models from human behavior and neuropsychiatry. In: Proceedings of the Nineteenth International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS 2020, pp. 744–752. International Foundation for Autonomous Agents and Multiagent Systems (2020)

  33. Lin, B., Zhang, X.: Speaker diarization as a fully online learning problem in MiniVox. arXiv preprint arXiv:2006.04376 (2020)

  34. Lin, B., Zhang, X.: VoiceID on the fly: a speaker recognition system that learns from scratch. In: INTERSPEECH (2020)

  35. Maia, T.V., Frank, M.J.: From reinforcement learning models to psychiatric and neurological disorders. Nat. Neurosci. 14(2), 154–162 (2011)

  36. O’Doherty, J., Dayan, P., Schultz, J., Deichmann, R., Friston, K., Dolan, R.J.: Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science 304, 452–454 (2004)

  37. Perry, D.C., Kramer, J.H.: Reward processing in neurodegenerative disease. Neurocase 21(1), 120–133 (2015)

  38. Rummery, G.A., Niranjan, M.: On-line Q-learning using connectionist systems, vol. 37. University of Cambridge, Department of Engineering Cambridge, England (1994)

  39. Schultz, W., Dayan, P., Montague, P.R.: A neural substrate of prediction and reward. Science 275(5306), 1593–1599 (1997)

  40. Seymour, B., Singer, T., Dolan, R.: The neurobiology of punishment. Nat. Rev. Neurosci. 8(4), 300–311 (2007)

  41. Steingroever, H., et al.: Data from 617 healthy participants performing the Iowa gambling task: a “Many Labs” collaboration. J. Open Psychol. Data 3(1), 340–353 (2015)

  42. Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning, 1st edn. MIT Press, Cambridge (1998)

  43. Sutton, R.S., Barto, A.G., et al.: Introduction to Reinforcement Learning, vol. 135. MIT Press, Cambridge (1998)

  44. Thompson, W.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 285–294 (1933)

  45. Tversky, A., Kahneman, D.: The framing of decisions and the psychology of choice. Science 211(4481), 453–458 (1981)

Author information

Corresponding author

Correspondence to Baihan Lin.

A Further Motivation from Neuroscience

In this section, we provide further discussion and a literature review of the neuroscience and clinical studies related to reward processing systems.

Cellular Computation of Reward and Reward Violation. Decades of evidence have linked dopamine function to reinforcement learning via neurons in the midbrain and its connections in the basal ganglia, limbic regions, and cortex. Firing rates of dopamine neurons computationally represent reward magnitude, expectancy, and violations (prediction error), as well as other value-based signals [39]. This allows an animal to update and maintain value expectations associated with particular states and actions. When functioning properly, this helps an animal develop a policy to maximize outcomes by approaching or choosing cues with higher expected value and avoiding cues associated with loss or punishment. The mechanism is conceptually similar to the reinforcement learning widely used in computing and robotics [43], suggesting mechanistic overlap between humans and AI. Evidence for Q-learning and actor-critic models has been observed in the spiking activity of midbrain dopamine neurons in primates [6] and in the human striatum using the BOLD signal [36].
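The prediction-error idea above can be sketched as a single tabular Q-learning step. This is a minimal illustration, not the authors' code: the function name `q_learning_step`, the state and action labels, and the values of `alpha` and `gamma` are all assumptions chosen for the example.

```python
from collections import defaultdict


def q_learning_step(q, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the prediction error.

    alpha (learning rate) and gamma (discount factor) are illustrative values.
    """
    # Value expected from the best action available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Reward prediction error: what was received minus what was expected.
    delta = reward + gamma * best_next - q[(state, action)]
    # Move the stored expectation a fraction alpha toward the outcome.
    q[(state, action)] += alpha * delta
    return delta


# Hypothetical toy setting: one cue, two possible actions, a terminal state.
q = defaultdict(float)
actions = ["approach", "avoid"]
delta_first = q_learning_step(q, "cue", "approach", 1.0, "end", actions)
delta_second = q_learning_step(q, "cue", "approach", 1.0, "end", actions)
```

In this toy run the second prediction error is smaller than the first, because the stored expectation has already moved toward the reward.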

Positive vs. Negative Learning Signals. Phasic dopamine signaling represents bidirectional (positive and negative) coding for prediction error signals [19], but the underlying mechanisms show differentiation for reward relative to punishment learning [40]. Though the representation of cellular-level aversive error signaling has been debated [13], it is widely thought that rewarding, salient information is represented by phasic dopamine signals, whereas reward omission or punishment signals are represented by dips or pauses in baseline dopamine firing [39]. These mechanisms have downstream effects on motivation, approach behavior, and action selection. Reward signaling in a direct pathway links striatum to cortex via dopamine neurons that disinhibit the thalamus via the internal segment of the globus pallidus and facilitate action and approach behavior. Alternatively, aversive signals may have the opposite effect in the indirect pathway, mediated by D2 neurons that inhibit thalamic function and, ultimately, action [16]. Manipulating these circuits through pharmacological measures or disease has demonstrated computationally predictable effects that bias learning from positive or negative prediction error in humans [17], and contributes to our understanding of perceptible differences in human decision making when differentially motivated by loss or gain [45].
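One simple way to picture a bias toward learning from positive or negative prediction errors is a value update with two separate learning rates. The sketch below is only an illustration under that assumption; it is not the two-stream framework proposed in the paper, and the names `asymmetric_update`, `learn`, `alpha_pos` and `alpha_neg` are hypothetical.

```python
def asymmetric_update(value, reward, alpha_pos=0.3, alpha_neg=0.1):
    """Update a value estimate with separate rates for positive and negative errors."""
    delta = reward - value  # prediction error
    # Assumed mechanism: the sign of the error selects the learning rate.
    rate = alpha_pos if delta >= 0 else alpha_neg
    return value + rate * delta


def learn(rewards, alpha_pos, alpha_neg, value=0.0):
    """Run the asymmetric update over a sequence of outcomes."""
    for reward in rewards:
        value = asymmetric_update(value, reward, alpha_pos, alpha_neg)
    return value


# Identical experience: gains and losses alternate with equal magnitude.
outcomes = [1.0, -1.0] * 50
reward_biased = learn(outcomes, alpha_pos=0.3, alpha_neg=0.1)
punishment_biased = learn(outcomes, alpha_pos=0.1, alpha_neg=0.3)
```

Given the same balanced sequence of outcomes, the learner that weights positive errors more heavily settles on a positive value estimate, and the one that weights negative errors more heavily settles on a negative one.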

Clinical Implications. Many symptoms of neurological and psychiatric disease are related to biases in learning from positive and negative feedback [35], which highlights the importance of using computational models to understand and predict disease outcomes. Studies in humans have shown that when reward signaling in the direct pathway is over-expressed, this may enhance the value associated with a state and incur pathological reward-seeking behavior, like gambling or substance use. Conversely, when aversive error signals are enhanced, this results in dampening of reward experience and increased motor inhibition, causing symptoms that decrease motivation, such as apathy, social withdrawal, fatigue, and depression. Further, it has been proposed that exposure to a particular distribution of experiences during critical periods of development can biologically predispose an individual to learn from positive or negative outcomes, making them more or less susceptible to risk for brain-based illnesses [21]. These points highlight the need for a greater understanding of how intelligent systems differentially learn from rewards or punishments, and how experience sampling may impact reinforcement learning during influential training periods.

Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Lin, B., Cecchi, G., Bouneffouf, D., Reinen, J., Rish, I. (2021). Models of Human Behavioral Agents in Bandits, Contextual Bandits and RL. In: Wang, Y. (eds) Human Brain and Artificial Intelligence. HBAI 2021. Communications in Computer and Information Science, vol 1369. Springer, Singapore.

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-1287-9

  • Online ISBN: 978-981-16-1288-6

  • eBook Packages: Computer Science, Computer Science (R0)