Encyclopedia of Computational Neuroscience

Living Edition
Editors: Dieter Jaeger, Ranu Jung

Reward-Based Learning, Model-Based and Model-Free

  • Quentin J. M. Huys
  • Peggy Seriès
Living reference work entry


DOI: https://doi.org/10.1007/978-1-4614-7320-6_674-2


Reinforcement learning (RL) techniques are a set of solutions for choosing actions that are optimal in the long term, that is, actions that take into account both their immediate and their delayed consequences. They fall into two broad classes: model-based and model-free approaches. Model-based approaches assume an explicit model of the environment and the agent. The model describes the consequences of actions and the associated returns, and optimal policies can be inferred from it. Psychologically, model-based descriptions apply to goal-directed decisions, in which choices reflect current preferences over outcomes. Model-free approaches forgo any explicit knowledge of the dynamics of the environment or the consequences of actions and instead evaluate how good actions are through trial-and-error learning. Model-free values underlie habitual and Pavlovian conditioned responses that are emitted reflexively when faced with certain stimuli. While model-based techniques have substantial computational demands, model-free...
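The contrast between the two classes can be made concrete with a minimal sketch. The three-state task below, its transition table `T`, reward table `R`, discount factor `GAMMA`, and all function names are hypothetical and chosen purely for illustration; they do not come from the entry. The model-based routine reads `T` and `R` directly and infers action values by value iteration, whereas the model-free routine only sees sampled rewards and next states and learns values by trial and error using Q-learning, one standard model-free algorithm (Watkins and Dayan 1992).

```python
import random

# Hypothetical 3-state, 2-action task (illustration only).
# T[s][a] lists (probability, next_state); R[s][a] is the immediate reward.
T = {
    0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
    1: {0: [(1.0, 0)], 1: [(0.8, 2), (0.2, 1)]},
    2: {0: [(1.0, 2)], 1: [(1.0, 0)]},
}
R = {
    0: {0: 0.1, 1: 0.0},
    1: {0: 0.0, 1: 0.0},
    2: {0: 1.0, 1: 0.0},
}
GAMMA = 0.9  # assumed discount factor


def value_iteration(T, R, gamma, sweeps=500):
    """Model-based: infer action values from the explicit model (T, R)."""
    Q = {s: {a: 0.0 for a in T[s]} for s in T}
    for _ in range(sweeps):
        V = {s: max(Q[s].values()) for s in T}
        Q = {
            s: {
                a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a])
                for a in T[s]
            }
            for s in T
        }
    return Q


def step(s, a, rng):
    """Environment: returns one sampled (reward, next_state) pair."""
    u, cum = rng.random(), 0.0
    for p, s2 in T[s][a]:
        cum += p
        if u < cum:
            return R[s][a], s2
    return R[s][a], T[s][a][-1][1]


def q_learning(gamma, n_steps=200_000, alpha=0.05, epsilon=0.2, seed=0):
    """Model-free: learn action values from sampled experience only."""
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1, 2)}
    s = 0
    for _ in range(n_steps):
        # Epsilon-greedy choice keeps every action being tried occasionally.
        if rng.random() < epsilon:
            a = rng.choice((0, 1))
        else:
            a = max(Q[s], key=Q[s].get)
        r, s2 = step(s, a, rng)
        # Temporal-difference prediction error drives the update.
        delta = r + gamma * max(Q[s2].values()) - Q[s][a]
        Q[s][a] += alpha * delta
        s = s2
    return Q


def greedy(Q):
    """Greedy policy: the highest-valued action in each state."""
    return {s: max(Q[s], key=Q[s].get) for s in Q}


if __name__ == "__main__":
    print("model-based policy:", greedy(value_iteration(T, R, GAMMA)))
    print("model-free policy: ", greedy(q_learning(GAMMA)))
```

In this sketch the model-based values follow from the model in a few hundred cheap sweeps and would change as soon as `T` or `R` were edited, while the model-free values need many sampled transitions and only track such a change through further experience. That is one way to picture the trade-off between computational and experiential demands that the paragraph above describes.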




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Division of Psychiatry and Max Planck UCL Centre for Computational Psychiatry and Ageing Research, University College London, London, UK
  2. Department of Psychiatry, Psychotherapy and Psychosomatics, Hospital of Psychiatry, University of Zürich, Zürich, Switzerland
  3. Translational Neuromodeling Unit, Department of Biomedical Engineering, ETH Zürich and University of Zürich, Zürich, Switzerland
  4. Institute of Adaptive and Neural Computation, University of Edinburgh, Scotland, UK

Section editors and affiliations

  • Joaquin J. Torres
  1. Institute “Carlos I” for Theoretical and Computational Physics and Department of Electromagnetism and Matter Physics, Facultad de Ciencias, Universidad de Granada, Granada, Spain