The field of reinforcement learning has greatly influenced the neuroscientific study of conditioning. This article provides an introduction to reinforcement learning, followed by an examination of the successes and challenges of using reinforcement learning to understand the neural bases of conditioning. Successes reviewed include (1) the mapping of positive and negative prediction errors to the firing of dopamine neurons and neurons in the lateral habenula, respectively; (2) the mapping of model-based and model-free reinforcement learning to associative and sensorimotor cortico-basal ganglia-thalamo-cortical circuits, respectively; and (3) the mapping of actor and critic to the dorsal and ventral striatum, respectively. The challenges reviewed consist of several behavioral and neural findings that are at odds with standard reinforcement-learning models, including, among others, evidence for hyperbolic discounting and adaptive coding. The article suggests ways of reconciling reinforcement-learning models with many of the challenging findings, and highlights the need for further theoretical developments where necessary. Additional information related to this study may be downloaded from http://cabn.psychonomic-journals.org/content/supplemental.
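Two of the constructs named above can be made concrete in a few lines. The sketch below (not from the article; all parameter values and state names are arbitrary choices for the demonstration) shows a tabular temporal-difference critic whose prediction error, delta = r + gamma*V(s') - V(s), plays the role the abstract assigns to phasic dopamine signals, and contrasts the exponential discounting assumed by standard reinforcement-learning models with the hyperbolic discounting (Mazur, 1987) cited as a challenge.

```python
# Illustrative sketch only; state names, parameters, and reward values are
# hypothetical choices for the demonstration, not the article's model.

def td_error(r, v_next, v_curr, gamma=0.9):
    """TD prediction error: positive when outcomes exceed expectations,
    negative when they fall short (cf. dopamine vs. lateral habenula)."""
    return r + gamma * v_next - v_curr

def td_update(V, s, s_next, r, alpha=0.1, gamma=0.9):
    """One critic update: move V(s) toward the target r + gamma*V(s')."""
    delta = td_error(r, V[s_next], V[s], gamma)
    V[s] += alpha * delta
    return delta

# A minimal cue -> reward chain. With repeated pairings, value propagates
# back from the rewarded state to the predictive cue, so the positive
# prediction error migrates from reward delivery to cue onset.
V = {"cue": 0.0, "reward_state": 0.0, "terminal": 0.0}
for _ in range(200):
    td_update(V, "reward_state", "terminal", r=1.0)  # reward delivered here
    td_update(V, "cue", "reward_state", r=0.0)       # cue merely predicts it

# Discounting of a delayed reward: exponential (standard RL) vs. hyperbolic.
def exponential_value(amount, delay, gamma=0.9):
    return amount * gamma ** delay

def hyperbolic_value(amount, delay, k=0.1):
    return amount / (1.0 + k * delay)
```

Hyperbolic discounting yields the preference reversals exponential discounting cannot: with k = 0.1, a smaller-sooner reward (0.6 now) beats a larger-later one (1.0 at delay 10), yet when both are pushed 10 steps into the future the preference flips to the larger-later option.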
Abler, B., Walter, H., Erk, S., Kammerer, H., & Spitzer, M. (2006). Prediction error as a linear function of reward probability is coded in human nucleus accumbens. NeuroImage, 31, 790–795.
Adams, C. D. (1982). Variations in the sensitivity of instrumental responding to reinforcer devaluation. Quarterly Journal of Experimental Psychology, 34B, 77–98.
Ainslie, G. (1975). Specious reward: A behavioral theory of impulsiveness and impulse control. Psychological Bulletin, 82, 463–496.
Alexander, G. E., DeLong, M. R., & Strick, P. L. (1986). Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annual Review of Neuroscience, 9, 357–381.
Aron, A. R., Shohamy, D., Clark, J., Myers, C., Gluck, M. A., & Poldrack, R. A. (2004). Human midbrain sensitivity to cognitive feedback and uncertainty during classification learning. Journal of Neurophysiology, 92, 1144–1152.
Barnes, T. D., Kubota, Y., Hu, D., Jin, D. Z., & Graybiel, A. M. (2005). Activity of striatal neurons reflects dynamic encoding and recoding of procedural memories. Nature, 437, 1158–1161.
Barto, A. G. (1995). Adaptive critics and the basal ganglia. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 215–232). Cambridge, MA: MIT Press.
Barto, A. G., & Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems: Theory & Applications, 13, 343–379.
Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, & Cybernetics, 13, 834–846.
Bayer, H. M., & Glimcher, P. W. (2005). Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47, 129–141.
Bayer, H. M., Lau, B., & Glimcher, P. W. (2007). Statistics of midbrain dopamine neuron spike trains in the awake primate. Journal of Neurophysiology, 98, 1428–1439.
Bellman, R. E. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.
Belova, M. A., Paton, J. J., & Salzman, C. D. (2008). Moment-to-moment tracking of state value in the amygdala. Journal of Neuroscience, 28, 10023–10030.
Bernoulli, D. (1954). Exposition of a new theory on the measurement of risk. Econometrica, 22, 23–36. (Original work published 1738)
Berns, G. S., Capra, C. M., Chappelow, J., Moore, S., & Noussair, C. (2008). Nonlinear neurobiological probability weighting functions for aversive outcomes. NeuroImage, 39, 2047–2057.
Botvinick, M. M., Niv, Y., & Barto, A. G. (in press). Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition. doi:10.1016/j.cognition.2008.08.011
Botvinick, M. M., & Plaut, D. C. (2004). Doing without schema hierarchies: A recurrent connectionist approach to normal and impaired routine sequential action. Psychological Review, 111, 395–429.
Bradtke, S. J., & Duff, M. O. (1995). Reinforcement learning methods for continuous-time Markov decision problems. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems (Vol. 7, pp. 393–400). Cambridge, MA: MIT Press.
Bray, S., & O’Doherty, J. (2007). Neural coding of reward-prediction error signals during classical conditioning with attractive faces. Journal of Neurophysiology, 97, 3036–3045.
Brischoux, F., Chakraborty, S., Brierley, D. I., & Ungless, M. A. (2009). Phasic excitation of dopamine neurons in ventral VTA by noxious stimuli. Proceedings of the National Academy of Sciences, 106, 4894–4899.
Brown, L. L., & Wolfson, L. I. (1983). A dopamine-sensitive striatal efferent system mapped with [14C]deoxyglucose in the rat. Brain Research, 261, 213–229.
Calabresi, P., Pisani, A., Centonze, D., & Bernardi, G. (1997). Synaptic plasticity and physiological interactions between dopamine and glutamate in the striatum. Neuroscience & Biobehavioral Reviews, 21, 519–523.
Camerer, C. F., & Loewenstein, G. (2004). Behavioral economics: Past, present, future. In C. F. Camerer, G. Loewenstein, & M. Rabin (Eds.), Advances in behavioral economics (pp. 3–51). Princeton, NJ: Princeton University Press.
Cardinal, R. N., Parkinson, J. A., Hall, J., & Everitt, B. J. (2002). Emotion and motivation: The role of the amygdala, ventral striatum, and prefrontal cortex. Neuroscience & Biobehavioral Reviews, 26, 321–352.
Cassandra, A. R., Kaelbling, L. P., & Littman, M. L. (1994). Acting optimally in partially observable stochastic domains. In Proceedings of the 12th National Conference on Artificial Intelligence (pp. 1023–1028). Menlo Park, CA: AAAI Press.
Cavada, C., Company, T., Tejedor, J., Cruz-Rizzolo, R. J., & Reinoso-Suarez, F. (2000). The anatomical connections of the macaque monkey orbitofrontal cortex: A review. Cerebral Cortex, 10, 220–242.
Christoph, G. R., Leonzio, R. J., & Wilcox, K. S. (1986). Stimulation of the lateral habenula inhibits dopamine-containing neurons in the substantia nigra and ventral tegmental area of the rat. Journal of Neuroscience, 6, 613–619.
Cools, R., Robinson, O. J., & Sahakian, B. (2008). Acute tryptophan depletion in healthy volunteers enhances punishment prediction but does not affect reward prediction. Neuropsychopharmacology, 33, 2291–2299.
D’Ardenne, K., McClure, S. M., Nystrom, L. E., & Cohen, J. D. (2008). BOLD responses reflecting dopaminergic signals in the human ventral tegmental area. Science, 319, 1264–1267.
Daw, N. D. (2003). Reinforcement learning models of the dopamine system and their behavioral implications. Unpublished doctoral dissertation, Carnegie Mellon University, Pittsburgh.
Daw, N. D., Courville, A. C., & Dayan, P. (2008). Semi-rational models of conditioning: The case of trial order. In N. Chater & M. Oaksford (Eds.), The probabilistic mind: Prospects for Bayesian cognitive science (pp. 431–452). Oxford: Oxford University Press.
Daw, N. D., Courville, A. C., & Touretzky, D. S. (2006). Representation and timing in theories of the dopamine system. Neural Computation, 18, 1637–1677.
Daw, N. D., Kakade, S., & Dayan, P. (2002). Opponent interactions between serotonin and dopamine. Neural Networks, 15, 603–616.
Daw, N. D., Niv, Y., & Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8, 1704–1711.
Daw, N. D., Niv, Y., & Dayan, P. (2006). Actions, policies, values, and the basal ganglia. In E. Bezard (Ed.), Recent breakthroughs in basal ganglia research (pp. 111–130). New York: Nova Science.
Day, J. J., Roitman, M. F., Wightman, R. M., & Carelli, R. M. (2007). Associative learning mediates dynamic shifts in dopamine signaling in the nucleus accumbens. Nature Neuroscience, 10, 1020–1028.
Dearden, R., Friedman, N., & Russell, S. (1998). Bayesian Q-learning. In Proceedings of the 15th National Conference on Artificial Intelligence (pp. 761–768). Menlo Park, CA: AAAI Press.
De Pisapia, N., & Goddard, N. H. (2003). A neural model of frontostriatal interactions for behavioural planning and action chunking. Neurocomputing, 52–54, 489–495. doi:10.1016/S0925-2312(02)00753-1
Dickinson, A. (1985). Actions and habits: The development of behavioural autonomy. Philosophical Transactions of the Royal Society B, 308, 67–78.
Dickinson, A. (1994). Instrumental conditioning. In N. J. Mackintosh (Ed.), Animal learning and cognition (pp. 45–79). San Diego: Academic Press.
Domjan, M. (2003). The principles of learning and behavior (5th ed.). Belmont, CA: Thomson/Wadsworth.
Doya, K. (1996). Temporal difference learning in continuous time and space. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems (Vol. 8, pp. 1073–1079). Cambridge, MA: MIT Press.
Eblen, F., & Graybiel, A. M. (1995). Highly restricted origin of prefrontal cortical inputs to striosomes in the macaque monkey. Journal of Neuroscience, 15, 5999–6013.
Elliott, R., Newman, J. L., Longe, O. A., & Deakin, J. F. W. (2004). Instrumental responding for rewards is associated with enhanced neuronal response in subcortical reward systems. NeuroImage, 21, 984–990.
Elster, J. (1979). Ulysses and the sirens: Studies in rationality and irrationality. Cambridge: Cambridge University Press.
Engel, Y., Mannor, S., & Meir, R. (2003). Bayes meets Bellman: The Gaussian process approach to temporal difference learning. In Proceedings of the 20th International Conference on Machine Learning (pp. 154–161). Menlo Park, CA: AAAI Press.
Ferraro, G., Montalbano, M. E., Sardo, P., & La Grutta, V. (1996). Lateral habenular influence on dorsal raphe neurons. Brain Research Bulletin, 41, 47–52. doi:10.1016/0361-9230(96)00170-0
Fiorillo, C. D., Tobler, P. N., & Schultz, W. (2003). Discrete coding of reward probability and uncertainty by dopamine neurons. Science, 299, 1898–1902.
Fiorillo, C. D., Tobler, P. N., & Schultz, W. (2005). Evidence that the delay-period activity of dopamine neurons corresponds to reward uncertainty rather than backpropagating TD errors. Behavioral & Brain Functions, 1, 7.
Frederick, S., Loewenstein, G., & O’Donoghue, T. (2002). Time discounting and time preference: A critical review. Journal of Economic Literature, 40, 351–401.
Fujii, N., & Graybiel, A. M. (2005). Time-varying covariance of neural activities recorded in striatum and frontal cortex as monkeys perform sequential-saccade tasks. Proceedings of the National Academy of Sciences, 102, 9032–9037.
Gao, D. M., Hoffman, D., & Benabid, A. L. (1996). Simultaneous recording of spontaneous activities and nociceptive responses from neurons in the pars compacta of substantia nigra and in the lateral habenula. European Journal of Neuroscience, 8, 1474–1478.
Geisler, S., Derst, C., Veh, R. W., & Zahm, D. S. (2007). Glutamatergic afferents of the ventral tegmental area in the rat. Journal of Neuroscience, 27, 5730–5743.
Geisler, S., & Trimble, M. (2008). The lateral habenula: No longer neglected. CNS Spectrums, 13, 484–489.
Gerfen, C. R. (1984). The neostriatal mosaic: Compartmentalization of corticostriatal input and striatonigral output systems. Nature, 311, 461–464. doi:10.1038/311461a0
Gerfen, C. R. (1985). The neostriatal mosaic. I. Compartmental organization of projections from the striatum to the substantia nigra in the rat. Journal of Comparative Neurology, 236, 454–476.
Grace, A. A. (1991). Phasic versus tonic dopamine release and the modulation of dopamine system responsivity: A hypothesis for the etiology of schizophrenia. Neuroscience, 41, 1–24.
Grace, A. A. (2000). The tonic/phasic model of dopamine system regulation and its implications for understanding alcohol and psychostimulant craving. Addiction, 95(Suppl. 2), S119–S128.
Gray, T. S. (1999). Functional and anatomical relationships among the amygdala, basal forebrain, ventral striatum, and cortex: An integrative discussion. In J. F. McGinty (Ed.), Advancing from the ventral striatum to the amygdala: Implications for neuropsychiatry and drug abuse (Annals of the New York Academy of Sciences, Vol. 877, pp. 439–444). New York: New York Academy of Sciences.
Graybiel, A. M. (1990). Neurotransmitters and neuromodulators in the basal ganglia. Trends in Neurosciences, 13, 244–254.
Graybiel, A. M. (1998). The basal ganglia and chunking of action repertoires. Neurobiology of Learning & Memory, 70, 119–136.
Graybiel, A. M., & Ragsdale, C. W., Jr. (1978). Histochemically distinct compartments in the striatum of human, monkeys, and cat demonstrated by acetylthiocholinesterase staining. Proceedings of the National Academy of Sciences, 75, 5723–5726.
Green, L., & Myerson, J. (2004). A discounting framework for choice with delayed and probabilistic rewards. Psychological Bulletin, 130, 769–792.
Guarraci, F. A., & Kapp, B. S. (1999). An electrophysiological characterization of ventral tegmental area dopaminergic neurons during differential Pavlovian fear conditioning in the awake rabbit. Behavioural Brain Research, 99, 169–179.
Haber, S. N. (2003). The primate basal ganglia: Parallel and integrative networks. Journal of Chemical Neuroanatomy, 26, 317–330.
Haber, S. N., & Fudge, J. L. (1997). The interface between dopamine neurons and the amygdala: Implications for schizophrenia. Schizophrenia Bulletin, 23, 471–482. doi:10.1093/schbul/23.3.471
Hastie, R., & Dawes, R. M. (2001). Rational choice in an uncertain world: The psychology of judgment and decision making. New York: Sage.
Herkenham, M., & Nauta, W. J. (1979). Efferent connections of the habenular nuclei in the rat. Journal of Comparative Neurology, 187, 19–47.
Hertwig, R., Barron, G., Weber, E. U., & Erev, I. (2004). Decisions from experience and the effect of rare events in risky choice. Psychological Science, 15, 534–539. doi:10.1111/j.0956-7976.2004.00715.x
Hikosaka, K., & Watanabe, M. (2000). Delay activity of orbital and lateral prefrontal neurons of the monkey varying with different rewards. Cerebral Cortex, 10, 263–271.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations (pp. 77–109). Cambridge, MA: MIT Press.
Ho, M.-Y., Mobini, S., Chiang, T.-J., Bradshaw, C. M., & Szabadi, E. (1999). Theory and method in the quantitative analysis of “impulsive choice” behaviour: Implications for psychopharmacology. Psychopharmacology, 146, 362–372.
Hollerman, J. R., & Schultz, W. (1998). Dopamine neurons report an error in the temporal prediction of reward during learning. Nature Neuroscience, 1, 304–309.
Horvitz, J. C. (2000). Mesolimbocortical and nigrostriatal dopamine responses to salient non-reward events. Neuroscience, 96, 651–656.
Houk, J. C., Adams, J. L., & Barto, A. G. (1995). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 249–270). Cambridge, MA: MIT Press.
Hsu, M., Krajbich, I., Zhao, C., & Camerer, C. F. (2009). Neural response to reward anticipation under risk is nonlinear in probabilities. Journal of Neuroscience, 29, 2231–2237. doi:10.1523/jneurosci.5296-08.2009
Huettel, S. A., Stowe, C. J., Gordon, E. M., Warner, B. T., & Platt, M. L. (2006). Neural signatures of economic preferences for risk and ambiguity. Neuron, 49, 765–775.
Jay, T. M. (2003). Dopamine: A potential substrate for synaptic plasticity and memory mechanisms. Progress in Neurobiology, 69, 375–390. doi:10.1016/S0301-0082(03)00085-6
Ji, H., & Shepard, P. D. (2007). Lateral habenula stimulation inhibits rat midbrain dopamine neurons through a GABAA receptor-mediated mechanism. Journal of Neuroscience, 27, 6923–6930. doi:10.1523/jneurosci.0958-07.2007
Joel, D., Niv, Y., & Ruppin, E. (2002). Actor-critic models of the basal ganglia: New anatomical and computational perspectives. Neural Networks, 15, 535–547.
Joel, D., & Weiner, I. (2000). The connections of the dopaminergic system with the striatum in rats and primates: An analysis with respect to the functional and compartmental organization of the striatum. Neuroscience, 96, 451–474.
Jog, M. S., Kubota, Y., Connolly, C. I., Hillegaart, V., & Graybiel, A. M. (1999). Building neural representations of habits. Science, 286, 1745–1749.
Johnson, A., van der Meer, M. A. A., & Redish, A. D. (2007). Integrating hippocampus and striatum in decision-making. Current Opinion in Neurobiology, 17, 692–697.
Kable, J. W., & Glimcher, P. W. (2007). The neural correlates of subjective value during intertemporal choice. Nature Neuroscience, 10, 1625–1633.
Kacelnik, A. (1997). Normative and descriptive models of decision making: Time discounting and risk sensitivity. In G. R. Bock & G. Cardew (Eds.), Characterizing human psychological adaptations (Ciba Foundation Symposium, No. 208, pp. 51–70). New York: Wiley.
Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99–134.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47, 263–291.
Kalen, P., Strecker, R. E., Rosengren, E., & Bjorklund, A. (1989). Regulation of striatal serotonin release by the lateral habenula-dorsal raphe pathway in the rat as demonstrated by in vivo microdialysis: Role of excitatory amino acids and GABA. Brain Research, 492, 187–202.
Killcross, S., & Coutureau, E. (2003). Coordination of actions and habits in the medial prefrontal cortex of rats. Cerebral Cortex, 13, 400–408.
Kim, S., Hwang, J., & Lee, D. (2008). Prefrontal coding of temporally discounted values during intertemporal choice. Neuron, 59, 161–172.
Kirkland, K. L. (2002). High-tech brains: A history of technology-based analogies and models of nerve and brain function. Perspectives in Biology & Medicine, 45, 212–223. doi:10.1353/pbm.2002.0033
Knight, F. H. (1921). Risk, uncertainty and profit. Boston: Houghton Mifflin.
Knowlton, B. J., Mangels, J. A., & Squire, L. R. (1996). A neostriatal habit learning system in humans. Science, 273, 1399–1402.
Knutson, B., & Gibbs, S. E. (2007). Linking nucleus accumbens dopamine and blood oxygenation. Psychopharmacology, 191, 813–822.
Kobayashi, S., & Schultz, W. (2008). Influence of reward delays on responses of dopamine neurons. Journal of Neuroscience, 28, 7837–7846. doi:10.1523/jneurosci.1600-08.2008
Kozlowski, M. R., & Marshall, J. F. (1980). Plasticity of [14C]2-deoxy-D-glucose incorporation into neostriatum and related structures in response to dopamine neuron damage and apomorphine replacement. Brain Research, 197, 167–183.
Laibson, D. (1997). Golden eggs and hyperbolic discounting. Quarterly Journal of Economics, 112, 443–477.
Lévesque, M., & Parent, A. (2005). The striatofugal fiber system in primates: A reevaluation of its organization based on single-axon tracing studies. Proceedings of the National Academy of Sciences, 102, 11888–11893. doi:10.1073/pnas.0502710102
Loewenstein, G. (1996). Out of control: Visceral influences on behavior. Organizational Behavior & Human Decision Processes, 65, 272–292.
Logothetis, N. K., Pauls, J., Augath, M., Trinath, T., & Oeltermann, A. (2001). Neurophysiological investigation of the basis of the fMRI signal. Nature, 412, 150–157. doi:10.1038/35084005
Ludvig, E. A., Sutton, R. S., & Kehoe, E. J. (2008). Stimulus representation and the timing of reward-prediction errors in models of the dopamine system. Neural Computation, 20, 3034–3054.
Mantz, J., Thierry, A. M., & Glowinski, J. (1989). Effect of noxious tail pinch on the discharge rate of mesocortical and mesolimbic dopamine neurons: Selective activation of the mesocortical system. Brain Research, 476, 377–381.
Matsumoto, M., & Hikosaka, O. (2007). Lateral habenula as a source of negative reward signals in dopamine neurons. Nature, 447, 1111–1115.
Matsumoto, M., & Hikosaka, O. (2009a). Representation of negative motivational value in the primate lateral habenula. Nature Neuroscience, 12, 77–84.
Matsumoto, M., & Hikosaka, O. (2009b). Two types of dopamine neuron distinctly convey positive and negative motivational signals. Nature, 459, 837–841.
Mazur, J. E. (1987). An adjusting procedure for studying delayed reinforcement. In M. L. Commons, J. E. Mazur, J. A. Nevin, & H. Rachlin (Eds.), Quantitative analyses of behavior: Vol. 5. The effect of delay and of intervening events on reinforcement value (pp. 55–73). Hillsdale, NJ: Erlbaum.
Mazur, J. E. (2001). Hyperbolic value addition and general models of animal choice. Psychological Review, 108, 96–112.
Mazur, J. E. (2007). Choice in a successive-encounters procedure and hyperbolic decay of reinforcement. Journal of the Experimental Analysis of Behavior, 88, 73–85.
McClure, S. M., Berns, G. S., & Montague, P. R. (2003). Temporal prediction errors in a passive learning task activate human striatum. Neuron, 38, 339–346.
McClure, S. M., Laibson, D. I., Loewenstein, G., & Cohen, J. D. (2004). Separate neural systems value immediate and delayed monetary rewards. Science, 306, 503–507.
McCoy, A. N., & Platt, M. L. (2005). Risk-sensitive neurons in macaque posterior cingulate cortex. Nature Neuroscience, 8, 1220–1227.
McCulloch, J., Savaki, H. E., & Sokoloff, L. (1980). Influence of dopaminergic systems on the lateral habenular nucleus of the rat. Brain Research, 194, 117–124.
Metcalfe, J., & Mischel, W. (1999). A hot/cool-system analysis of delay of gratification: Dynamics of willpower. Psychological Review, 106, 3–19.
Michie, D. (1961). Trial and error. In S. A. Barnett & A. McLaren (Eds.), Science survey (Part 2, pp. 129–145). Harmondsworth, U.K.: Penguin.
Middleton, F. A., & Strick, P. L. (2001). A revised neuroanatomy of frontal-subcortical circuits. In D. G. Lichter & J. L. Cummings (Eds.), Frontal-subcortical circuits in psychiatric and neurological disorders. New York: Guilford.
Miller, G. A., Galanter, E., & Pribram, K. H. (1960). Plans and the structure of behavior. New York: Holt, Rinehart & Winston.
Minsky, M. (1963). Steps toward artificial intelligence. In E. A. Feigenbaum & J. Feldman (Eds.), Computers and thought (pp. 406–450). New York: McGraw-Hill.
Mirenowicz, J., & Schultz, W. (1994). Importance of unpredictability for reward responses in primate dopamine neurons. Journal of Neurophysiology, 72, 1024–1027.
Mirenowicz, J., & Schultz, W. (1996). Preferential activation of midbrain dopamine neurons by appetitive rather than aversive stimuli. Nature, 379, 449–451.
Monahan, G. E. (1982). A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 28, 1–16.
Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16, 1936–1947.
Morecraft, R. J., Geula, C., & Mesulam, M. M. (1992). Cytoarchitecture and neural afferents of orbitofrontal cortex in the brain of the monkey. Journal of Comparative Neurology, 323, 341–358.
Morris, G., Arkadir, D., Nevet, A., Vaadia, E., & Bergman, H. (2004). Coincident but distinct messages of midbrain dopamine and striatal tonically active neurons. Neuron, 43, 133–143.
Myerson, J., & Green, L. (1995). Discounting of delayed rewards: Models of individual choice. Journal of the Experimental Analysis of Behavior, 64, 263–276.
Nakahara, H., Itoh, H., Kawagoe, R., Takikawa, Y., & Hikosaka, O. (2004). Dopamine neurons can represent context-dependent prediction error. Neuron, 41, 269–280.
Nakamura, K., Matsumoto, M., & Hikosaka, O. (2008). Reward-dependent modulation of neuronal activity in the primate dorsal raphe nucleus. Journal of Neuroscience, 28, 5331–5343.
Niv, Y., Duff, M. O., & Dayan, P. (2005). Dopamine, uncertainty and TD learning. Behavioral & Brain Functions, 1, 6. doi:10.1186/1744-9081-1-6
Niv, Y., & Schoenbaum, G. (2008). Dialogues on prediction errors. Trends in Cognitive Sciences, 12, 265–272.
Oades, R. D., & Halliday, G. M. (1987). Ventral tegmental (A10) system: Neurobiology. 1. Anatomy and connectivity. Brain Research, 434, 117–165.
O’Doherty, J., Dayan, P., Friston, K., Critchley, H., & Dolan, R. J. (2003). Temporal difference models and reward-related learning in the human brain. Neuron, 38, 329–337.
O’Doherty, J., Dayan, P., Schultz, J., Deichmann, R., Friston, K., & Dolan, R. J. (2004). Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science, 304, 452–454.
O’Donoghue, T., & Rabin, M. (1999). Doing it now or later. American Economic Review, 89, 103–124.
Ongur, D., An, X., & Price, J. L. (1998). Prefrontal cortical projections to the hypothalamus in macaque monkeys. Journal of Comparative Neurology, 401, 480–505.
Packard, M. G., & Knowlton, B. J. (2002). Learning and memory functions of the basal ganglia. Annual Review of Neuroscience, 25, 563–593.
Pagnoni, G., Zink, C. F., Montague, P. R., & Berns, G. S. (2002). Activity in human ventral striatum locked to errors of reward prediction. Nature Neuroscience, 5, 97–98.
Park, M. R. (1987). Monosynaptic inhibitory postsynaptic potentials from lateral habenula recorded in dorsal raphe neurons. Brain Research Bulletin, 19, 581–586.
Parr, R. (1998). Hierarchical control and learning for Markov decision processes. Unpublished doctoral dissertation, University of California, Berkeley.
Paton, J. J., Belova, M. A., Morrison, S. E., & Salzman, C. D. (2006). The primate amygdala represents the positive and negative value of visual stimuli during learning. Nature, 439, 865–870.
Paulus, M. P., & Frank, L. R. (2006). Anterior cingulate activity modulates nonlinear decision weight function of uncertain prospects. NeuroImage, 30, 668–677.
Pessiglione, M., Seymour, B., Flandin, G., Dolan, R. J., & Frith, C. D. (2006). Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans. Nature, 442, 1042–1045.
Poupart, P., Vlassis, N., Hoey, J., & Regan, K. (2006). An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning (pp. 697–704). New York: ACM.
Prelec, D., & Loewenstein, G. (1991). Decision making over time and under uncertainty: A common approach. Management Science, 37, 770–786.
Preuschoff, K., & Bossaerts, P. (2007). Adding prediction risk to the theory of reward learning. In B. W. Balleine, K. Doya, J. O’Doherty, & M. Sakagami (Eds.), Reward and decision making in corticobasal ganglia networks (Annals of the New York Academy of Sciences, Vol. 1104, pp. 135–146). New York: New York Academy of Sciences.
Preuschoff, K., Bossaerts, P., & Quartz, S. R. (2006). Neural differentiation of expected reward and risk in human subcortical structures. Neuron, 51, 381–390.
Puterman, M. L. (2001). Dynamic programming. In R. A. Meyers (Ed.), Encyclopedia of physical science and technology (3rd ed., Vol. 4, pp. 673–696). San Diego: Academic Press.
Puterman, M. L. (2005). Markov decision processes: Discrete stochastic dynamic programming. Hoboken, NJ: Wiley-Interscience.
Rachlin, H., Raineri, A., & Cross, D. (1991). Subjective probability and delay. Journal of the Experimental Analysis of Behavior, 55, 233–244.
Ramm, P., Beninger, R. J., & Frost, B. J. (1984). Functional activity in the lateral habenular and dorsal raphe nuclei following administration of several dopamine receptor antagonists. Canadian Journal of Physiology & Pharmacology, 62, 1530–1533.
Redish, A. D., & Johnson, A. (2007). A computational model of craving and obsession. In B. W. Balleine, K. Doya, J. O’Doherty, & M. Sakagami (Eds.), Reward and decision making in corticobasal ganglia networks (Annals of the New York Academy of Sciences, Vol. 1104, pp. 324–339). New York: New York Academy of Sciences.
Reisine, T. D., Soubrié, P., Artaud, F., & Glowinski, J. (1982). Involvement of lateral habenula-dorsal raphe neurons in the differential regulation of striatal and nigral serotonergic transmission in cats. Journal of Neuroscience, 2, 1062–1071.
Rempel-Clower, N. L. (2007). Role of orbitofrontal cortex connections in emotion. In G. Schoenbaum, J. A. Gottfried, E. A. Murray, & S. J. Ramus (Eds.), Linking affect to action: Critical contributions of the orbitofrontal cortex (Annals of the New York Academy of Sciences, Vol. 1121, pp. 72–86). New York: New York Academy of Sciences.
Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II: Current research and theory (pp. 64–99). New York: Appleton-Century-Crofts.
Reynolds, J. N., & Wickens, J. R. (2002). Dopamine-dependent plasticity of corticostriatal synapses. Neural Networks, 15, 507–521. doi:10.1016/S0893-6080(02)00045-X
Richards, J. B., Mitchell, S. H., de Wit, H., & Seiden, L. S. (1997). Determination of discount functions in rats with an adjusting-amount procedure. Journal of the Experimental Analysis of Behavior, 67, 353–366.
Rodriguez, P. F., Aron, A. R., & Poldrack, R. A. (2006). Ventral-striatal/nucleus-accumbens sensitivity to prediction errors during classification learning. Human Brain Mapping, 27, 306–313.
Samuelson, P. (1937). A note on measurement of utility. Review of Economic Studies, 4, 155–161.
Santamaria, J. C., Sutton, R. S., & Ram, A. (1998). Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6, 163–218.
Schoenbaum, G., Chiba, A. A., & Gallagher, M. (1998). Orbitofrontal cortex and basolateral amygdala encode expected outcomes during learning. Nature Neuroscience, 1, 155–159.
Schoenbaum, G., & Roesch, M. (2005). Orbitofrontal cortex, associative learning, and expectancies. Neuron, 47, 633–636.
Schönberg, T., Daw, N. D., Joel, D., & O’Doherty, J. P. (2007). Reinforcement learning signals in the human striatum distinguish learners from nonlearners during reward-based decision making. Journal of Neuroscience, 27, 12860–12867.
Schultz, W. (1998). Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80, 1–27.
Schultz, W. (2000). Multiple reward signals in the brain. Nature Reviews Neuroscience, 1, 199–207.
Schultz, W. (2002). Getting formal with dopamine and reward. Neuron, 36, 241–263.
Schultz, W., Apicella, P., Scarnati, E., & Ljungberg, T. (1992). Neuronal activity in monkey ventral striatum related to the expectation of reward. Journal of Neuroscience, 12, 4595–4610.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.
Schultz, W., & Dickinson, A. (2000). Neuronal coding of prediction errors. Annual Review of Neuroscience, 23, 473–500.
Schultz, W., Preuschoff, K., Camerer, C., Hsu, M., Fiorillo, C. D., Tobler, P. N., & Bossaerts, P. (2008). Explicit neural signals reflecting reward uncertainty. Philosophical Transactions of the Royal Society B, 363, 3801–3811. doi:10.1098/rstb.2008.0152
Schultz, W., & Romo, R. (1987). Responses of nigrostriatal dopamine neurons to high-intensity somatosensory stimulation in the anesthetized monkey. Journal of Neurophysiology, 57, 201–217.
Schultz, W., Tremblay, L., & Hollerman, J. R. (2000). Reward processing in primate orbitofrontal cortex and basal ganglia. Cerebral Cortex, 10, 272–283.
Schweimer, J. V., Brierley, D. I., & Ungless, M. A. (2008). Phasic nociceptive responses in dorsal raphe serotonin neurons. Fundamental & Clinical Pharmacology, 22, 119.
Setlow, B., Schoenbaum, G., & Gallagher, M. (2003). Neural encoding in ventral striatum during olfactory discrimination learning. Neuron, 38, 625–636.
Shohamy, D., Myers, C. E., Grossman, S., Sage, J., Gluck, M. A., & Poldrack, R. A. (2004). Cortico-striatal contributions to feedback-based learning: Converging data from neuroimaging and neuropsychology. Brain, 127, 851–859.
Simmons, J. M., Ravel, S., Shidara, M., & Richmond, B. J. (2007). A comparison of reward-contingent neuronal activity in monkey orbitofrontal cortex and ventral striatum: Guiding actions toward rewards. In G. Schoenbaum, J. A. Gottfried, E. A. Murray, & S. J. Ramus (Eds.), Linking affect to action: Critical contributions of the orbitofrontal cortex (Annals of the New York Academy of Sciences, Vol. 1121, pp. 376–394). New York: New York Academy of Sciences.
Smart, W. D., & Kaelbling, L. P. (2000). Practical reinforcement learning in continuous spaces. In Proceedings of the 17th International Conference on Machine Learning (pp. 903–910). San Francisco: Morgan Kaufmann.
Sozou, P. D. (1998). On hyperbolic discounting and uncertain hazard rates. Proceedings of the Royal Society B, 265, 2015–2020.
Stern, W. C., Johnson, A., Bronzino, J. D., & Morgane, P. J. (1979). Effects of electrical stimulation of the lateral habenula on single-unit activity of raphe neurons. Experimental Neurology, 65, 326–342.
Stevens, S. S. (1957). On the psychophysical law. Psychological Review, 64, 153–181.
Suri, R. E. (2002). TD models of reward predictive responses in dopamine neurons. Neural Networks, 15, 523–533.
Suri, R. E., & Schultz, W. (1999). A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task. Neuroscience, 91, 871–890.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. R. Gabriel & J. Moore (Eds.), Learning and computational neuroscience: Foundations of adaptive networks (pp. 497–537). Cambridge, MA: MIT Press.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181–211.
Tan, C. O., & Bullock, D. (2008). A local circuit model of learned striatal and dopamine cell responses under probabilistic schedules of reward. Journal of Neuroscience, 28, 10062–10074.
Thiébot, M. H., Hamon, M., & Soubrié, P. (1983). The involvement of nigral serotonin innervation in the control of punishment-induced behavioral inhibition in rats. Pharmacology, Biochemistry & Behavior, 19, 225–229.
Thorndike, E. L. (1898). Animal intelligence: An experimental study of the associative processes in animals. Psychological Review Monograph Supplements, 2(4, Whole No. 8).
Tobler, P. N., Christopoulos, G. I., O’Doherty, J. P., Dolan, R. J., & Schultz, W. (2008). Neuronal distortions of reward probability without choice. Journal of Neuroscience, 28, 11703–11711.
Tobler, P. N., Fiorillo, C. D., & Schultz, W. (2005). Adaptive coding of reward value by dopamine neurons. Science, 307, 1642–1645.
Tobler, P. N., O’Doherty, J. P., Dolan, R. J., & Schultz, W. (2007). Reward value coding distinct from risk attitude-related uncertainty coding in human reward systems. Journal of Neurophysiology, 97, 1621–1632.
Tolman, E. C. (1932). Purposive behavior in animals and men. New York: Appleton Century.
Tremblay, L., & Schultz, W. (1999). Relative reward preference in primate orbitofrontal cortex. Nature, 398, 704–708.
Tremblay, L., & Schultz, W. (2000). Reward-related neuronal activity during go-nogo task performance in primate orbitofrontal cortex. Journal of Neurophysiology, 83, 1864–1876.
Trepel, C., Fox, C. R., & Poldrack, R. A. (2005). Prospect theory on the brain? Toward a cognitive neuroscience of decision under risk. Cognitive Brain Research, 23, 34–50.
Tricomi, E. M., Delgado, M. R., & Fiez, J. A. (2004). Modulation of caudate activity by action contingency. Neuron, 41, 281–292.
Tversky, A., & Kahneman, D. (1992). Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk & Uncertainty, 5, 297–323.
Tye, N. C., Everitt, B. J., & Iversen, S. D. (1977). 5-Hydroxytryptamine and punishment. Nature, 268, 741–743.
Ungless, M. A., Magill, P. J., & Bolam, J. P. (2004). Uniform inhibition of dopamine neurons in the ventral tegmental area by aversive stimuli. Science, 303, 2040–2042.
von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior. Princeton, NJ: Princeton University Press.
Wan, X., & Peoples, L. L. (2006). Firing patterns of accumbal neurons during a Pavlovian-conditioned approach task. Journal of Neurophysiology, 96, 652–660.
Wang, R. Y., & Aghajanian, G. K. (1977). Physiological evidence for habenula as major link between forebrain and midbrain raphe. Science, 197, 89–91.
White, N. M., & Hiroi, N. (1998). Preferential localization of self-stimulation sites in striosomes/patches in the rat striatum. Proceedings of the National Academy of Sciences, 95, 6486–6491.
Wickens, J. R., Budd, C. S., Hyland, B. I., & Arbuthnott, G. W. (2007). Striatal contributions to reward and decision making: Making sense of regional variations in a reiterated processing matrix. In B. W. Balleine, K. Doya, J. O’Doherty, & M. Sakagami (Eds.), Reward and decision making in corticobasal ganglia networks (Annals of the New York Academy of Sciences, Vol. 1104, pp. 192–212). New York: New York Academy of Sciences.
Wilkinson, L. O., & Jacobs, B. L. (1988). Lack of response of serotonergic neurons in the dorsal raphe nucleus of freely moving cats to stressful stimuli. Experimental Neurology, 101, 445–457.
Witten, I. H. (1977). An adaptive optimal controller for discrete-time Markov environments. Information & Control, 34, 286–295.
Wooten, G. F., & Collins, R. C. (1981). Metabolic effects of unilateral lesion of the substantia nigra. Journal of Neuroscience, 1, 285–291.
Yang, L.-M., Hu, B., Xia, Y.-H., Zhang, B.-L., & Zhao, H. (2008). Lateral habenula lesions improve the behavioral response in depressed rats via increasing the serotonin level in dorsal raphe nucleus. Behavioural Brain Research, 188, 84–90.
Yin, H. H., & Knowlton, B. J. (2006). The role of the basal ganglia in habit formation. Nature Reviews Neuroscience, 7, 464–476.
Zald, D. H., & Kim, S. W. (2001). The orbitofrontal cortex. In S. P. Salloway, P. F. Malloy, & J. D. Duffy (Eds.), The frontal lobes and neuropsychiatric illness (pp. 33–69). Washington, DC: American Psychiatric Publishing.
Maia, T.V. Reinforcement learning, conditioning, and the brain: Successes and challenges. Cognitive, Affective, & Behavioral Neuroscience 9, 343–364 (2009). https://doi.org/10.3758/CABN.9.4.343
Keywords: Conditioned stimulus · Prediction error · Reinforcement learning · Dopamine neuron · Markov decision process