Abstract
Reinforcement learning in neural networks requires a mechanism for exploring new network states in response to a single, nonspecific reward signal. Existing models have introduced synaptic or neuronal noise to drive this exploration. However, such noise tends largely to average out, precluding or significantly hindering learning, when coding by neuronal populations or by mean firing rates is considered. Furthermore, careful tuning is required to strike the elusive balance between the often conflicting demands of speed and reliability of learning. Here we show that there is in fact no need to rely on intrinsic noise. Instead, ongoing synaptic plasticity triggered by the naturally occurring online sampling of a stimulus out of an entire stimulus set produces enough fluctuations in the synaptic efficacies for successful learning. By combining stimulus sampling with reward attenuation, we demonstrate that a simple Hebbian-like learning rule yields performance very close to that of primates on visuomotor association tasks. In contrast, learning rules based on intrinsic noise (node and weight perturbation) are markedly slower. Furthermore, the performance advantage of our approach persists for more complex tasks and network architectures. We suggest that stimulus sampling and reward attenuation are two key components of a framework by which any single-cell supervised learning rule can be converted into a reinforcement learning rule for networks, without requiring any intrinsic noise source.
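The mechanism described in the abstract can be illustrated with a minimal sketch. This is a hypothetical toy implementation, not the authors' exact model: the stimulus set, the ±1 reward convention, the learning rate, and the per-stimulus running-average form of reward attenuation are all illustrative assumptions. The key point it demonstrates is that the readout is fully deterministic; exploration arises only because plasticity triggered by one sampled stimulus perturbs the responses to the other, overlapping stimuli.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: four overlapping stimulus vectors, each of which
# is rewarded for one of two motor responses (a visuomotor association).
n_stim, n_act, n_dim = 4, 2, 8
X = rng.standard_normal((n_stim, n_dim))   # stimulus patterns (they overlap)
targets = np.array([0, 1, 0, 1])           # rewarded action per stimulus
W = np.zeros((n_act, n_dim))               # plastic readout synapses
r_bar = np.zeros(n_stim)                   # running per-stimulus reward estimate
eta, tau = 0.2, 0.1                        # learning rate, attenuation step

rewards = []
for trial in range(2000):
    s = rng.integers(n_stim)               # online stimulus sampling: the only
    x = X[s]                               # source of exploratory fluctuation
    a = int(np.argmax(W @ x))              # deterministic, noise-free readout
    R = 1.0 if a == targets[s] else -1.0   # single, nonspecific reward signal
    # Hebbian-like update on the chosen action's synapses, gated by the
    # reward attenuated through the running estimate r_bar: as a stimulus
    # becomes reliably rewarded, r_bar[s] -> 1 and the update vanishes.
    W[a] += eta * (R - r_bar[s]) * x
    r_bar[s] += tau * (R - r_bar[s])
    rewards.append(R)

# Fraction of stimuli mapped to their rewarded action after training.
acc = np.mean([np.argmax(W @ X[s]) == targets[s] for s in range(n_stim)])
```

Because the stimulus vectors overlap, each weight update jitters the network's responses to the other stimuli, so the random order in which stimuli are sampled supplies the fluctuations that intrinsic-noise schemes (node or weight perturbation) would otherwise have to inject.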
This work was supported by the Swiss National Science Foundation grant K-32K0-118084.
Cite this article
Vladimirskiy, B.B., Vasilaki, E., Urbanczik, R. et al. Stimulus sampling as an exploration mechanism for fast reinforcement learning. Biol Cybern 100, 319–330 (2009). https://doi.org/10.1007/s00422-009-0305-x