Stimulus sampling as an exploration mechanism for fast reinforcement learning

Original Paper, published in Biological Cybernetics

Abstract

Reinforcement learning in neural networks requires a mechanism for exploring new network states in response to a single, nonspecific reward signal. Existing models have introduced synaptic or neuronal noise to drive this exploration. However, such noise tends to average out, precluding or significantly hindering learning, when coding by neuronal populations or by mean firing rates is considered. Furthermore, careful tuning is required to strike the elusive balance between the often conflicting demands of speed and reliability of learning. Here we show that there is in fact no need to rely on intrinsic noise. Instead, ongoing synaptic plasticity triggered by the naturally occurring online sampling of a stimulus out of an entire stimulus set produces enough fluctuations in the synaptic efficacies for successful learning. By combining stimulus sampling with reward attenuation, we demonstrate that a simple Hebbian-like learning rule yields performance very close to that of primates on visuomotor association tasks. In contrast, learning rules based on intrinsic noise (node and weight perturbation) are markedly slower. Furthermore, the performance advantage of our approach persists for more complex tasks and network architectures. We suggest that stimulus sampling and reward attenuation are two key components of a framework by which any single-cell supervised learning rule can be converted into a reinforcement learning rule for networks, without requiring any intrinsic noise source.
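
The mechanism sketched in the abstract can be illustrated with a minimal toy model. This is not the authors' actual network or learning rule; it is an assumed simplification in which a one-layer network learns a visuomotor association task (each stimulus mapped to one correct action) from a single binary reward. Exploration comes only from the random online sampling of stimuli, and reward attenuation is implemented here as a running-average reward baseline; the task size and the constants `eta` and `tau` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_stimuli, n_actions = 4, 4            # toy visuomotor task: each stimulus has one correct action
targets = rng.permutation(n_actions)   # hypothetical stimulus -> action mapping to be learned
W = np.zeros((n_actions, n_stimuli))   # synaptic efficacies
r_bar = 0.0                            # running-average reward, used to attenuate updates
eta, tau = 0.5, 0.1                    # learning rate and baseline time constant (arbitrary)

for trial in range(2000):
    s = int(rng.integers(n_stimuli))   # online stimulus sampling: the only source of variability
    # winner-take-all action choice; tiny noise breaks ties between equal weights only
    a = int(np.argmax(W[:, s] + 1e-9 * rng.standard_normal(n_actions)))
    r = 1.0 if a == targets[s] else 0.0      # single, nonspecific reward signal
    x = np.zeros(n_stimuli); x[s] = 1.0      # presynaptic (stimulus) activity
    y = np.zeros(n_actions); y[a] = 1.0      # postsynaptic (chosen action) activity
    W += eta * (r - r_bar) * np.outer(y, x)  # Hebbian update gated by attenuated reward
    r_bar += tau * (r - r_bar)               # attenuation: updates shrink as performance rises

correct = sum(int(np.argmax(W[:, s]) == targets[s]) for s in range(n_stimuli))
print(f"{correct} of {n_stimuli} associations learned")
```

Because the chosen action's weight grows on rewarded trials and is depressed on unrewarded ones (once the baseline is positive), the mapping is acquired without any injected synaptic or neuronal noise, which is the point the abstract makes against node and weight perturbation.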


Author information

Corresponding author

Correspondence to Boris B. Vladimirskiy.

Additional information

This work was supported by the Swiss National Science Foundation grant K-32K0-118084.

About this article

Cite this article

Vladimirskiy, B.B., Vasilaki, E., Urbanczik, R. et al. Stimulus sampling as an exploration mechanism for fast reinforcement learning. Biol Cybern 100, 319–330 (2009). https://doi.org/10.1007/s00422-009-0305-x
