Abstract
Reinforcement learning in neural networks requires a mechanism for exploring new network states in response to a single, nonspecific reward signal. Existing models have introduced synaptic or neuronal noise to drive this exploration. However, such noise tends largely to average out, precluding or significantly hindering learning, when coding by neuronal populations or by mean firing rates is considered. Furthermore, careful tuning is required to strike the elusive balance between the often conflicting demands of speed and reliability of learning. Here we show that there is in fact no need to rely on intrinsic noise. Instead, ongoing synaptic plasticity triggered by the naturally occurring online sampling of a stimulus out of an entire stimulus set produces enough fluctuations in the synaptic efficacies for successful learning. By combining stimulus sampling with reward attenuation, we demonstrate that a simple Hebbian-like learning rule yields performance very close to that of primates on visuomotor association tasks. In contrast, learning rules based on intrinsic noise (node and weight perturbation) are markedly slower. Furthermore, the performance advantage of our approach persists for more complex tasks and network architectures. We suggest that stimulus sampling and reward attenuation are two key components of a framework by which any single-cell supervised learning rule can be converted into a reinforcement learning rule for networks, without requiring any intrinsic noise source.
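The mechanism described in the abstract can be illustrated with a minimal sketch. This is a hypothetical toy implementation, not the authors' exact model: the stimulus set, the ±1 reward convention, the learning rate, and the per-stimulus running-average form of reward attenuation are all illustrative assumptions. The key point it demonstrates is that the readout is fully deterministic; exploration arises only because plasticity triggered by one sampled stimulus perturbs the responses to the other, overlapping stimuli.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: four overlapping stimulus vectors, each of which
# is rewarded for one of two motor responses (a visuomotor association).
n_stim, n_act, n_dim = 4, 2, 8
X = rng.standard_normal((n_stim, n_dim))   # stimulus patterns (they overlap)
targets = np.array([0, 1, 0, 1])           # rewarded action per stimulus
W = np.zeros((n_act, n_dim))               # plastic readout synapses
r_bar = np.zeros(n_stim)                   # running per-stimulus reward estimate
eta, tau = 0.2, 0.1                        # learning rate, attenuation step

rewards = []
for trial in range(2000):
    s = rng.integers(n_stim)               # online stimulus sampling: the only
    x = X[s]                               # source of exploratory fluctuation
    a = int(np.argmax(W @ x))              # deterministic, noise-free readout
    R = 1.0 if a == targets[s] else -1.0   # single, nonspecific reward signal
    # Hebbian-like update on the chosen action's synapses, gated by the
    # reward attenuated through the running estimate r_bar: as a stimulus
    # becomes reliably rewarded, r_bar[s] -> 1 and the update vanishes.
    W[a] += eta * (R - r_bar[s]) * x
    r_bar[s] += tau * (R - r_bar[s])
    rewards.append(R)

# Fraction of stimuli mapped to their rewarded action after training.
acc = np.mean([np.argmax(W @ X[s]) == targets[s] for s in range(n_stim)])
```

Because the stimulus vectors overlap, each weight update jitters the network's responses to the other stimuli, so the random order in which stimuli are sampled supplies the fluctuations that intrinsic-noise schemes (node or weight perturbation) would otherwise have to inject.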
This work was supported by the Swiss National Science Foundation grant K-32K0-118084.
Cite this article
Vladimirskiy, B.B., Vasilaki, E., Urbanczik, R. et al. Stimulus sampling as an exploration mechanism for fast reinforcement learning. Biol Cybern 100, 319–330 (2009). https://doi.org/10.1007/s00422-009-0305-x