Abstract
Asynchrony, overlaps, and delays in sensory–motor signals introduce ambiguity as to which stimuli, actions, and rewards are causally related. Only the repetition of reward episodes helps distinguish true cause–effect relationships from coincidental occurrences. In the model proposed here, a novel plasticity rule employs short- and long-term changes to evaluate hypotheses on cause–effect relationships. Transient weights represent hypotheses that are consolidated in long-term memory only when they consistently predict or cause future rewards. The main objective of the model is to preserve existing network topologies when learning with ambiguous information flows. Learning is also improved by biasing the exploration of the stimulus–response space toward actions that in the past occurred before rewards. The model indicates under which conditions beliefs can be consolidated in long-term memory, suggests a solution to the plasticity–stability dilemma, and proposes an interpretation of the role of short-term plasticity.
Notes
In that case, it is essential that the traces \(E\) are bounded to positive values: negative traces multiplying the negative baseline modulation would lead to unwanted weight increases.
The exact increment depends on the learning rate, on the exact circumstantial delay between activity and reward, and on the intensity of the stochastic reward.
References
Abbott LF, Regehr WG (2004) Synaptic computation. Nature 431:796–803
Abraham WC (2008) Metaplasticity: tuning synapses and networks for plasticity. Nat Rev Neurosci 9:387–399
Abraham WC, Bear MF (1996) Metaplasticity: the plasticity of synaptic plasticity. Trends Neurosci 19:126–130
Abraham WC, Robins A (2005) Memory retention—the synaptic stability versus plasticity dilemma. Trends Neurosci 28:73–78
Alexander WH, Sporns O (2002) An embodied model of learning, plasticity, and reward. Adapt Behav 10(3–4):143–159
Asada M, Hosoda K, Kuniyoshi Y, Ishiguro H, Inui T, Yoshikawa Y, Ogino M, Yoshida C (2009) Cognitive developmental robotics: a survey. IEEE Trans Auton Mental Dev 1(1):12–34
Bailey CH, Giustetto M, Huang YY, Hawkins RD, Kandel ER (2000) Is heterosynaptic modulation essential for stabilizing Hebbian plasticity and memory? Nat Rev Neurosci 1(1):11–20
Baras D, Meir R (2007) Reinforcement learning, spike-time-dependent plasticity, and the BCM rule. Neural Comput 19(8):2245–2279
Ben-Gal I (2007) Bayesian networks. In: Encyclopedia of statistics in quality and reliability. Wiley, London
Berridge KC (2007) The debate over dopamine’s role in reward: the case for incentive salience. Psychopharmacology 191:391–431
Bosman R, van Leeuwen W, Wemmenhove B (2004) Combining Hebbian and reinforcement learning in a minibrain model. Neural Netw 17:29–36
Bouton ME (1994) Conditioning, remembering, and forgetting. J Exp Psychol Anim Behav Process 20(3):219
Bouton ME (2000) A learning theory perspective on lapse, relapse, and the maintenance of behavior change. Health Psychol 19(1S):57
Bouton ME (2004) Context and behavioral processes in extinction. Learn Mem 11(5):485–494
Bouton ME, Moody EW (2004) Memory processes in classical conditioning. Neurosci Biobehav Rev 28(7):663–674
Brembs B (2003) Operant conditioning in invertebrates. Curr Opin Neurobiol 13(6):710–717
Brembs B, Lorenzetti FD, Reyes FD, Baxter DA, Byrne JH (2002) Operant reward learning in Aplysia: neuronal correlates and mechanisms. Science 296(5573):1706–1709
Clopath C, Ziegler L, Vasilaki E, Büsing L, Gerstner W (2008) Tag-trigger-consolidation: a model of early and late long-term-potentiation and depression. PLoS Comput Biol 4(12):335–347
Cox RB, Krichmar JL (2009) Neuromodulation as a robot controller: a brain inspired strategy for controlling autonomous robots. IEEE Robot Autom Mag 16(3):72–80
Deco G, Rolls ET (2005) Synaptic and spiking dynamics underlying reward reversal in the orbitofrontal cortex. Cereb Cortex 15:15–30
Dudai Y (2004) The neurobiology of consolidations, or, how stable is the engram? Annu Rev Psychol 55:51–86
Farries MA, Fairhall AL (2007) Reinforcement learning with modulated spike timing-dependent synaptic plasticity. J Neurophysiol 98:3648–3665
Fisher SA, Fischer TM, Carew TJ (1997) Multiple overlapping processes underlying short-term synaptic enhancement. Trends Neurosci 20(4):170–177
Florian RV (2007) Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity. Neural Comput 19:1468–1502
Frémaux N, Sprekeler H, Gerstner W (2010) Functional requirements for reward-modulated spike-timing-dependent plasticity. J Neurosci 30(40):13326–13337
Frey U, Morris RGM (1997) Synaptic tagging and long-term potentiation. Nature 385:533–536
Friedrich J, Urbanczik R, Senn W (2010) Learning spike-based population codes by reward and population feedback. Neural Comput 22:1698–1717
Friedrich J, Urbanczik R, Senn W (2011) Spatio-temporal credit assignment in neuronal population learning. PLoS Comput Biol 7(6):1–13
Fusi S, Senn W (2006) Eluding oblivion with smart stochastic selection of synaptic updates. Chaos: An Interdiscip J Nonlinear Sci 16(2):026112
Fusi S, Drew PJ, Abbott L (2005) Cascade models of synaptically stored memories. Neuron 45(4):599–611
Fusi S, Asaad WF, Miller EK, Wang XJ (2007) A neural circuit model of flexible sensorimotor mapping: learning and forgetting on multiple timescales. Neuron 54(2):319–333
Garris P, Ciolkowski E, Pastore P, Wightman R (1994) Efflux of dopamine from the synaptic cleft in the nucleus accumbens of the rat brain. J Neurosci 14(10):6084–6093
Gerstner W (2010) From Hebb rules to spike-timing-dependent plasticity: a personal account. Front Synaptic Neurosci 2:1–3
Gil M, DeMarco RJ, Menzel R (2007) Learning reward expectations in honeybees. Learn Mem 14:491–496
Goelet P, Castellucci VF, Schacher S, Kandel ER (1986) The long and the short of long-term memory—a molecular framework. Nature 322(6078):419–422
Grossberg S (1971) On the dynamics of operant conditioning. J Theor Biol 33(2):225–255
Grossberg S (1988) Nonlinear neural networks: principles, mechanisms, and architectures. Neural Netw 1:17–61
Hamilton RH, Pascual-Leone A (1998) Cortical plasticity associated with braille learning. Trends Cogn Sci 2(5):168–174
Hammer M, Menzel R (1995) Learning and memory in the honeybee. J Neurosci 15(3):1617–1630
Heckerman D, Geiger D, Chickering DM (1995) Learning Bayesian networks: the combination of knowledge and statistical data. Mach Learn 20:197–243
Howson C, Urbach P (1989) Scientific reasoning: the Bayesian approach. Open Court Publishing Co, Chicago, USA
Hull CL (1943) Principles of behavior. Appleton Century, New York
Izhikevich EM (2007) Solving the distal reward problem through linkage of STDP and dopamine signaling. Cereb Cortex 17:2443–2452
Jay MT (2003) Dopamine: a potential substrate for synaptic plasticity and memory mechanisms. Prog Neurobiol 69(6):375–390
Jonides J, Lewis RL, Nee DE, Lustig CA, Berman MG, Moore KS (2008) The mind and brain of short-term memory. Annu Rev Psychol 59:193
Kempter R, Gerstner W, Van Hemmen JL (1999) Hebbian learning and spiking neurons. Phys Rev E 59(4):4498–4514
Krichmar JL, Roehrbein F (2013) Value and reward based learning in neurorobots. Front Neurorobot 7(13):1–2
Lamprecht R, LeDoux J (2004) Structural plasticity and memory. Nat Rev Neurosci 5(1):45–54
Legenstein R, Chase SM, Schwartz A, Maass W (2010) A reward-modulated Hebbian learning rule can explain experimentally observed network reorganization in a brain control task. J Neurosci 30(25):8400–8410
Leibold C, Kempter R (2008) Sparseness constrains the prolongation of memory lifetime via synaptic metaplasticity. Cereb Cortex 18(1):67–77
Lin LJ (1993) Reinforcement learning for robots using neural networks. Ph.D. thesis, School of Computer Science. Carnegie Mellon University
Lungarella M, Metta G, Pfeifer R, Sandini G (2003) Developmental robotics: a survey. Connect Sci 15(4):151–190
Lynch MA (2004) Long-term potentiation and memory. Physiol Rev 84(1):87–136
Mayford M, Siegelbaum SA, Kandel ER (2012) Synapses and memory storage. Cold Spring Harbor Perspect Biol 4(6):a005751
McGaugh JL (2000) Memory—a century of consolidation. Science 287:248–251
Menzel R, Müller U (1996) Learning and memory in honeybees: from behavior to natural substrates. Annu Rev Neurosci 19:379–404
Montague PR, Dayan P, Person C, Sejnowski TJ (1995) Bee foraging in uncertain environments using predictive Hebbian learning. Nature 377:725–728
Nguyen PV, Abel T, Kandel ER (1994) Requirement of a critical period of transcription for induction of a late phase of LTP. Science 265(5175):1104–1107
Nitz DA, Kargo WJ, Fleisher J (2007) Dopamine signaling and the distal reward problem. Neuroreport 18(17):1833–1836
O’Brien MJ, Srinivasan N (2013) A spiking neural model for stable reinforcement of synapses based on multiple distal rewards. Neural Comput 25(1):123–156
O’Doherty JP, Kringelbach ML, Rolls ET, Andrews C (2001) Abstract reward and punishment representations in the human orbitofrontal cortex. Nat Neurosci 4(1):95–102
Ono K (1987) Superstitious behavior in humans. J Exp Anal Behav 47(3):261–271
Päpper M, Kempter R, Leibold C (2011) Synaptic tagging, evaluation of memories, and the distal reward problem. Learn Mem 18:58–70
Pennartz CMA (1996) The ascending neuromodulatory systems in learning by reinforcement: comparing computational conjectures with experimental findings. Brain Res Rev 21:219–245
Pennartz CMA (1997) Reinforcement learning by Hebbian synapses with adaptive threshold. Neuroscience 81(2):303–319
Redgrave P, Gurney K, Reynolds J (2008) What is reinforced by phasic dopamine signals? Brain Res Rev 58:322–339
Robins A (1995) Catastrophic forgetting, rehearsal, and pseudorehearsal. Connect Sci J Neural Comput Artif Intell Cogn Res 7:123–146
Sandberg A, Tegnér J, Lansner A (2003) A working memory model based on fast Hebbian learning. Netw Comput Neural Syst 14(4):789–802
Sarkisov DV, Wang SSH (2008) Order-dependent coincidence detection in cerebellar Purkinje neurons at the inositol trisphosphate receptor. J Neurosci 28(1):133–142
Schultz W (1998) Predictive reward signal of dopamine neurons. J Neurophysiol 80:1–27
Schultz W, Apicella P, Ljungberg T (1993) Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. J Neurosci 13:900–913
Schultz W, Dayan P, Montague PR (1997) A neural substrate for prediction and reward. Science 275:1593–1598
Senn W, Fusi S (2005) Learning only when necessary: better memories of correlated patterns in networks with bounded synapses. Neural Comput 17(10):2106–2138
Skinner BF (1948) “Superstition” in the pigeon. J Exp Psychol 38:168–172
Skinner BF (1953) Science and human behavior. MacMillan, New York
Soltoggio A, Stanley KO (2012) From modulated Hebbian plasticity to simple behavior learning through noise and weight saturation. Neural Netw 34:28–41
Soltoggio A, Steil JJ (2013) Solving the distal reward problem with rare correlations. Neural Comput 25(4):940–978
Soltoggio A, Bullinaria JA, Mattiussi C, Dürr P, Floreano D (2008) Evolutionary advantages of neuromodulated plasticity in dynamic, reward-based scenarios. In: Artificial life XI: proceedings of the eleventh international conference on the simulation and synthesis of living systems. MIT Press, Cambridge
Soltoggio A, Lemme A, Reinhart FR, Steil JJ (2013a) Rare neural correlations implement robotic conditioning with reward delays and disturbances. Front Neurorobot 7:1–16 (Research Topic: Value and Reward Based Learning in Neurobots)
Soltoggio A, Reinhart FR, Lemme A, Steil JJ (2013b) Learning the rules of a game: neural conditioning in human–robot interaction with delayed rewards. In: Proceedings of the third joint IEEE international conference on development and learning and on epigenetic robotics, Osaka, Japan
Sporns O, Alexander WH (2002) Neuromodulation and plasticity in an autonomous robot. Neural Netw 15:761–774
Sporns O, Alexander WH (2003) Neuromodulation in a learning robot: interactions between neural plasticity and behavior. Proc Int Joint Conf Neural Netw 4:2789–2794
Staubli U, Fraser D, Faraday R, Lynch G (1987) Olfaction and the “data” memory system in rats. Behav Neurosci 101(6):757–765
Sutton RS (1984) Temporal credit assignment in reinforcement learning. Ph.D. thesis, Department of Computer Science, University of Massachusetts, Amherst, MA 01003
Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge, MA, USA
Swartzentruber D (1995) Modulatory mechanisms in Pavlovian conditioning. Anim Learn Behav 23(2):123–143
Thorndike EL (1911) Animal intelligence. Macmillan, New York
Urbanczik R, Senn W (2009) Reinforcement learning in populations of spiking neurons. Nat Neurosci 12:250–252
Van Hemmen J (1997) Hebbian learning, its correlation catastrophe, and unlearning. Netw Comput Neural Syst 8(3):V1–V17
Wang SSH, Denk W, Häusser M (2000) Coincidence detection in single dendritic spines mediated by calcium release. Nat Neurosci 3(12):1266–1273
Weng J, McClelland J, Pentland A, Sporns O, Stockman I, Sur M, Thelen E (2001) Autonomous mental development by robots and animals. Science 291(5504):599–600
Wightman R, Zimmerman J (1990) Control of dopamine extracellular concentration in rat striatum by impulse flow and uptake. Brain Res Brain Res Rev 15(2):135–144
Wise RA, Rompre PP (1989) Brain dopamine and reward. Annu Rev Psychol 40:191–225
Xie X, Seung HS (2004) Learning in neural networks by reinforcement of irregular spiking. Phys Rev E 69:1–10
Ziemke T, Thieme M (2002) Neuromodulation of reactive sensorimotor mappings as short-term memory mechanism in delayed response tasks. Adapt Behav 10:185–199
Zucker RS (1989) Short-term synaptic plasticity. Annu Rev Neurosci 12(1):13–31
Zucker RS, Regehr WG (2002) Short-term synaptic plasticity. Annu Rev Physiol 64(1):355–405
Acknowledgments
The author thanks William Land, Albert Mukovskiy, Kenichi Narioka, Felix Reinhart, Walter Senn, Kenneth Stanley, and Paul Tonelli for constructive discussions and valuable comments on early drafts of the manuscript. A large part of this work was carried out while the author was with the CoR-Lab at Bielefeld University, funded by the European Community’s Seventh Framework Programme FP7/2007-2013, Challenge 2 Cognitive Systems, Interaction, Robotics under Grant Agreement No. 248311-AMARSi.
Appendices
Appendix 1: Unlearning
Unlearning of the long-term components of the weights can be implemented symmetrically to learning, i.e., when a transient weight is strongly negative (lower than \(-\varPsi \)), the long-term component of that weight is decreased. This process represents the validation of the hypothesis that a certain stimulus–action pair is no longer associated with a reward, or that it is possibly associated with punishment. In such a case, the neural weight that represents this stimulus–action pair is decreased, and so is the probability of that pair occurring. The conversion of negative transient weights to decrements of long-term weights, similar to Eq. (10), can be formally expressed as
No other changes are required to the algorithm described in the paper.
The case can be illustrated by reproducing the preliminary test of Fig. 3, augmented with a phase characterized by a negative average modulation. Figure 10 shows that, when modulatory updates become negative on average (from reward 4,000 to reward 5,000), the transient weight detects this change by becoming negative. The use of Eq. (11) then causes the long-term component to reduce its value, thereby reversing the previous learning.
Preliminary experiments with unlearning on the complete neural model of this study show that the rate of negative modulation drops drastically as unlearning proceeds. In other words, as the network experiences negative modulation, and consequently reduces the frequency of punishing stimulus–action pairs, it also reduces the rate of unlearning, because punishing episodes become sporadic. It appears that unlearning from negative experiences might be slower than learning from positive experiences. Evidence from biology indicates that extinction does not completely remove the previous association (Bouton 2000, 2004), suggesting that dynamics more complex than those proposed here may regulate this process in animals.
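The symmetric consolidation rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the variable names, the consolidation rate, and the exact transfer form (Eqs. 10 and 11 are not reproduced on this page) are assumptions.

```python
# Hypothetical sketch of symmetric consolidation/unlearning.
# PSI and RATE are illustrative values, not the paper's parameters.
PSI = 1.0    # consolidation threshold (Psi in the text)
RATE = 0.1   # fraction of the excess transient weight transferred per step

def consolidate(w_long, w_transient):
    """Transient weights above +PSI increase the long-term component
    (learning, cf. Eq. 10); transient weights below -PSI decrease it
    (unlearning, cf. Eq. 11); in between, nothing is consolidated."""
    if w_transient > PSI:
        delta = RATE * (w_transient - PSI)
    elif w_transient < -PSI:
        delta = RATE * (w_transient + PSI)   # negative: long-term decrement
    else:
        delta = 0.0
    return w_long + delta, w_transient - delta

wl, wt = consolidate(0.5, -1.5)   # transient weight well below -PSI
assert wl < 0.5                   # long-term component decreased
```

The unlearning branch mirrors the learning branch exactly, which is why the paper notes that no other changes to the algorithm are required.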
Appendix 2: Implementation
All implementation details are also available as part of the open source Matlab code provided as support material. The code can be used to reproduce the results in this work, or modified to perform further experiments. The source code can be downloaded from http://andrea.soltoggio.net/HTP.
1.1 Network, inputs, outputs, and rewards
The network is a feed-forward single layer neural network with 300 inputs, 30 outputs, 9,000 weights, and sampling time of 0.1 s. Three hundred stimuli are delivered to the network by means of 300 input neurons. Thirty actions are performed by the network by means of 30 output neurons.
The flow of stimuli consists of a random sequence of stimuli, each lasting between 1 and 2 s. The probabilities of 0, 1, 2, or 3 stimuli being shown to the network simultaneously are given in Table 2.
The agent continuously performs actions chosen from a pool of 30 possibilities. The thirty output neurons may be interpreted as single neurons or as populations. When one action terminates, the output neuron with the highest activity initiates the next action. Once the response action is started, it lasts a variable time between 1 and 2 s. During this time, the neuron that initiated the action receives a feedback signal \(I\) of 0.5. The feedback current enables the output neuron responsible for an action to correlate correctly with the stimulus that is simultaneously active. A feedback signal is also used in Urbanczik and Senn (2009) to improve the reinforcement learning performance of a neural network.
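The winner-take-all action selection with a feedback current can be sketched as below. The network dimensions and the feedback current \(I = 0.5\) follow the text; the weight initialization and noise level are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes follow the text: 300 inputs, 30 outputs, 9,000 weights.
N_IN, N_OUT = 300, 30
W = rng.normal(0.0, 0.1, (N_OUT, N_IN))   # illustrative initialization

def select_action(stimulus, feedback_neuron=None):
    """Winner-take-all over output activities; the neuron that initiated
    the ongoing action receives an extra feedback current I = 0.5."""
    I = np.zeros(N_OUT)
    if feedback_neuron is not None:
        I[feedback_neuron] = 0.5
    activity = W @ stimulus + I + rng.normal(0.0, 0.01, N_OUT)  # small noise
    return int(np.argmax(activity))

stimulus = np.zeros(N_IN)
stimulus[5] = 1.0                 # one active input neuron
action = select_action(stimulus)  # winner initiates the next action
assert 0 <= action < N_OUT
```

The feedback current keeps the winning output neuron active for the duration of the action, so that Hebbian correlation can bind it to the stimulus present at the same time.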
The rewarding stimulus–action pairs are \((i,i)\) with \(1 \le i \le 10\) during scenario 1, \((i,i-5)\) with \(11 \le i \le 20\) in scenario 2, and \((i,i-20)\) with \(21 \le i \le 30\) in scenario 3. When a rewarding stimulus–action pair is performed, a reward is delivered to the network with a random delay in the interval [1, 4] s. Given the delay of the reward, and the frequency of stimuli and actions, a number of stimulus–action pairs could be responsible for triggering the reward. The parameters are listed in Table 2. Table 3 lists the parameters of the neural model.
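The reward schedule described above can be sketched as follows. The pair tables and the [1, 4] s delay follow the text; the function names and the event-scheduling mechanics are assumptions.

```python
import random

def rewarding_action(stimulus, scenario):
    """Return the action index rewarded for a given stimulus (1-based),
    following the three scenarios described in the text."""
    if scenario == 1 and 1 <= stimulus <= 10:
        return stimulus          # pairs (i, i)
    if scenario == 2 and 11 <= stimulus <= 20:
        return stimulus - 5      # pairs (i, i-5)
    if scenario == 3 and 21 <= stimulus <= 30:
        return stimulus - 20     # pairs (i, i-20)
    return None

def schedule_reward(t_now, stimulus, action, scenario):
    """If the pair is rewarding, return the delayed delivery time;
    the delay is drawn uniformly from [1, 4] s as in the text."""
    if rewarding_action(stimulus, scenario) == action:
        return t_now + random.uniform(1.0, 4.0)
    return None

t = schedule_reward(10.0, 15, 10, 2)   # pair (15, 10) is rewarding in scenario 2
assert t is not None and 11.0 <= t <= 14.0
```

Because several stimulus–action pairs occur within any 1–4 s window, the delayed delivery makes the eventual reward ambiguous, which is exactly the credit-assignment problem the model addresses.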
1.2 Integration
The integration of Eqs. (3) and (2) with a sampling time \(\Delta t\) of \(100\) ms is implemented step-wise by
The same integration method is used for all leaky integrators in this study. \(r(t)\) is a signal from the environment: it may be a one-step signal, as in the present study, which is high for a single step when the reward is delivered, or any other function representing a reward. In a test of RCHP on the real robot iCub (Soltoggio et al. 2013a, b), \(r(t)\) was determined by a human teacher pressing skin sensors on the robot's arms.
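The step-wise integration of a leaky integrator can be sketched as below. The sampling time \(\Delta t = 0.1\) s follows the text; since Eqs. (2) and (3) are not reproduced on this page, the standard first-order decay form is an assumption.

```python
# Hypothetical sketch of step-wise (Euler) integration of a leaky
# integrator tau * dx/dt = -x + input, with sampling time DT = 0.1 s.
DT = 0.1  # sampling time in seconds, as in the text

def leaky_step(x, inp, tau):
    """One discrete integration step: x(t + DT) = x(t) + (DT/tau)(-x(t) + inp)."""
    return x + (DT / tau) * (-x + inp)

# Example: a one-step reward signal r(t) feeding a trace with tau = 1 s.
x = 0.0
x = leaky_step(x, 1.0, 1.0)   # reward high for one step: trace jumps up
peak = x
for _ in range(50):           # reward off: trace decays toward zero
    x = leaky_step(x, 0.0, 1.0)
assert 0.0 < x < peak
```

The same update, applied with the appropriate time constants, serves every leaky quantity in the model, which keeps the implementation uniform across traces, weights, and modulatory signals.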
1.3 Rarely correlating Hebbian plasticity
Rarely correlating Hebbian plasticity (RCHP) (Soltoggio and Steil 2013) is a type of Hebbian plasticity that filters out the majority of correlations and produces nonzero values only for a small percentage of synapses. Rate-based neurons can use a Hebbian rule augmented with two thresholds to extract low percentages of correlations and decorrelations. RCHP, expressed by Eq. (4), is simulated with the parameters in Table 4. The rate of correlations can be expressed by a global concentration \(\omega _{c}\). This measure represents how much the activity of the network correlates, i.e., how much the network activity is deterministically driven by connections rather than noise-driven. The instantaneous matrix of correlations \(\mathrm{RCHP}^+\) (i.e., the first row in Eq. (4) computed for all synapses) can be low-pass filtered as
to estimate the level of correlations in the recent past, where \(j\) is the index of input neurons, and \(i\) the index of the output neurons. In the current settings, \(\tau _{c}\) was chosen equal to 5 s. Alternatively, a similar measure of recent correlations \(\omega _{c}(t)\) can be computed in discrete time over a sliding time window of 5 s summing all correlations \(\mathrm{RCHP}^+(t)\)
Similar equations to (14) and (15) are used to estimate decorrelations \(\omega _{d}(t)\) from the detected decorrelations \(\mathrm{RCHP}^-(t)\). The adaptive thresholds \(\theta _{hi}\) and \(\theta _{lo}\) in Eq. (4) are estimated as follows.
and
with \(\eta = 0.001\) and \(\mu \), the target rate of rare correlations, set to 0.1 %/s. If correlations are lower than half of the target, or greater than twice the target, the thresholds are adapted to the new increased or reduced activity. This heuristic maintains the thresholds relatively constant, performing adaptation only when correlations are too high or too low for a prolonged period of time.
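The dead-band heuristic for the adaptive threshold can be sketched as follows. The values \(\eta = 0.001\) and \(\mu = 0.1\,\%/\mathrm{s}\) follow the text; since Eqs. (16) and (17) are not shown on this page, the exact update form is an assumption.

```python
# Hypothetical sketch of the threshold-adaptation heuristic.
ETA = 0.001   # adaptation rate eta (from the text)
MU = 0.001    # target correlation rate mu = 0.1 %/s (from the text)

def adapt_threshold(theta_hi, omega_c):
    """Adapt theta_hi only when the measured correlation rate omega_c
    leaves the band [mu/2, 2*mu]; inside the band the threshold is
    left unchanged, keeping it relatively constant over time."""
    if omega_c > 2 * MU:
        theta_hi += ETA * (omega_c - MU)   # too many correlations: raise
    elif omega_c < MU / 2:
        theta_hi -= ETA * (MU - omega_c)   # too few correlations: lower
    return theta_hi

th = 1.0
assert adapt_threshold(th, MU) == th      # inside the band: no change
assert adapt_threshold(th, 0.01) > th     # too many: threshold rises
assert adapt_threshold(th, 0.0001) < th   # too few: threshold drops
```

The dead band is what distinguishes this heuristic from continuous homeostatic regulation: the thresholds react only to sustained deviations, so brief bursts of correlated activity do not perturb them.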
Soltoggio, A. Short-term plasticity as cause–effect hypothesis testing in distal reward learning. Biol Cybern 109, 75–94 (2015). https://doi.org/10.1007/s00422-014-0628-0