Learning with an Asymmetric Teacher: Asymmetric Dopamine-Like Response Can Be Used as an Error Signal for Reinforcement Learning
According to many computational models, dopamine (DA) neurons in the basal ganglia (BG) play a major role in reinforcement learning. The DA signal is proportional to the difference between actual and predicted reward; hence it could serve as the error signal in a temporal difference (TD) learning algorithm implemented in the BG. Indeed, a proportional increase in the firing rate of DA neurons in states with higher values than expected (the positive domain of the error signal) has been found experimentally. However, many studies indicate that DA neurons do not decrease their firing rate symmetrically in the negative domain. Some studies report a smaller gain relative to the positive domain, whereas others report a decrease in firing to a constant level.
Our work focuses on using such an asymmetric error signal in a TD-like computational algorithm. We simulated a probabilistic classical conditioning task in which the agent sequentially received stimuli with different probabilities of reward or aversion. The algorithm calculated the value of each of the stimuli, using an asymmetric TD signal. This was done by manipulating the negative domain of the error signal function by decreasing its slope by a multiplicative factor smaller than one, fixing its negative values to a constant negative level, or fixing them to zero.
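The three manipulations of the negative domain described above can be sketched as a single function. This is a minimal illustration, not the authors' implementation: the function name `asymmetric_error`, the mode labels, and the default values of `gain` and `floor` are all assumptions chosen for the example.

```python
def asymmetric_error(delta, mode="symmetric", gain=0.5, floor=-0.1):
    """Reshape the negative domain of a TD error (hypothetical helper).

    Positive errors always pass through unchanged. `gain` and `floor`
    are illustrative values, not parameters taken from the study.
    """
    if delta >= 0 or mode == "symmetric":
        return delta
    if mode == "reduced_gain":
        # Slope of the negative domain scaled by a factor smaller than one.
        return gain * delta
    if mode == "constant_negative":
        # Every negative error is replaced by one fixed negative level.
        return floor
    if mode == "zero":
        # Negative errors are discarded entirely.
        return 0.0
    raise ValueError(f"unknown mode: {mode}")


# Example: the same negative prediction error under each manipulation.
shaped = {m: asymmetric_error(-1.0, m)
          for m in ("symmetric", "reduced_gain", "constant_negative", "zero")}
```

In a TD-like update, the shaped error would replace the raw error, for example `value += learning_rate * asymmetric_error(delta, mode)`, where `learning_rate` is likewise an assumed name.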
We show that learning can be achieved when the negative domain of the error signal function is either fixed at a constant negative level or has a reduced gain, although learning is slower than with a symmetric error signal. However, learning cannot be achieved when the negative domain of the error signal function is fixed at zero. We examined learning by comparing the values the algorithm assigned to the stimuli with those assigned by a symmetric TD algorithm. These values had a nonlinear concave trend and were higher than those assigned by the symmetric TD algorithm.
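A toy calculation can make these qualitative findings concrete. The sketch below is a deliberate simplification and not the simulation reported here: it assumes a single stimulus followed by a reward of 1 with probability `p_reward` (no aversive outcomes), a one-step value update, and the expected (mean) update in place of random sampling. The names `shape_error` and `learned_value` and the values of `gain`, `alpha` and `steps` are hypothetical.

```python
def shape_error(delta, mode, gain=0.3):
    """Apply an assumed asymmetry to the negative domain of the error."""
    if delta >= 0 or mode == "symmetric":
        return delta
    if mode == "reduced_gain":
        return gain * delta
    if mode == "zero":
        return 0.0
    raise ValueError(f"unknown mode: {mode}")


def learned_value(p_reward, mode, alpha=0.05, steps=5000):
    """Iterate the expected value update for one stimulus until it settles."""
    v = 0.0
    for _ in range(steps):
        rewarded = shape_error(1.0 - v, mode)  # error when reward arrives
        omitted = shape_error(0.0 - v, mode)   # error when reward is omitted
        v += alpha * (p_reward * rewarded + (1.0 - p_reward) * omitted)
    return v


probs = [0.25, 0.5, 0.75]
values = {mode: [learned_value(p, mode) for p in probs]
          for mode in ("symmetric", "reduced_gain", "zero")}
```

Under these assumptions the symmetric values settle at the reward probabilities themselves, the reduced-gain values settle at `p / (p + gain * (1 - p))`, which lies above the symmetric values and is concave in `p`, and the zero-floor values all drift toward 1, so stimuli with different reward probabilities can no longer be told apart.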
This suggests that the asymmetric DA signal could be used as an error signal in a TD algorithm as implemented in the BG, and that asymmetric DA coding does not require a complementary BG modulatory error signal. We further hypothesize that the concave values of the error signal curve could thus lead to some aspects of irrational human behavior.
Keywords: Basal Ganglion, Conditioned Stimulus, Firing Rate, Error Signal, Unconditioned Stimulus