Evaluating the TD model of classical conditioning
The temporal-difference (TD) algorithm from reinforcement learning provides a simple method for incrementally learning predictions of upcoming events. Applied to classical conditioning, TD models suppose that animals learn a real-time prediction of the unconditioned stimulus (US) on the basis of all available conditioned stimuli (CSs). In the TD model, similar to other error-correction models, learning is driven by prediction errors—the difference between the change in US prediction and the actual US. With the TD model, however, learning occurs continuously from moment to moment and is not artificially constrained to occur in trials. Accordingly, a key feature of any TD model is the assumption about the representation of a CS on a moment-to-moment basis. Here, we evaluate the performance of the TD model with a heretofore unexplored range of classical conditioning tasks. To do so, we consider three stimulus representations that vary in their degree of temporal generalization and evaluate how the representation influences the performance of the TD model on these conditioning tasks.
Keywords: Associative learning · Classical conditioning · Timing · Reinforcement learning
Classical conditioning is the process of learning to predict the future. The temporal-difference (TD) algorithm is an incremental method for learning predictions about impending outcomes that has been used widely, under the label of reinforcement learning, in artificial intelligence and robotics for real-time learning (Sutton & Barto, 1998). In this article, we evaluate a computational model of classical conditioning based on this TD algorithm. As applied to classical conditioning, the TD model supposes that animals use the conditioned stimulus (CS) to predict in real time the upcoming unconditioned stimulus (US) (Sutton & Barto, 1990). The TD model of conditioning has become the leading explanation for conditioning in neuroscience, due to the correspondence between the phasic firing of dopamine neurons and the reward-prediction error that drives learning in the model (Schultz, Dayan, & Montague, 1997; for reviews, see Ludvig, Bellemare, & Pearson, 2011; Maia, 2009; Niv, 2009; Schultz, 2006).
The TD model can be viewed as an extension of the Rescorla–Wagner (RW) learning model, with two additional twists (Rescorla & Wagner, 1972). First, the TD model makes real-time predictions at each moment in a trial, thereby allowing the model to potentially deal with intratrial effects, such as the effects of stimulus timing on learning and the timing of responses within a trial. Second, the TD algorithm uses a slightly different learning rule with important implications. As will be detailed below, at each time step, the TD algorithm compares the current prediction about future US occurrences with the US predictions generated on the last time step. This temporal difference in US prediction is compared with any actual US received; if the latter two quantities differ, a prediction error is generated. This prediction error is then used to alter the associative strength of recent stimuli, using an error-correction scheme similar to the RW model. This approach to real-time predictions has the advantage of bootstrapping by comparing successive predictions. As a result, the TD model learns whenever there is a change in prediction, and not only when USs are received or omitted. This seemingly subtle difference makes empirical predictions beyond the scope of the RW model. For example, the TD model naturally accounts for second-order conditioning. When an already-established CS occurs, there is an increase in the US prediction and, thus, a positive prediction error. This prediction error drives learning to the new, preceding CS, producing second-order conditioning (Sutton & Barto, 1990).
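The moment-to-moment comparison of successive predictions can be sketched in a few lines of code. This is a minimal illustration, not the model's exact Eqs. 1–8; the parameter values and variable names (alpha, gamma, lam) are placeholders, and the ordering of the trace update is one plausible choice.

```python
import numpy as np

def td_step(w, e, x_prev, x_curr, us, alpha=0.1, gamma=0.97, lam=0.9):
    """One moment-to-moment TD update (sketch; names are illustrative).

    w: weights over representational elements
    e: eligibility traces marking recently active elements
    x_prev, x_curr: stimulus representation on the last and current time steps
    us: US intensity received on the current time step
    """
    v_prev = w @ x_prev                    # US prediction on the last time step
    v_curr = w @ x_curr                    # US prediction on the current time step
    delta = us + gamma * v_curr - v_prev   # TD prediction error
    e = gamma * lam * e + x_prev           # decay traces, mark recent elements
    w = w + alpha * delta * e              # error-correction update
    return w, e
```

Note that the onset of an already-established CS raises v_curr and thus yields a positive delta even with no US present, which is how this update produces second-order conditioning to a preceding stimulus.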
In the RW model and many other error-correction models, the associative strength of a single CS trained by itself is a recency-weighted average of the magnitude of all previous US presentations, including trials with no US as the zero point of the continuum (see Kehoe & White, 2002). The timing of those previous USs relative to the CS does not play a role. So long as the US occurs during the experimenter-designated trial, the US is incorporated equivalently into that running average, which also serves as a prediction of the upcoming US magnitude. In contrast, in TD models, time infuses the prediction process. As was noted above, both the predictions and prediction errors are computed on a moment-by-moment basis. In addition, the predictions themselves can have a longer time horizon, extending beyond the current trial.
In this article, we evaluate the TD model of conditioning on a broader range of behavioral phenomena than have been considered in earlier work on the TD model (e.g., Ludvig, Sutton, & Kehoe, 2008; Ludvig, Sutton, Verbeek, & Kehoe, 2009; Moore & Choi, 1997; Schultz et al., 1997; Sutton & Barto, 1990). In particular, we try to highlight those issues that distinguish the TD model from the RW model (Rescorla & Wagner, 1972). In most situations where the relative timing does not matter, the TD model reduces to the RW model. As outlined in the introduction to this special issue, we focus on the phenomena of timing in conditioning (Group 12) and how stimulus timing can influence fundamental learning phenomena, such as acquisition (Group 1), blocking, and overshadowing (Group 7). To illustrate how the TD model learns in these situations, we present simulations with three stimulus representations, each of which makes different assumptions about the temporal granularity with which animals represent the world.
This V is an estimate of the value in the context of reinforcement learning theory (Sutton & Barto, 1998) and is equivalent to the aggregate associative strength central to many models of conditioning (Pearce & Hall, 1980; Rescorla & Wagner, 1972). We will primarily use the term US prediction to refer to this core variable in the model. In the learning algorithm, each element of the stimulus representation (or sensory input) has an associated weight that can be modified on the basis of the accuracy of the US prediction. In the simplest case, every stimulus has a single element that is on (active) when that stimulus is present and off (inactive) otherwise. The modifiable weight would then be directly equivalent to the US prediction supported by that stimulus. Below, we will discuss in detail some more sophisticated stimulus representations.
The truncated square brackets indicate that only the suprathreshold portion of the US prediction is integrated into the response, which is interpreted as the CR level (see Kehoe, Ludvig, Dudeney, Neufeld, & Sutton, 2008; Ludvig et al., 2009). This response measure can be readily mapped in a monotonic fashion onto either continuous measures (e.g., lick rate, suppression ratios, food cup approach time) or response likelihood measures based on discrete responses. Thus, comparisons can be conducted across experiments on the basis of ordinal relationships, rather than differing preparation-specific levels (cf. Rescorla & Wagner, 1972).
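One way to realize such a response rule is leaky accumulation of only the above-threshold portion of the US prediction. The sketch below is a plausible reading of an Eq. 2-style mechanism; the threshold and leak values are illustrative placeholders, not the paper's fitted parameters.

```python
import numpy as np

def cr_level(v, theta=0.1, leak=0.9):
    """Leaky accumulation of the suprathreshold US prediction (sketch).

    v: US prediction at each time step within a trial
    theta: response threshold; leak: retention per step (both illustrative)
    """
    out = np.zeros(len(v))
    acc = 0.0
    for t, vt in enumerate(v):
        acc = leak * acc + max(vt - theta, 0.0)  # truncated brackets: [v - theta]+
        out[t] = acc
    return out
```

Because the output grows monotonically with the US prediction, it supports the kind of ordinal comparisons across response measures described above.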
This formulation emphasizes the similarity with the RW rule, where a simple difference between the US intensity and a prediction of that intensity drives learning. In this formulation, a positive prediction error occurs when the US intensity exceeds a temporal difference in US prediction, and a negative prediction error occurs whenever the US intensity falls below the temporal difference in US prediction. Note that if γ is 0, the prediction error is identical to the RW prediction error, making real-time RW a special case of TD learning.
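Concretely, the prediction error can be written as δ = US + γ·V(t) − V(t−1), so that setting γ = 0 collapses it to the RW error US − V(t−1). A one-line sketch (variable names are illustrative):

```python
def td_prediction_error(us, v_prev, v_curr, gamma=0.97):
    """delta = US + gamma*V(current) - V(previous); gamma = 0 gives the RW error."""
    return us + gamma * v_curr - v_prev
```

With γ = 1 and no US, an unchanged prediction yields zero error, making explicit that learning is driven only by changes in prediction or by the US itself.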
Complete serial compound
At the opposite end from a single representational element per stimulus is a separate representational element for every moment in the stimulus (first column in Fig. 1). The motivating idea is that extended stimuli are not coherent wholes but, rather, temporally differentiated into a serial compound of temporal elements. In a complete serial compound (CSC), every time step in a stimulus is a unique element (separate rows in Fig. 1). This CSC representation is used in most of the TD models of dopamine function (e.g., Montague, Dayan, & Sejnowski, 1996; Schultz, 2006; Schultz et al., 1997) and is often taken as synonymous with the TD model (e.g., Amundson & Miller, 2008; Church & Kirkpatrick, 2001; Desmond & Moore, 1988; Jennings & Kirkpatrick, 2006), although alternate proposals do exist (Daw, Courville, & Touretzky, 2006; Ludvig et al., 2008; Suri & Schultz, 1999). Although clearly biologically unrealistic, in a behavioral model, this CSC representation can serve as a useful fiction that allows examination of different learning rules unfettered by constraints from the stimulus representation (Sutton & Barto, 1990). This stimulus representation occupies the far pole along the temporal generalization gradient from the presence representation (see Fig. 1). There is complete temporal differentiation of every moment and no temporal generalization whatsoever.
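In code, a CSC representation is simply a one-hot code over the time steps within the stimulus. A minimal sketch (the rows-as-time array layout is an assumption of this illustration):

```python
import numpy as np

def csc(cs_onset, cs_duration, total_steps):
    """One unique element per time step of the CS (rows: time, cols: elements)."""
    X = np.zeros((total_steps, cs_duration))
    for i in range(cs_duration):
        X[cs_onset + i, i] = 1.0  # element i is active only at its own moment
    return X
```

Because each element is active at exactly one moment, there is no overlap between elements and hence no temporal generalization at all.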
A third stimulus representation occupies an intermediate zone of limited temporal generalization between complete temporal generalization (presence) and no temporal generalization (CSC). The middle column of Fig. 1 depicts what such a representation looks like: Each successive row presents a microstimulus (MS) that is wider and shorter and peaks later.
The resulting microstimuli bear a resemblance to the spectral traces of Grossberg and Schmajuk (1989; also Brown, Bullock, & Grossberg, 1999; Buhusi & Schmajuk, 1999), as well as the behavioral states of Machado (1997). We do not claim that the exact quantitative form of these microstimuli is critical to the performance of the TD model below, but rather, we are examining how the general idea of a series of broadening microstimuli with increasing temporal delays interacts with the TD learning rule. Our claim will be that introducing a form of limited temporal generalization into the stimulus representation and using the TD learning rule creates a model that captures a broader space of conditioning phenomena.
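For concreteness, one construction in the spirit of this idea (the exact quantitative form is not claimed to be the paper's; decay and sigma below are illustrative) passes an exponentially decaying stimulus trace through Gaussian receptive fields on trace height, then scales by the trace, so that later microstimuli are broader and shorter:

```python
import numpy as np

def microstimuli(n_steps, n_micro=6, decay=0.985, sigma=0.08):
    """Generate a set of microstimuli from a decaying memory trace (sketch)."""
    y = decay ** np.arange(n_steps)                      # stimulus memory trace
    centers = np.linspace(1.0, 1.0 / n_micro, n_micro)   # receptive fields on trace height
    X = np.exp(-(y[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))
    return X * y[:, None]  # successive microstimuli peak later, wider, and shorter
```

Each column of the returned array is one microstimulus; successive columns reach their maxima progressively later and at progressively lower heights, mirroring the middle column of Fig. 1.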
To be clear, here is a fully worked example of what the TD model learns in simple CS–US acquisition. In this example, the CS–US interval is 25 time steps, and we assume a CSC representation and one-step eligibility traces (λ = 0). First, imagine the time point at which the US occurs (time step 25). On the first trial, the weight for time step 25 is updated toward the US intensity, but nothing else changes. On the second trial, the weight for time step 24 also gets updated, because there is now a temporal difference in the US prediction between time step 24 and time step 25. The weight for time step 24 is moved toward the discounted weight for time step 25 (bootstrapping). The weight for time step 25 gets updated as before after the US is received. On the third trial, the weight for time step 23 would also get updated, because there is now a temporal difference between the US predictions at time steps 23 and 24. . . . Eventually, across trials, the prediction errors (and thus, US predictions) percolate back one step at a time to the earliest time steps in the CS. This process continues toward an asymptote, where the US prediction at time step 25 matches the US intensity and the predictions for earlier time steps match the appropriately discounted US intensity. If we take the US intensity to be 1, because of the discounting, the asymptotic prediction for time step 24 is γ, which is equal to 0 (the US intensity at that time step) + 1 × γ (the discounted US intensity from the next step). Following the same logic, the asymptotic predictions for earlier time points in the stimulus are γ^2, γ^3, γ^4, and so on, forming an exponential US prediction curve (see also Eq. 9). Introducing multistep eligibility traces (λ > 0) maintains these asymptotic predictions but allows for the prediction errors to percolate back through the CS faster than one step per trial.
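The percolation described in this worked example can be reproduced in a few lines. This sketch uses the CSC representation with λ = 0, so only the weight for the single preceding time step moves on each update; α = 0.1 and γ = 0.97 are illustrative values, not the paper's parameters.

```python
import numpy as np

def csc_acquisition(n_trials=2000, isi=25, alpha=0.1, gamma=0.97):
    """Acquisition with a CSC representation and one-step traces (lambda = 0)."""
    w = np.zeros(isi)  # one weight per CS time step; US arrives after step isi-1
    for _ in range(n_trials):
        for t in range(isi):
            us = 1.0 if t == isi - 1 else 0.0        # US only on the final transition
            v_next = w[t + 1] if t + 1 < isi else 0.0
            delta = us + gamma * v_next - w[t]       # TD error at step t
            w[t] += alpha * delta                    # only this step's weight moves
    return w

w = csc_acquisition()
# Asymptotically, the prediction k steps before the US approaches gamma**k,
# i.e., the exponential US prediction curve described in the text.
```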
For the simulations below, we chose a single set of parameters to illustrate the qualitative properties of the model, rather than attempting to maximize goodness of fit to any single data set or subset of findings. By using a fixed set of parameters, we tested the ability of the model to reproduce the ordinal relationships for a broad range of phenomena, thus ascertaining the scope of the model in a consistent manner (cf. Rescorla & Wagner, 1972, p. 77). The full set of parameters is listed in the Appendix.
For this first set of results, we simulated the acquisition of a CR with the TD model, explicitly comparing the three stimulus representations described previously (see item #1.1 listed in the introduction to this issue). We focus here on the timing of the response during acquisition (#12.4) and the effect of ISI on the speed and asymptote of learning (#12.1).
With a CSC representation, each time point has a separate weight, and therefore, the TD model perfectly produces an exponentially increasing US prediction curve at asymptote (Fig. 2a). For the other two representations, the representations at different time steps are not independent, so the updates for one time step directly alter the US prediction at other time steps through generalization. Given this interdependence, the TD learning mechanism produces an asymptotic US prediction curve that best approximates the exponentially increasing curve observed with the CSC, using the available representational elements. With the presence representation (Fig. 2b), there is only a single weight per stimulus that can be learned. As a result, the algorithm gradually converges on a weight that is slightly below the halfway point between the US intensity (1) and the discounted US prediction at the onset of the stimulus (γ^t = .97^25 ≈ .47, where t is the number of time steps in the stimulus). The US prediction is nearly constant throughout the stimulus, modulo a barely visible decrease due to nonreinforcement during the stimulus interval. This asymptotic weight is a product of the small negative prediction errors at each time step when the US does not arrive and the large positive prediction error at US delivery. With a near-constant US prediction, the TD model with the presence representation cannot recreate many features of response timing (see also Fig. 4b).
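The tug-of-war behind this sub-halfway asymptote can be seen in a stripped-down simulation with a single presence weight and no eligibility accumulation (λ = 0, unit trace while the CS is on). Because the accumulating traces of the full model are omitted, the exact asymptote here differs from the paper's simulations; the qualitative result, a weight pulled below the midpoint between γ^t and 1 by the many small negative errors and the single large positive error, is the same.

```python
def presence_acquisition(n_trials=3000, isi=25, alpha=0.1, gamma=0.97):
    """Single-weight (presence) acquisition sketch, lambda = 0, unit trace."""
    w = 0.0
    for _ in range(n_trials):
        for t in range(isi):
            cs_next = t + 1 < isi          # CS stays on until the US arrives
            us = 0.0 if cs_next else 1.0
            v_next = w if cs_next else 0.0  # same lone weight predicts every moment
            delta = us + gamma * v_next - w
            w += alpha * delta              # small negative nudges, one big positive
    return w

w = presence_acquisition()
```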
Finally, as is shown in Fig. 2c, the TD algorithm with the MS representation also converges at asymptote to a weighted sum of the different MSs that approximates the exponentially increasing US prediction curve. Even with only six MSs per stimulus, a reasonable approximation is found after 200 trials. An important feature of the MS representation is brought to the fore here. The US also acts as a stimulus with its own attendant MSs. These MSs gain strong negative weights (because the US is not followed by another US) and, thus, counteract any residual prediction about the US that would be produced by the CS MSs during the post-US period.
With the presence representation, there is a substantial decrease in asymptotic CR level with longer ISIs and a small decrease at short ISIs—similar to what has been shown previously (e.g., Fig. 18 in Sutton & Barto, 1990). The decrease with longer ISIs occurs because there is only a single weight with the presence representation. That weight is adjusted on every trial by the many negative prediction errors early in the CS and the single positive prediction error at US onset. With longer ISIs, there are more time points (early in the stimulus) with small negative prediction errors. With only a single weight to be learned and no temporal differentiation, the presence representation thereby results in lower US prediction (and less responding) with longer ISIs. With very short ISIs, in contrast, the high learned US prediction does not quite have enough time to accumulate through the response generation mechanism (Eq. 2), producing a small dip with short ISIs (which is even more pronounced for the MS representation). There is an interesting interaction between learning rate and asymptotic response levels with the presence representation. Although the longer ISIs produce lower rates of asymptotic conditioning, they are actually learned about more quickly because the eligibility trace, which helps determine the learning rate, accumulates to a higher level with longer stimuli (see right panel of Fig. 3b). In sum, with the presence representation, longer stimuli are learned about more quickly, but to a lower asymptote.
With the MS representation, a similar inverted-U pattern emerges. The longest ISI is now both learned about most slowly and to a lower asymptote. The slower learning emerges because the late MSs are shorter than the early MSs (see Fig. 1) and, therefore, produce lower eligibility traces. The lower asymptote emerges because the MSs are inexact and also extend beyond the point of the US. The degree of inexactness grows with ISI. As a result, the maximal US prediction does not get as close to 1 with longer ISIs, and a lower asymptotic response is generated. In this simulation with 200 trials, it is primarily the low learning rate due to low eligibility that limits the CR level with long ISIs, but a similar qualitative effect emerges even with more trials (or a larger learning rate; not shown). Finally, as with the presence representation, the shortest ISI produces less responding because of the accumulation component of the response generation mechanism.
For our second set of simulations, we consider in greater detail the issue of response timing during conditioning (cf. Items #12.4, #12.5, #12.6, and #12.9 as listed in the Introduction to this special issue). In these simulations, there were four different ISIs: 10, 25, 50, and 100 time steps. Simulations were run for 500 trials, and every 5th trial was a probe trial. On those probe trials, the US was not presented, and the CS remained on for twice the duration of the ISI to evaluate response timing unfettered by the termination of the CS (analogous to the peak procedure from operant conditioning; Roberts, 1981; for examples of a similar procedure with eyeblink conditioning, see Kehoe et al., 2008; Kehoe, Olsen, Ludvig, & Sutton, 2009).
Figure 4 illustrates the CR time course for the different stimulus representations and ISIs. As can be seen in the top row, with a CSC representation (left column), the TD model displays a CR time course that is sharply peaked at the exact time of US presentation, even in the absence of the US. In line with those results, the bottom row (Fig. 4b) shows how the peak time is perfectly aligned with the US presentation from the very first CR that is emitted by the model. This precision is due to the perfect timing inherent in the stimulus representation and is incongruent with empirical data, which generally show a less precise timing curve (e.g., Kehoe, Olsen, et al., 2009; Smith, 1968). In addition, the time courses are exactly the same for the different ISIs, only translated along the time axis. There is no change in the width or spread of the response curve with longer ISIs, again in contrast to the empirical data (Kehoe, Olsen, et al., 2009; Smith, 1968). Note again how the maximum response levels are the same for all the ISIs (cf. Fig. 3).
With the MS representation (middle column in Fig. 4), model performance is greatly improved. The response curve gradually increases to a maximum around the usual time of US presentation and gradually decreases afterward for all four ISIs. The response curves are wider for the longer ISIs (although not quite proportionally), approximating one aspect of scalar timing, despite a deterministic representation. The times of the simulated CR peaks (bottom row) are also reasonably well aligned with actual CRs from their first appearance (see Drew, Zupan, Cooke, Couvillon, & Balsam, 2005; Kehoe et al., 2008). In contrast to the CSC representation, the simulated peaks occur early during the initial portion of acquisition but drift later as learning progresses. This effect is most pronounced for the longest ISI—in line with the empirical data for eyeblink conditioning (Kehoe et al., 2008; Vogel, Brandon, & Wagner, 2003). At asymptote, the CR peaks occur slightly (1 or 2 time steps) later than the usual time of US presentation, due to a combination of the estimation error inherent in the coarse stimulus representation and the slight lag induced by the leaky integration of the real-time US prediction (Eq. 2).
Finally, the TD model with the presence representation does very poorly at response timing, as might be expected given that no temporal differentiation in the US prediction curve is possible. The simulated CRs peak too late for the short ISIs (10 and 25 time steps), too early for the medium ISI (50 time steps), and there is no response curve at all for the longest ISI (100 time steps). The late peaks for the short ISI reflect the continued accumulation of the US prediction by the response generation mechanism (Eq. 2) well past the usual time of the US as the CS continues to be present. The disappearance of a response for the longest ISI is due to the addition of probe trials in this simulation (cf. Fig. 3). With only a single representational element, the probe trials are particularly detrimental to learning with the presence representation. Not only is no US present on the probe trials, but also the CS (and thus, the presence element) is extended. This protracted period of unreinforced CS presentation operates as a doubly long extinction trial, driving down the weight for the lone representational element and reducing the overall US prediction below the threshold for the response generation mechanism. Indeed, making the probe trials longer would drive down the US prediction even further, potentially eliminating responding for the shorter ISIs as well. The other representations do not suffer from this shortcoming, because the decline in the weights of elements during the extended portion of the CS generalizes only weakly, if at all, to earlier portions of the CS.
Cue competition set
The top panel of Fig. 5 (Fig. 5a) shows responding on the probe trials when both CSs had identical ISIs (50 time steps). With all three representations, and in accord with the empirical data (e.g., Cole & McNally, 2007; Jennings & Kirkpatrick, 2006; Kehoe et al., 1981), there is complete blocking of responding to CSB but high levels of responding to both CSA and the CSA + CSB compound stimulus. With the CSC representation, the blocking occurs because the US is perfectly predicted by the pretrained CSA after the first phase of training; thus, there is no prediction error and no learning to the added CSB. With the presence and MS representations, the US is not perfectly predicted by CSA after the first phase (cf. Fig. 2). As a result, when the US occurs, there is still a positive prediction error, which causes an increase in the weights of eligible CS elements. The increment in the weights induced by this prediction error, however, is perfectly cancelled out by the ongoing, small negative prediction errors during the compound stimulus on the next trial. No net increase in weights occurs to either stimulus, resulting in blocking to the newly introduced CSB (see the section on acquisition above).
If, instead, the added CSB starts later than the pretrained CSA (see Fig. 5b), the simulated results change somewhat, but only for the TD model with the presence representation. In these simulations, during the second phase, CSB was trained with an ISI of 25 time steps, and CSA was still trained with an ISI of 50 time steps. As a result, the CSB started 25 time steps after CSA and lasted half as long. In this case, there is still full blocking of CSB with the CSC and MS representations, but not with the presence representation. With the latter representation, there is some responding to the blocked CSB and a sharp decrease in responding to CSA, as compared with the condition with matched ISIs (Fig. 5a). This responding to CSB occurs because CSB is a shorter stimulus that is more proximal to the US. As a result, there is the same positive prediction error on US receipt but fewer negative prediction errors during the course of the compound stimulus. Over time, CSB gradually “steals” the associative strength of the earlier CSA.
Empirically, when the onset of the added CSB occurs later than the onset of the pretrained CSA (as in Fig. 5b), acquisition of responding to the added CSB is largely blocked in eyeblink, fear, and appetitive conditioning (Amundson & Miller, 2008, Experiment 1; Jennings & Kirkpatrick, 2006, Experiment 2), as predicted by the TD model with the CSC or MS representations. When measured, responding to the pretrained CSA has shown mixed results. In rabbit eyeblink conditioning, responding to CSA has remained high (Kehoe et al., 1981), consistent with the representations that presume limited temporal generalization (MS and CSC). In appetitive conditioning in rats, however, CSA has suffered some loss of responding (Jennings & Kirkpatrick, 2006, Experiment 2), more consistent with a greater degree of temporal generalization (presence).
Finally, Fig. 5c depicts the results of simulations when the onset of the added CSB precedes the pretrained CSA during compound training. In this simulation, the CSA was always trained with an ISI of 25 time steps, and the CSB was trained with an ISI of 50 time steps. In this case, blocking was attenuated in the model with all three stimulus representations. For the TD model, this attenuation derives from second-order conditioning (#11.2): CSB, with its earlier onset, comes to predict the onset of the blocking stimulus CSA. In TD learning, the change in the US prediction at the onset of CSA (see Eq. 3) produces a prediction error that changes the weights for the preceding CSB. For the presence and MS representations, this second-order conditioning has an additional consequence. Because the elements of the stimulus representation for the added CSB overlap with those of CSA, the response to CSA diminishes (significantly more so for the presence representation).
In the empirical data, when the onset of the added CSB occurred earlier than the pretrained CSA, responding to the added CSB showed little evidence of blocking (Amundson & Miller, 2008, Experiment 2; Cole & McNally, 2007; Jennings & Kirkpatrick, 2006, Experiment 2; Kehoe et al., 1987). Independently of stimulus representation, the TD model correctly predicts attenuated blocking of responding to CSB in this situation (Fig. 5c), but not an outright absence of blocking. Responding to the pretrained CSA, in contrast, showed progressive declines after CSB was added in eyeblink and fear conditioning (Cole & McNally, 2007; Kehoe et al., 1987), consistent with the presence representation, but not in appetitive conditioning (Jennings & Kirkpatrick, 2006), more consistent with the CSC and MS representations.
In these simulations of blocking, the presence representation is the most easily distinguishable from the other two representations, which presuppose less-than-complete temporal generalization. Most notably, with a presence representation, the TD model predicts that responding will strongly diminish to the pretrained stimulus (CSA) when the onset of the added stimulus (CSB) occurs either later or earlier than CSA. In contrast, the CSC and MS representations predict that responding to the pretrained CSA will remain at a high level.
We further consider one additional variation on blocking, where the ISI for the blocking stimulus CSA changes between the elemental and compound conditioning phases of the blocking experiment (e.g., Exp. 4 in Amundson & Miller, 2008; Schreurs & Westbrook, 1982). Empirically, in these situations, blocking is attenuated with the change in ISI. Once again, in these simulations, the blocking CSA was first paired with the US for 200 trials, but with an ISI of 100 time steps. Compound training also proceeded for 200 trials, and both stimuli had an ISI of 25 time steps during this phase.
A second cue competition scenario that we simulated is overshadowing (#7.5)—often observed when two stimuli are conditioned in compound. In the overshadowing simulations below, the overshadowed CSB always had an ISI of 25 time steps. We included four overshadowing conditions in these simulations, where the overshadowing CSA had ISIs of 25 time steps (same), 50 time steps (long), or 100 time steps (longer) or was omitted altogether (none). Training proceeded for 200 trials, and 3 probe trials were included at the end of training: CSA alone, CSB alone, and a compound stimulus (CSA + CSB together).
The different representations also produce different predicted time courses of responding to the overshadowing and overshadowed stimuli. Figure 7c shows the CR time course during the probe trials after overshadowing with asynchronous stimuli (the long condition in Fig. 7b). For the CSB alone and the compound stimulus, the response curves are quite similar with the different representations (modulo the quirks in the shape of the timing function highlighted in Fig. 4), with a gradually increasing response that peaks around the time of the US. With the presence and MS representations, there is a small kink upward in responding when the second stimulus (CSB) turns on during the compound trials (compare Fig. 2 in Jennings et al., 2007), because the CSB provides a stronger prediction about the upcoming US. For the CSA-alone trials, however, the CR time courses are different for each of the representations. The CSC representation predicts a two-peaked response, the MS representation predicts a slight leftward shift in the time of maximal responding relative to the US time, and the presence representation predicts flat growth until CS termination. The empirical data seem to rule out the time course predicted by a CSC representation but are not clear in distinguishing the latter two possibilities (Jennings et al., 2007).
This overshadowing simulation also captures part of the information effect in compound conditioning (#7.11; Egger & Miller, 1962); with a fully informative and earlier CSA present, responding to CSB is sharply reduced (Fig. 6b). In addition, making CSA less informative by inserting CSA-alone trials during training reduces the degree to which responding to CSB is reduced (simulation not shown, but see Sutton & Barto, 1990).
In this article, we have evaluated the TD model of classical conditioning on a range of conditioning tasks. We have examined how different temporal stimulus representations modulate the TD model’s predictions, with a particular focus on those findings where stimulus timing or response timing are important. Across most tasks, the microstimulus representation provided a better correspondence with the empirical data than did the other two representations considered. With only a fixed set of parameters, this version of the TD model successfully simulated the following phenomena from the focus list for this special issue: 1.1, 7.2, 7.5, 7.11, 12.1, 12.4, 12.5, 12.6, 12.7, and 12.9.
Although not a causal explanation (this lies in the mechanism described in Eqs. 1–8), such a teleological interpretation can be very helpful in understanding the functioning of the proximal learning mechanism. For example, let us return to the question of why the US prediction takes a given time course with the MS representation (Fig. 2c). In this case, a teleological interpretation is that the TD learning algorithm is trying to best approximate the target US prediction, which is an exponentially weighted average of future USs. This target US prediction is exactly recreated by the US prediction curve for the complete serial compound (Fig. 2a), whose independent, time-specific elements can match the target curve without constraint. In contrast, the TD model with MSs will find the best linear weighting of the MSs that approximates this target curve.
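This teleological reading can be made concrete: the asymptotic MS solution behaves like the least-squares projection of the exponential target curve onto the microstimulus basis. The sketch below uses a generic Gaussian basis over time (not the paper's exact microstimuli; all parameter values are illustrative) purely to show that a handful of basis elements suffices to approximate the exponential target closely.

```python
import numpy as np

isi, gamma, n_basis = 25, 0.97, 6
t = np.arange(isi)
target = gamma ** (isi - 1 - t)           # exponential US prediction curve, ending at 1
centers = np.linspace(0, isi - 1, n_basis)
X = np.exp(-(t[:, None] - centers[None, :]) ** 2 / (2 * 5.0 ** 2))  # Gaussian basis
w, *_ = np.linalg.lstsq(X, target, rcond=None)  # best linear weighting of the basis
approx = X @ w
```

The residual of this projection is what the TD learning rule whittles away trial by trial; richer (or better-placed) bases leave a smaller residual, which is why the CSC reproduces the target exactly and the presence representation only matches its average level.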
A notable feature of the MS TD model is that good timing results emerge with only a small handful of deterministic representational elements per stimulus (e.g., Fig. 4). The MS TD model exhibits proportional timing (peaks at the right time), a graded response curve, an increase in the width of the response curve with ISI (although subproportional), well-aligned peak times from early in conditioning, and inhibition of delay. Because the TD learning algorithm finds the best linear combination of representation elements to approximate the target US prediction curve, there is no need to fully cover the space with basis functions that each have unique maximal response times. This sparse coverage approach differs from most other timing models in the spectral timing family (e.g., Grossberg & Schmajuk, 1989; Machado, 1997; but see Buhusi & Schmajuk, 1999), improves upon earlier versions of the MS TD model that used many more MSs per stimulus (Ludvig et al., 2008; Ludvig et al., 2009), and blunts one of the main criticisms of this class of models, that they suffer from the “infinitude of the possible” (Gallistel & King, 2009). In addition, as further support for such an approach to learning and timing, a spectrum of MS-like traces have recently been found during temporally structured tasks in the basal ganglia (Jin, Fujii, & Graybiel, 2009) and hippocampus (MacDonald, Lepage, Eden, & Eichenbaum, 2011).
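For concreteness, here is one way such a sparse basis can be generated, in the spirit of Ludvig et al. (2008); the parameter values are illustrative rather than the fitted ones. Each stimulus onset launches an exponentially decaying memory trace, and each microstimulus is a Gaussian bump over trace height, scaled by the trace itself, so that successive microstimuli peak later and are progressively wider and shorter in real time:

```python
import numpy as np

def microstimuli(n_steps, n_ms=8, decay=0.985, sigma=0.08):
    """Generate a small set of microstimuli for a stimulus onset at t = 0.
    Returns an array of shape (n_steps, n_ms)."""
    t = np.arange(n_steps)
    trace = decay ** t                         # decaying memory trace in (0, 1]
    centers = np.linspace(1, 1 / n_ms, n_ms)   # evenly spaced bumps over trace height
    # Gaussian tuning of each microstimulus to the current trace height,
    # multiplied by the trace, so late microstimuli are wider in real time
    x = np.exp(-(trace[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))
    return x * trace[:, None]
```

Only `n_ms` elements per stimulus are needed, because the TD learning rule weights them to approximate the target curve rather than requiring a dense tiling of time.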
The presence representation, however, did produce a better correspondence with some aspects of the data from asynchronous blocking (Fig. 5). For example, the decrease in responding to the blocking CSA when preceded by the blocked CSB is predicted only by the presence representation (Fig. 5c). In previous work, we have examined a hybrid representation that uses both an MS spectrum and a presence bit to model some differences between trace and delay conditioning (Kehoe, Olsen, Ludvig, & Sutton, 2009b; Ludvig et al., 2008). Such a hybrid representation can have the advantages of both constituent representations, but the interaction between the representations quickly gets complicated even in simple situations, perhaps limiting its explanatory value (see Ludvig et al., 2009).
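Structurally, the hybrid scheme amounts to concatenating the two feature sets for each stimulus. The helper below is a hypothetical sketch (the function name and shapes are our own, and the microstimulus spectrum is passed in precomputed); the complications noted above arise in how TD learning distributes credit across the joined features, not in the concatenation itself:

```python
import numpy as np

def hybrid_features(n_steps, cs_on, cs_off, ms):
    """Concatenate a presence bit (1 while the CS is on, 0 otherwise)
    with a precomputed microstimulus spectrum for the same CS."""
    presence = np.zeros((n_steps, 1))
    presence[cs_on:cs_off, 0] = 1.0
    return np.hstack([presence, ms])   # shape: (n_steps, 1 + n_ms)
```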
In these simulations, we have necessarily focused on findings that have not previously been demonstrated for the TD model and that particularly distinguish it from the Rescorla–Wagner (RW) model (Rescorla & Wagner, 1972). Many other results, however, have been previously demonstrated for the TD model or follow trivially from the RW model. For example, Sutton and Barto (1990) demonstrated second-order conditioning (#11.2; see their Fig. 23), extinction (#2.1 and #9.1), and conditioned inhibition (#5.1) in the TD model. In addition, by constraining the US prediction to be nonnegative, they also simulated the failure of conditioned inhibition to extinguish (#5.3; see their Fig. 19; see also Ludvig et al., 2008; Ludvig et al., 2009). Their simulations used a presence representation, but those results hold equally well for the other two representations considered here. Other phenomena follow straightforwardly from the similarity of the TD learning rule to the RW learning rule when questions of timing are removed (see Ludvig et al., 2011, for some discussion of this point): Overexpectation (#7.8), unblocking by increasing the US (#7.3), and superconditioning (#7.9) are all predicted by any TD model.
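The correspondence is easiest to see when each trial is collapsed to a single time step with a presence representation: the TD error then reduces to the RW error, the US minus the summed associative strength of all CSs present. A minimal illustration (parameter values arbitrary), using overexpectation as the worked case:

```python
def rw_update(w, cs_present, us, alpha=0.1):
    """One Rescorla-Wagner trial: the shared error term is the US minus
    the summed prediction of all CSs present; only present CSs change."""
    error = us - sum(w[c] for c in cs_present)
    for c in cs_present:
        w[c] += alpha * error
    return w

# Overexpectation (#7.8): train A and B separately to asymptote ...
w = {"A": 0.0, "B": 0.0}
for _ in range(200):
    rw_update(w, ["A"], us=1.0)
    rw_update(w, ["B"], us=1.0)
# ... then reinforce the AB compound with the same US; the compound
# overpredicts the US, so each CS loses strength, settling near US / 2.
for _ in range(200):
    rw_update(w, ["A", "B"], us=1.0)
```

Because the TD update with a presence representation and trials reduced to one step implements exactly this rule, any TD model inherits these trial-level predictions.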
Other extensions to the TD model have been proposed that expand the reach of the model to other conditioning phenomena that we have not considered here. For example, Ludvig and Koop (2008) proposed a scheme for learning a predictive representation with a TD model that allows for generalization (#3.1) between situations on the basis of their anticipated future outcomes. With this representation, they showed how a TD model could exhibit sensory preconditioning (#11.1 and #11.4), mediated conditioning (#11.5), and acquired equivalence (e.g., Honey & Hall, 1989). Pan, Schmidt, Wickens, and Hyland (2008) proposed a different extension to the TD model, which supposed separate excitatory and inhibitory weights that decayed at different rates. They showed that this formulation produced both spontaneous recovery (#10.6) and rapid reacquisition (#9.2).
No model is perfect, and the TD model is no exception. Several of the major classes of phenomena under consideration in this special issue lie beyond the explanatory power of current TD models, including most of the phenomena grouped under discrimination (Group 4), preexposure (Group 8), and recovery (Group 10). Future research will hopefully provide new angles for integrating these results with the TD model. These extensions will require new formalisms that may attach additional components to the TD model, such as memory- or model-based learning (Daw, Niv, & Dayan, 2005; Ludvig, Mirian, Sutton, & Kehoe, 2012; Sutton, 1990), step-size adaptation (Pearce & Hall, 1980; Sutton, 1992), or additional configural representational elements (Pearce, 1987, 1994). These new developments will likely feature prominently in the next generation of computational models of animal learning.
Preparation of this manuscript was supported by Alberta Innovates - Technology Futures and the Natural Sciences and Engineering Research Council of Canada.
- Church, R. M., & Kirkpatrick, K. (2001). Theories of conditioning and timing. In R. R. Mowrer & S. B. Klein (Eds.), Contemporary learning: Theory and applications (pp. 211–253). Hillsdale, NJ: Erlbaum.
- Ludvig, E. A., Bellemare, M. G., & Pearson, K. G. (2011). A primer on reinforcement learning in the brain: Psychological, computational, and neural perspectives. In E. Alonso & E. Mondragon (Eds.), Computational neuroscience for advancing artificial intelligence: Models, methods and applications (pp. 111–144). Hershey, PA: IGI Global.
- Ludvig, E. A., & Koop, A. (2008). Learning to generalize through predictive representations: A computational model of mediated conditioning. In From Animals to Animats 10: Proceedings of Simulation of Adaptive Behavior (SAB-08), 342–351.
- Ludvig, E. A., Mirian, M. S., Sutton, R. S., & Kehoe, E. J. (2012). Associative learning from replayed experience. Manuscript submitted for publication.
- Ludvig, E. A., Sutton, R. S., Verbeek, E. L., & Kehoe, E. J. (2009). A computational model of hippocampal function in trace conditioning. Advances in Neural Information Processing Systems (NIPS-08), 21, 993–1000.
- Moore, J. W., & Choi, J. S. (1997). The TD model of classical conditioning: Response topography and brain implementation. In J. W. Donahoe & V. P. Dorsel (Eds.), Neural-network models of cognition: Biobehavioral foundations (Advances in Psychology, Vol. 121, pp. 387–405). Amsterdam: North-Holland/Elsevier.
- Moore, J. W., Desmond, J. E., Berthier, N. E., Blazis, D. E. J., Sutton, R. S., & Barto, A. G. (1986). Simulation of the classically conditioned nictitating membrane response by a neuron-like adaptive element: Response topography, neuronal firing and inter-stimulus intervals. Behavioural Brain Research, 21, 143–154.
- Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II (pp. 64–99). New York: Appleton-Century-Crofts.
- Schreurs, B. G., & Westbrook, R. F. (1982). The effects of changes in the CS–US interval during compound conditioning upon an otherwise blocked element. Quarterly Journal of Experimental Psychology, 34B, 19–30.
- Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
- Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. International Conference on Machine Learning (ICML), 7, 216–224.
- Sutton, R. S. (1992). Adapting bias by gradient descent: An incremental version of delta-bar-delta. National Conference on Artificial Intelligence, 10, 171–176.
- Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel & J. W. Moore (Eds.), Learning and computational neuroscience (pp. 497–537). Cambridge, MA: MIT Press.
- Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.