Modeling Linguistic Performance

The goal of ACT-R is to provide accurate cognitive models of learning and performance, as well as accurate neural mappings of cognitive activities. In this chapter, we introduce the ‘subsymbolic’ declarative memory components of ACT-R. These are essential to modeling performance, i.e., actual human behavior in experimental tasks. We then build end-to-end models for a variety of psycholinguistic tasks—list recall, lexical decision, (self-paced) reading—and evaluate how closely these models fit the actual data. The models we build are end-to-end in the sense that they include explicit linguistic analyses that are primarily encoded in the production rules (i.e., in procedural memory), together with a realistic model of declarative memory and simple, but reasonably realistic, vision and motor modules.


The Power Law of Forgetting
The main idea behind the ACT-R declarative memory architecture is that human memory is behaving optimally with respect to the pattern of past information presentation. Each item in memory has had some history of past use. For instance, our memory for one person's name may not have been used in the past month but might have been used five times in the month previous to that. What is the probability that the memory will be needed (used) during the conceived current day? Memory would be behaving optimally if it made this memory less available than memories that were more likely to be used but made it more available than less likely memories. (Anderson and Schooler 1991, 396) In particular, the availability of a specific chunk stored in declarative memory, i.e., its activation, which determines both the probability that it will be successfully retrieved and its retrieval time/latency, is a function of the past use of that memory chunk (among other things; more about other factors later).
To see how this is actually formalized in ACT-R, let's examine the well-known Ebbinghaus (1913) retention data, presented in his Chap. 7 and shown here in [py13] below. The stimulus materials used by Ebbinghaus consisted of nonsense CVC syllables, about 2300 in number. They were mixed together, and syllables were then randomly selected to construct lists of different lengths that needed to be memorized.
The method used to memorize them was 'learning to criterion': Ebbinghaus repeated the list as many times as necessary to reach a prespecified level of accuracy (e.g., one perfect reproduction of the list). The retention measure was 'percent savings', which was computed as follows. First, a list was learned to criterion, and Ebbinghaus counted the number of times he needed to repeat the list until that happened. Let's say he needed to repeat the list 10 times until he could reproduce it perfectly. He waited one day, and then he tried to see how much he still remembered. The way Ebbinghaus chose to measure 'how much' was by seeing how many times he needed to repeat the list on the second day until he learned it to criterion again. Let's say that he needed to repeat the list only 7 times on the second day. So, he saved 3 list repetitions compared to the first day, that is, he had 30% savings ($\frac{10-7}{10}$) after one day.
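The savings computation just described can be sketched in a few lines of Python. The function name `percent_savings` is ours, introduced purely for illustration; it is not part of any library used in this chapter.

```python
def percent_savings(initial_reps: int, relearning_reps: int) -> float:
    """Percent of repetitions saved when relearning a list to criterion."""
    if initial_reps <= 0:
        raise ValueError("initial_reps must be positive")
    return 100 * (initial_reps - relearning_reps) / initial_reps


# The example from the text: 10 repetitions on day one, 7 on day two.
print(percent_savings(10, 7))  # 30.0
```

A list that needs just as many repetitions on the second day as on the first yields 0% savings, i.e., nothing was retained according to this measure.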
[py13]
>>> # loading the data                                                       1
>>> import pandas as pd                                                      2
>>> ebbinghaus_data = pd.read_csv('./data/ebbinghaus_retention_data.csv')   3

In [py13], we load the Ebbinghaus data from a csv file using the pandas library (lines 2-3). The data is displayed on lines 5-12. We see that there are 7 data points/observations (7 rows, numbered 0 through 6). The first column in the data is the independent variable (time): it records the delay in hours at which the relearning of the syllable series took place relative to the initial learning-to-criterion time. The second column is the dependent variable (the measure of activation in memory): it records the percent savings observed for the corresponding delay in hours, i.e., the reduction in repetitions of the target series of syllables needed to relearn it to criterion.
For completeness, we provide the summary of the Ebbinghaus data in [py14] below. This gives us the total number of observations (7), the means for the delay in hours and savings percentages (138.59 and 35.17 respectively), the standard deviation etc.
[py14]
>>> ebbinghaus_data.describe()

A much better way to develop an intuitive understanding of this data is to plot it. In [py15] below, we load the visualization (plotting) libraries matplotlib and seaborn, and specify a variety of options for them (lines 2-20). If working in the terminal, you should probably simply load these libraries and accept their default settings. We can now define a function to plot the data, and then call it (lines 22-57 in [py15]). The resulting 3 plots are provided in Fig. 6.1.
[py15]
>>> # settings for data visualization                                        1
>>> import matplotlib as mpl                                                 2
>>> mpl.use("pgf")                                                           3
>>> pgf_with_pdflatex = {"text.usetex": True, "pgf.texsystem": "pdflatex",   4
...                      "pgf.preamble": [r"\usepackage{mathpazo}",          5

The three panels in Fig. 6.1 plot the retention data in its non-transformed form (panel a), with a logarithmically compressed y axis (panel b: we plot log percent savings), and with both axes logarithmically compressed (panel c). We see that a linear relation emerges in the final log-log plot, which indicates that forgetting (decay of chunk activation in declarative memory) has a particular functional form, to which we now turn. (The log tick marks are in base 10 for readability, although we always work with the natural logarithm, i.e., log base e, in what follows.)
The forgetting curve in plot (a) of Fig. 6.1 is sometimes taken to reflect an underlying negative exponential forgetting function of the form:
(1) $P = \alpha e^{-\beta T}$, where:
• $P$ is the memory-related performance measure (percent savings in the Ebbinghaus data),
• $T$ is the time delay since presentation (since initial learning to criterion in our case), and
• $\alpha$, $\beta$ are the free parameters of the model, to be fit to the data.
But this predicts that performance should be a linear function of time if we log-transform the performance $P$:
(2) $\log(P) = \log(\alpha) - \beta T$, i.e., a linear function of time $T$ with intercept $\log(\alpha)$ and negative slope $-\beta$.
One way to intuitively think about logarithmic transformation/logarithmic compression is to think about a series of evenly spaced trees that you can see on the side of a long straight road as you look up the road. The distances between the trees appear smaller and smaller as the trees are further and further away, until the trees basically become one tree as the gaze approaches the horizon. The further away two trees are from us, the smaller the distance between them seems to be.
Similarly, the larger two numbers are, the more the difference between them is compressed: the difference between 4 and 2 is compressed much less than the difference between 14 and 12 under the log transform. Equivalently, the larger a number is, the higher its compression under the log transform. This is shown on lines 3-6 in [py16] below, as well as in the plot of the log transform (for x ≥ 1) in Fig. 6.2.
Throughout this book, when we use logarithm simpliciter without explicitly specifying the base, we always mean the natural logarithm function, i.e., log base e. We will therefore abbreviate the natural logarithm of a number $x$ simply as $\log(x)$ or, dropping parentheses, $\log x$ (rather than $\ln x$ or $\log_e x$).
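The compression effect for the two pairs of numbers mentioned above can be checked directly. This is a minimal sketch using only the standard library; the helper name `log_gap` is ours and is not taken from the book's listings.

```python
import math


def log_gap(larger: float, smaller: float) -> float:
    """Difference between two numbers after the natural-log transform."""
    return math.log(larger) - math.log(smaller)


# Both pairs are 2 units apart before the transform.
gap_small_numbers = log_gap(4, 2)    # log(4) - log(2) = log(2)
gap_large_numbers = log_gap(14, 12)  # log(14) - log(12) = log(7/6)

print(f"{gap_small_numbers:.3f} {gap_large_numbers:.3f}")  # 0.693 0.154
```

The same raw difference of 2 shrinks to roughly 0.69 for the small pair, but to roughly 0.15 for the large pair.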
The idea behind the exponential model of forgetting is that, once we logarithmically compress performance, log performance will be a linear function of time. However, panel (b) of Fig. 6.1 shows that this is not the case: the relationship between the delay on the x-axis, measured in hours, and savings on the y-axis, measured in log-transformed percentages, is still not linear.
We can use our recently acquired knowledge of Bayesian modeling with pymc3 to compare the actual observations and the predictions made by a theory that hypothesizes that performance (forgetting) is an exponential function of time.
In [py17], we import the relevant libraries and then store the delay and savings data in separate variables for convenience (lines 1-6). We then write up the exponential model directly from the equation in (2): the likelihood function defined on lines 15-17 says that log savings (i.e., log performance) is a linear function of delay (with two free parameters intercept and slope), plus some normally distributed noise with standard deviation sigma.
The hypothesis that log savings are a linear function of delay is tantamount to saying that, if we plot the mean mu of log savings for any given delay, we obtain a line. A line is standardly characterized in terms of an intercept and a slope (line 15 in [py17] ): mu is a deterministic function of delay, given parameters intercept and slope. The intercept corresponds to log(α) in formula (2) above, and the slope corresponds to −β.
Lines 11-13 in [py17] provide low information priors for the intercept, slope, and noise. The priors have forms familiar from the previous chapter. We set the standard deviations for all priors to 100, which is very non-committal since the response/dependent variable is measured in log-percent units.
Once the priors and likelihood are specified, we can run the model. We save the result, i.e., our posterior estimates for the parameters intercept, slope and sigma, in the variable trace.

With the posterior distributions for our exponential model in hand, we can compare the predictions made by the model against the actual data to see how close the predictions are. The predictions are stored in the variable mu. These are predicted log savings. If we exponentiate them, we obtain predicted savings.
Furthermore, if we look at the 95% credible intervals for the predicted savings, we can see the range of predictive variability/uncertainty in the predictions made by the exponential model. If the actual savings fall within these credible intervals, we can take the model to be empirically adequate. The code in [py18] below generates two plots that enable us to empirically evaluate the exponential model. The two plots are provided in Fig. 6.3.
The plot in the top panel of Fig. 6.3 reproduces the middle panel of Fig. 6.1, together with the line of best fit predicted by the exponential model. It is clear that the line does not match the actual data very well.
This lack of empirical adequacy is also visible in the second plot of Fig. 6.3, which plots the percent savings predicted by the exponential model on the y-axis against the observed percent savings on the x-axis. The red diagonal line indicates the points where the predictions would be exactly equal to the observed values.
We see that the median predicted savings are not very close to the observed values, especially for higher savings (associated with a short delay). Some of the points are pretty far from the diagonal line, and some of the 95% intervals do not cross the diagonal line at all, or barely cross it.
The fact that the points are pretty far from the diagonal line indicates that the exponential model makes incorrect predictions. The fact that some of the 95% intervals around those median predictions do not cross the diagonal line, or barely cross it, indicates that the exponential model is not only wrong, but it is also pretty confident about some of its incorrect predictions.
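As a rough, non-Bayesian cross-check of this conclusion, we can fit the line in (2) by ordinary least squares instead of sampling a posterior. This is a sketch under stated assumptions, not the pymc3 model discussed above: the seven delay/savings pairs are hard-coded here rather than read from the csv file, they are the values commonly reported for Ebbinghaus's retention experiment (they reproduce the means 138.59 and 35.17 given in the data summary), and the helper `fit_line` is ours.

```python
import math

# Assumed data: commonly reported Ebbinghaus retention values.
delay_hours = [0.33, 1, 8.8, 24, 48, 144, 744]
savings_pct = [58.2, 44.2, 35.8, 33.7, 27.8, 25.4, 21.1]


def fit_line(x, y):
    """Ordinary least squares fit; returns (intercept, slope, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope, sxy ** 2 / (sxx * syy)


# Exponential model: log savings is linear in (untransformed) delay.
log_savings = [math.log(s) for s in savings_pct]
intercept, slope, r_squared = fit_line(delay_hours, log_savings)

# Exponentiate the fitted line to get predicted savings.
predicted = [math.exp(intercept + slope * t) for t in delay_hours]
for t, obs, pred in zip(delay_hours, savings_pct, predicted):
    print(f"delay={t:7.2f} h  observed={obs:5.1f}  predicted={pred:5.1f}")
print(f"R^2 = {r_squared:.2f}")
```

The least-squares line explains well under two thirds of the variance in log savings, and it badly underpredicts the savings at the shortest delay, which mirrors the misfit visible in the plots.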
We can now fairly confidently conclude that memory performance (forgetting) is not a negative exponential function of time. Instead, plot (c) in Fig. 6.1 shows that performance is a power function of time. That is, performance is a linear function of time only if both performance and time are log-transformed:
(3) $\log(P) = \log(\alpha) - \beta \log(T)$
(4) The power law of forgetting: $P = \alpha T^{-\beta}$ (final form of the forgetting function, obtained by exponentiating both sides of (3))
A line fits the log-log (log savings-log delay) data very well. Once again, we can set up a Bayesian model that directly implements the formula in (3), and then examine its predictions. The code for the power law model is provided in [py19], and the code generating two plots parallel to the plots we generated for the exponential model is provided in [py20]. The resulting plots are provided in Fig. 6.4.
The top plot in Fig. 6.4 reproduces the log-log plot in the third panel of Fig. 6.1, together with the line of best fit predicted by the power law model of forgetting. We see that the model predictions match the data very well. This is further confirmed by the second plot in Fig. 6.4. The points are almost perfectly aligned with the diagonal. That is, the savings predicted by the power law model are very close to the observed savings.
Furthermore, the credible intervals around most predictions are so tight that we do not see any segments extending outward from the plotted points. That is, the power law model makes correct predictions, and it is furthermore highly confident about the predictions it makes (with somewhat less confidence for higher savings).
We conclude that the better model for the Ebbinghaus forgetting data is a power law model, and not an exponential one.
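The comparison between the two models can also be illustrated with a simple least-squares fit in place of the Bayesian models. As before, this is only a sketch: the data values are hard-coded (the commonly reported Ebbinghaus figures, which match the means in the data summary), and `fit_line` is a helper we define for this illustration.

```python
import math

# Assumed data: commonly reported Ebbinghaus retention values.
delay_hours = [0.33, 1, 8.8, 24, 48, 144, 744]
savings_pct = [58.2, 44.2, 35.8, 33.7, 27.8, 25.4, 21.1]


def fit_line(x, y):
    """Ordinary least squares fit; returns (intercept, slope, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, sxy ** 2 / (sxx * syy)


log_savings = [math.log(s) for s in savings_pct]
log_delay = [math.log(t) for t in delay_hours]

# Exponential model: log savings against delay.
_, _, r2_exponential = fit_line(delay_hours, log_savings)

# Power law model: log savings against log delay.
log_alpha, neg_beta, r2_power = fit_line(log_delay, log_savings)
alpha, beta = math.exp(log_alpha), -neg_beta

print(f"exponential R^2 = {r2_exponential:.2f}")
print(f"power law   R^2 = {r2_power:.2f}")
print(f"power law estimate: P = {alpha:.1f} * T^(-{beta:.3f})")
```

The log-log line accounts for nearly all the variance in log savings, whereas the semi-log line accounts for less than half of it, in line with the conclusion drawn from the posterior predictions.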

The Base Activation Equation
The ACT-R base activation equation in (5) directly reflects the power law of forgetting:
(5) $B_i = \log\left(\sum_{k=1}^{n} t_k^{-d}\right)$
The base activation $B_i$ of a chunk $i$ in declarative memory is a log-transformed measure of performance. So the actual measure of performance is $e^{B_i}$, and $e^{B_i}$ is a power function of the times $t_k$ since the chunk was presented. 'Presentation' in ACT-R really means two things: (i) the chunk was created for the first time, for example, because the human (or the model) was confronted with a new fact, or (ii) the chunk was re-created. Re-creating most often happens when the human (or the model) correctly recalls a chunk, after which the chunk is stored again in memory.
Memory-related performance $e^{B_i}$ on a specific chunk $i$ at a specific time of retrieval from memory $t_{\mathrm{now}}$ is the sum of $t_k^{-d}$ over all presentations $k$ of the chunk, where $k$ varies from 1 to the total number of chunk presentations $n$. For each presentation $k$, $t_k$ is the period of time elapsed between the time of presentation $k$ and the time of retrieval $t_{\mathrm{now}}$. That is, $t_k$ is the same as the delay variable in the Ebbinghaus data. The negative exponent $-d$ (decay) is the equivalent of the $-\beta$ slope parameter in our log-log (power-law) model of the Ebbinghaus data.
The basic intuition behind the base activation equation in (5) is that at any point in time, memories vary in how likely they are to be needed and the memory system tries to make available those memories that are most likely to be useful. The memory system can use the past history of use of a memory to estimate whether the memory is likely to be needed now. This view sees human memory in some sense as making a statistical inference. However, it does not imply that memory is explicitly engaged in statistical computations. Rather, the claim is that whatever memory is doing parallels a correct statistical inference. (Anderson and Schooler 1991, 400)
What memory is inferring is activation, which reflects "need probability": the probability that we will need a particular chunk now. The basic assumption, already developed in Anderson (1990), is that chunks (facts in the declarative memory) are considered in order of their need probabilities until the need probability of a chunk is so low that it is not worth trying to retrieve that chunk anymore, i.e., the chunk's activation is below a retrieval threshold.
This description of what declarative memory does when it retrieves chunks is serial, but the actual retrieval process is formalized as a parallel process in which chunks are simultaneously accessed, and the one with the highest activation is retrieved (if the activation exceeds the threshold).
Crucially, this theory of declarative memory derives specific predictions about the relationship between activation, which is an unobserved quantity reflecting need probability, and observable/measurable quantities: recall latency (how long retrieving a fact takes) and recall accuracy (what is the probability of a successful retrieval).
The key to understanding the connection between activation on one hand, and recall latency and accuracy on the other hand, is to understand the specific way in which activation reflects need probability. The statement in (6) below is left rather implicit in Anderson (1990) and Anderson and Schooler (1991):
(6) Activation $B_i$ is the logit (log odds) transformation of need probability: $B_i = \log\left(\frac{p_i}{1-p_i}\right)$, where $p_i$ is the need probability of chunk $i$.
Thus, exponentiated activation $e^{B_i}$, which is the actual measure of performance, is the need odds $o_i$ of chunk $i$:² $e^{B_i} = o_i$ is the odds that chunk $i$ is needed.
Summarizing, the base-level activation equation in (5) says that exponentiated activation $e^{B_i}$, which encodes the 'need odds' of chunk $i$ (the odds that chunk $i$ is needed at the time of retrieval $t_{\mathrm{now}}$), is a power function of time. This power function has two components, $\sum_{k=1}^{n}$ and $t_k^{-d}$, which formalize the following:
• $\sum_{k=1}^{n}$: individual presentations 1 through $n$ of a chunk $i$ have a strengthening impact on the need odds of chunk $i$; a presentation $k$ additively increases the previous need odds for chunk $i$; these impacts are summed up to produce a total strength/total need odds for chunk $i$;
• $t_k^{-d}$: the strengthening impact of a presentation $k$ on the total need odds for chunk $i$ is a power function of time $t_k^{-d}$, where $t_k$ is the time elapsed since presentation $k$, that is, $t_k$ is the delay, i.e., the period of time elapsed between the time of presentation $k$ and the time of retrieval $t_{\mathrm{now}}$; raising the delay $t_k$ to the $-d$ power (the decay) produces the power law of forgetting.
The parameter $d$ in the base activation equation is usually set to $\frac{1}{2}$, so the equation simplifies to:
(7) $B_i = \log\left(\sum_{k=1}^{n} \frac{1}{\sqrt{t_k}}\right)$
² Recall that odds are a deterministic function of probability: $o_i = \frac{p_i}{1-p_i}$, where $p_i$ is the 'need probability' of chunk $i$, i.e., the probability that chunk $i$ is needed at the retrieval time $t_{\mathrm{now}}$.
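The chain from presentation history to base activation, need odds and need probability can be made concrete with a small computation. This is a minimal sketch: the function names and the two presentation times are ours, chosen only so that the arithmetic is easy to follow, and times are assumed to be in seconds.

```python
import math


def base_activation_at(t_now, pres_times, decay=0.5):
    """Base activation: log of the summed, power-law-decayed presentations."""
    delays = [t_now - t for t in pres_times if t < t_now]
    if not delays:
        raise ValueError("no presentation precedes the retrieval time")
    return math.log(sum(delay ** -decay for delay in delays))


def need_odds(activation):
    """Exponentiated activation, i.e., the need odds of the chunk."""
    return math.exp(activation)


def need_probability(activation):
    """Inverse logit: recover need probability from activation."""
    odds = need_odds(activation)
    return odds / (1 + odds)


# Hypothetical history: presentations at 1 s and 4 s, retrieval at 5 s.
# The delays are 4 s and 1 s, so the need odds are 1/sqrt(4) + 1/sqrt(1) = 1.5.
activation = base_activation_at(5, [1, 4])
print(f"B = {activation:.3f}")
print(f"odds = {need_odds(activation):.3f}")
print(f"p = {need_probability(activation):.3f}")
```

With need odds of 1.5, the need probability is 1.5 / 2.5 = 0.6, and the base activation is the log of the odds, log(1.5).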
Let's work through an example. Assume we have the following chunk of type word in declarative memory, repeated from Chap. 2, and represented in both graph and AVM form:
Assume this chunk is presented 5 times, once every 1.25 s, starting at time 0 s. We want to plot its base-level activation for the first 10 s.
In [py21] below, we define a base_activation function. Its inputs are the vector of presentation times for the chunk (pres_times-the first argument of the function), and also the vector consisting of the moments of time at which we want to compute the activation (moments-the second argument of the function). You can think of these moments of time as potential retrieval times. The output of the base_activation function is the vector base_act of base-activation values at the corresponding moments of time.
• line 2: we initialize the base activation base_act: we set it to be a long vector of 0s, as long as the number of moments we want to compute the activation for;
• line 3: the for loop on lines 3-6 in [py21] computes the actual activation: for every point idx (short for 'index') at which we want to compute the activation, we do several things, discussed below;
• line 4: we identify the moment in time at which we should compute the activation, namely moments[idx]; we identify the presentation times that precede this moment, namely pres_times<moments[idx], since they are the only presentations contributing to base activation at this moment in time; we retrieve these presentation times and store them in the variable past_pres_times;
• lines 5-6: with these past presentation times in hand, we compute base level activation following the base level equation in (7); first, we compute the time intervals since those past presentations: moments[idx]-past_pres_times; then we take the square root of these intervals np.sqrt(...) and then, the reciprocal of those square roots 1/np.sqrt(...); finally, we sum all those reciprocals np.sum(...);
• lines 7-9: now, the vector base_act stores exponentiated activations; to get to the actual activations, we need to take the log of the quantities currently stored in base_act; since log(0) is undefined, we identify the non-0 quantities in base_act (line 7), take the log of those quantities and replace them with their logs (lines 8-9).
[py21]
>>> def base_activation(pres_times, moments):

On line 12 in [py21], we generate a vector of 5 presentation times evenly spaced between 0 and 5000 ms. As shown on line 14, these presentation times are at 0, 1.25, 2.5, 3.75 and 5 s. On line 15, we generate a vector of the moments in time at which we want to compute the activation: we want to see the ebbs and flows of activation for the first ten seconds, and we want to see this every ms, so we generate a vector with 10000 numbers, from 1 to 10000 ms (lines 17-18). Finally, we compute the base activation relative to these moments and presentation times using our base_activation function.
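The line-by-line description of base_activation can be turned into the following sketch. It follows the steps described for [py21] but is our own reconstruction, not the original listing; in particular, it assumes that all times are expressed in seconds (so the millisecond grid is divided by 1000), which may differ from how the original handles units.

```python
import numpy as np


def base_activation(pres_times, moments):
    """Base activation at each moment, with the decay d fixed at 1/2."""
    # Start from a vector of 0s, one entry per moment of interest.
    base_act = np.zeros(len(moments))
    for idx in range(len(moments)):
        # Only presentations that precede the current moment contribute.
        past_pres_times = pres_times[pres_times < moments[idx]]
        # Sum of reciprocal square roots of the elapsed intervals.
        base_act[idx] = np.sum(1 / np.sqrt(moments[idx] - past_pres_times))
    # log(0) is undefined, so only the non-zero sums are log-transformed.
    non_zero = base_act != 0
    base_act[non_zero] = np.log(base_act[non_zero])
    return base_act


# Assumption: times in seconds. Five presentations between 0 and 5 s,
# and one evaluation point per millisecond over the first 10 s.
pres_times = np.linspace(0, 5, 5)
moments = np.arange(1, 10001) / 1000
base_act = base_activation(pres_times, moments)

print(pres_times)                    # presentation times in seconds
print(base_act[499], base_act[-1])   # activation at 0.5 s and at 10 s
```

Under these assumptions, the activation at 10 s is higher than the activation half a second after the first presentation, even though the last presentation is by then 5 s in the past.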
We can now plot the result: the code for the plot is provided in [py22] below, and the plot itself is provided in Fig. 6.5.
We see that the activation of the chunk spikes after each presentation and then drops as a power function of time until the next presentation/spike. We also see that the maximum activation (the height of the spikes) slowly increases with each presentation, just as the decay of activation becomes more and more mild. After the fifth presentation, the activation of the chunk decreases pretty slowly, and even in the long term (at 10 s), its activation is higher than the activation it had shortly after the first presentation (say, at 500 ms). Thus, after repeated presentations, we can say that the chunk has been retained in 'long-term' memory.
What Fig. 6.5 shows is that forgetting + rehearsal is an essential part of remembering. The model captures the common observation that cramming for an exam never works long term (the activation of the newly learned facts decreases very steeply after the first presentation), while properly spaced rehearsals or practice lead to long-term retention.
To conclude this section, we note that ACT-R does not distinguish between short-term and long-term memory. Both of them are distinct from working memory, which can be thought of as the state of the buffers at any given time. Modeling memory as a power function of time generates the proper short-term memory behavior (after one presentation), as well as the proper long-term memory behavior (after a series of presentations).

The Attentional Weighting Equation
In addition to base activation, a chunk's activation depends on the context in which it is needed. What counts as "context" within the ACT-R cognitive architecture? Context for cognitive processes is the information that is instantaneously available to the procedural module: all the buffers and the chunks that reside in them (basically, working memory).³
We know that chunks consist of slot-value pairs. To capture the role of context, ACT-R assumes that any chunk $V$ that appears as the value of some slot in a buffer spreads activation to (i) chunks in declarative memory that have $V$ as one of their values, and (ii) chunks in declarative memory that are content-identical to $V$, i.e., they consist of the same set of slot-value pairs as $V$. This context-driven boost in activation for chunks in declarative memory is known as spreading activation.
An example will help shed more light on the workings of spreading activation. Suppose that only one buffer carries a chunk, say, the imaginal buffer. And the chunk in the imaginal buffer is the representation of the word car. We assume that the chunk has four slots: form, meaning, category and number. Each of these slots, in turn, has a chunk as its value: the form, the interpretation, the syntactic category and the morphological number, respectively. Each of these values comes with a weight, as shown in (10).
(10) Chunk in the imaginal buffer (slot: value, weight):
  form: car, $W_1$
  meaning: ⟦car⟧, $W_2$
  category: N, $W_3$
  number: sg, $W_4$
Any chunk $i$ in declarative memory that shares values with the imaginal chunk (10)⁴ receives spreading activation proportional to (∝) the weights $W_j$ (for $j \in \{1, 2, 3, 4\}$) of the values that chunk $i$ has in common with the imaginal chunk.
That is, chunk $i$ receives an activation boost just by virtue of containing any of the four values in (10), i.e., the form car ($W_1$), the meaning ⟦car⟧ ($W_2$), the category N ($W_3$) or the number sg ($W_4$). Intuitively, sharing a value with a context chunk (like the car chunk in the imaginal buffer) 'connects' chunk $i$ in declarative memory to the context chunk. Activation can now spread/flow along this connection, and this spreading activation is proportional to the weight $W_j$ (in symbols: $\propto W_j$) of the connecting value.
Note that these values are themselves chunks, but we will continue to refer to them as values and explicitly call 'chunk' only the chunk in declarative memory that receives spreading activation, and the context chunk that is the source of the spreading activation.
We keep insisting that spreading activation is proportional to a weight $W_j$ ($\propto W_j$), but not identical to it, because chunk $i$ in declarative memory does not simply add $W_j$ to its activation. Every weight $W_j$, or source activation, is scaled by an associative strength $S_{ji}$, and it is the product $W_j \cdot S_{ji}$ that gets added to the activation of chunk $i$.
Intuitively, we can think of this associative strength as the strength (or the resistance, if you will) of the connection between chunk $i$ (the activation-receiving chunk in declarative memory) and the context chunk that is the source of spreading activation. Every value shared between chunk $i$ in declarative memory and the context chunk 'creates' a connection along which activation $W_j$ can spread/flow, but this connection has a strength/resistance $S_{ji}$ specific to the value $j$ that 'created' the connection and to the activation-receiving chunk $i$.
In our specific example, we have four weights/source activations $W_1$, $W_2$, $W_3$, $W_4$ and four corresponding associative strengths $S_{1i}$, $S_{2i}$, $S_{3i}$, $S_{4i}$.
Associative, or 'connection', strength is basically a measure of how predictive any specific value in a context chunk is of chunk i. This idea of 'predictive' association will make more sense in a moment when we introduce the concept of fan, and will be further clarified in Chap. 8, where we discuss and model the classic fan experiment in Anderson (1974).
In our example, the form car has weight $W_1$, but we don't simply add that to the base activation $B_i$ of our chunk $i$ in declarative memory. Instead, we scale it by the associative strength $S_{1i}$, which is the strength of the connection created by the value car (which resides in the form slot of the imaginal buffer) and our chunk $i$ (which resides in declarative memory, and which has the same value car in one of its slots). Thus, the activation boost spreading to chunk $i$ in declarative memory from the value car in the imaginal buffer is given by the product $W_1 S_{1i}$.
The resulting total activation $A_i$ for any chunk $i$ in declarative memory will therefore be the sum of its base activation $B_i$ (which reflects its past history of usage) and whatever spreading activation it gets from the cognitive context, which in our example is restricted to the imaginal buffer. When activation spreads along all four values of the imaginal chunk (that is, chunk $i$ in declarative memory has all these four values in its slots), we have:
(11) $A_i = B_i + W_1 S_{1i} + W_2 S_{2i} + W_3 S_{3i} + W_4 S_{4i}$
How are we to set the weights and the associative strengths? One answer that immediately comes to mind is: empirically. We set some low-information/vague priors over the weights and the associative strengths and infer them from suitable experimental data. This is in fact what we will do in Chap. 8. For now, we will simply discuss some reasonable default values.
Every chunk in a buffer that spreads activation is assumed to have a total source activation $W$ that gets evenly distributed among the values that reside in the slots of that chunk. $W$ is by default set to 1. In our example, this would mean that $W_1 = W_2 = W_3 = W_4 = \frac{1}{4}$.
(12) Default value for source activation: $W_j = \frac{W}{n}$, where:
• $j$ goes from 1 to the number of slots $n$ that carry a value;
• $W$ is by default set to 1.
Let's turn now to the associative strengths $S_{ji}$, where $i$ is the chunk in declarative memory that receives spreading activation, and $j$, which varies from 1 to $n$, is a value in the cognitive context buffer that spreads activation (this buffer has $n$ slots that carry a value). For these associative strengths $S_{ji}$, we want to capture the intuition that:
• the association should be 0 if $j$ does not associate with $i$ (it is not predictive of $i$ in any way),
• it should be high if $j$ uniquely associates with the chunk $i$ (because $j$ is then highly predictive of $i$), which would happen if there is no other chunk in declarative memory that is associated with $j$, and finally,
• the association strength should decrease as more and more chunks in declarative memory are associated with $j$, since $j$ becomes less predictive of any of these chunks.
This intuition is captured by the following formula (see Anderson and Schooler 1991; Anderson 2007):
(13) $S_{ji} \approx \log \frac{\mathrm{prob}(i \mid j)}{\mathrm{prob}(i)}$
In words, the strength of association between value $j$ in a context buffer that spreads activation and chunk $i$ in declarative memory that receives activation is approximately the log probability of needing chunk $i$ from memory conditional on the fact that value $j$ is present in the buffer, 'normalized' by the probability that chunk $i$ is unconditionally needed.
Formally, this is the pointwise mutual information (pmi) between (i) the event that chunk $i$ is needed/requested from declarative memory and (ii) the event that $j$ is a value in the activation-spreading context buffer, i.e., a chunk in one of the slots of that context buffer:
(14) Pointwise mutual information between two events $i$, $j$: $\mathrm{pmi}(i, j) = \log \frac{\mathrm{prob}(i, j)}{\mathrm{prob}(i)\,\mathrm{prob}(j)} = \log \frac{\mathrm{prob}(i \mid j)}{\mathrm{prob}(i)} = \log \frac{\mathrm{prob}(j \mid i)}{\mathrm{prob}(j)}$
As the definition in (14) above shows, pmi is a symmetric measure of association between single events (not an expectation, like mutual information). It can have both negative and positive values, and it is 0 if the events are independent. Understanding association strengths $S_{ji}$ in terms of the pmi between the declarative memory chunk $i$ and the value $j$ in the cognitive context makes intuitive sense: strength of association is a measure of how predictive the contextual value $j$ is of the need to retrieve chunk $i$ from memory. See also Appendix A.3 in Reitter et al. (2011) for a short discussion.
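The properties of pmi mentioned above (it is 0 under independence, and it can be positive or negative) are easy to verify numerically. The probabilities below are made up for illustration only, and the helper `pmi` is ours.

```python
import math


def pmi(p_joint, p_i, p_j):
    """Pointwise mutual information between two events, in nats."""
    return math.log(p_joint / (p_i * p_j))


# Hypothetical marginal probabilities for the two events.
p_i, p_j = 0.2, 0.5

# Independence: the joint probability is the product of the marginals.
pmi_independent = pmi(p_i * p_j, p_i, p_j)

# Positive association: the events co-occur more often than chance predicts.
pmi_positive = pmi(0.15, p_i, p_j)

# Negative association: the events co-occur less often than chance predicts.
pmi_negative = pmi(0.05, p_i, p_j)

# Equivalent formulation via the conditional probability prob(i | j).
p_i_given_j = 0.15 / p_j
pmi_conditional = math.log(p_i_given_j / p_i)

print(pmi_independent, pmi_positive, pmi_negative, pmi_conditional)
```

The last value agrees with the positive-association case, since dividing the joint probability by prob(j) yields prob(i | j).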
ACT-R has developed a way to estimate the values in (13). First, for cases in which $i$ and $j$ are not associated in any way, i.e., they are independent, the joint probability $\mathrm{prob}(i, j)$ is the product of the marginals $\mathrm{prob}(i)\,\mathrm{prob}(j)$, so:
(15) $\mathrm{prob}(i \mid j) = \frac{\mathrm{prob}(i, j)}{\mathrm{prob}(j)} = \frac{\mathrm{prob}(i)\,\mathrm{prob}(j)}{\mathrm{prob}(j)} = \mathrm{prob}(i)$
To put it differently, if $i$ and $j$ are independent, the contextual value $j$ is not predictive at all of the need to retrieve chunk $i$ from declarative memory, so $\mathrm{prob}(i \mid j) = \mathrm{prob}(i)$. Therefore, the contextual value $j$ does not boost the activation of chunk $i$ in any way, and to ensure that no activation spreads, we zero out the 'connection' strength: $S_{ji} = \log \frac{\mathrm{prob}(i \mid j)}{\mathrm{prob}(i)} = \log \frac{\mathrm{prob}(i)}{\mathrm{prob}(i)} = \log 1 = 0$.
If $i$ and $j$ are not independent, i.e., there is an association between them, so $j$ has some predictive value with regard to the need probability of chunk $i$, the common estimate for $\mathrm{prob}(i \mid j)$ is as follows:
(16) $\mathrm{prob}(i \mid j) = \frac{1}{\mathrm{fan}_j}$, where:
• $\mathrm{fan}_j$ is the number of chunks associated with $j$ in declarative memory, i.e.,
• $\mathrm{fan}_j$ is the number of chunks that have $j$ as a value in one of their slots.
The intuition behind this common estimate is that a value j in the cognitive context is equally predictive of any chunk in declarative memory that it is associated with. Basically, in the past, whenever we had value j in the cognitive context, we were equally likely to need to retrieve any of the declarative memory chunks associated with j. This kind of assumption is unrealistic in a naturalistic, 'ecologically valid' setting, but it is probably reasonable in the context of a counterbalanced experiment.
A common estimate for $\mathrm{prob}(i)$ is:
(17) $\mathrm{prob}(i) = \frac{1}{|dm|}$, where:
• $|dm|$ is the size of declarative memory (dm): the number of chunks present in dm.
Again, this is extremely unrealistic since it assumes that all the chunks in declarative memory have the same history of past usage (or no history of past usage), so they have the same probability of being needed/retrieved. This estimate makes sense as a flat uniform prior used for convenience, perhaps in an experimental setting where frequentist and Bayesian posterior estimates of need probabilities for experimental items are intended to be identical.
With these two assumptions in place, associative strength $S_{ji}$ can be estimated as follows:
(18) $S_{ji} = \log \frac{|dm|}{\mathrm{fan}_j} = \log |dm| - \log \mathrm{fan}_j$
Note how the dependency on a specific declarative memory chunk i disappears because of the (unrealistic) uniformity assumptions built into the prob(i| j) and prob(i) estimates.
It is hard to estimate the size of declarative memory, so the minuend log |dm| is often treated as a free (hyper)parameter S, with the requirement that S should be larger than $\log \mathrm{fan}_j$ for any value j. If this were not so, association strength could be negative in some cases, and spreading activation would then decrease (inhibit) base activation for some items rather than simply failing to boost it.
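In sum, the final form for associative strength that is commonly used in ACT-R modeling is as follows:
(19) $S_{ji} = \begin{cases} S - \log \mathrm{fan}_j & \text{if } i \text{ and } j \text{ are associated (not independent)} \\ 0 & \text{otherwise} \end{cases}$, where:
• S is the maximum associative strength, a free (hyper)parameter.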
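As a rough check on this estimate, the sketch below computes the associative strength in two ways: directly from the uniform probability estimates, and from the maximum-strength parameter S set to log |dm|. The function name `associative_strength` and the sizes used (100 chunks, fans of 4 and 20) are hypothetical.

```python
import math

def associative_strength(fan_j, max_strength, associated=True):
    """S_ji = S - log(fan_j) for associated i, j; 0 when they are independent."""
    if not associated:
        return 0.0
    return max_strength - math.log(fan_j)

# Hypothetical numbers: 100 chunks in declarative memory, 4 of which hold value j.
dm_size, fan_j = 100, 4

# Estimate from the uniform assumptions: log(prob(i|j) / prob(i)).
from_probabilities = math.log((1 / fan_j) / (1 / dm_size))

# The same quantity when the free parameter S is set to log |dm|.
from_parameter = associative_strength(fan_j, max_strength=math.log(dm_size))

# A value with a larger fan is less predictive of any one chunk.
weaker = associative_strength(20, max_strength=math.log(dm_size))

print(from_probabilities, from_parameter, weaker)
```

The two estimates agree up to floating-point error, and the strength shrinks as the fan grows, staying positive as long as S exceeds the log of the fan.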
Putting this together, we arrive at the activation equation in (20). This shows how spreading activation from just one buffer affects the total activation of elements in declarative memory.
(20) $A_i = B_i + \sum_{j} W_j S_{ji}$
Extending to more than one buffer is easy: we just sum up the spreading activation from all the buffers, as shown in (21).
(21) $A_i = B_i + \sum_{k=1}^{n} \sum_{j} W_{kj} S_{ji}$
This equation has the same three major components as the simpler one in (20) above. Differences:
a. We sum over all buffers k, from 1 to n buffers in the cognitive context.
b. The weights / source activations $W_{kj}$ are indexed with both the value j that is their source, as well as the buffer k where value j is located.
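The summation over buffers and values can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each buffer is assumed to have a total source activation of 1 that is split equally among its values, and the function names, fan counts, base activation and S value are all hypothetical.

```python
import math

def spreading_activation(buffer_values, chunk_values, fan, max_strength, total_source=1.0):
    """Spreading activation from one buffer: sum over its values j of W_j * S_ji."""
    w = total_source / len(buffer_values)  # equal source activation per value (assumed)
    return sum(
        w * (max_strength - math.log(fan[j]))
        for j in buffer_values
        if j in chunk_values  # values not stored in the chunk contribute S_ji = 0
    )

def total_activation(base, buffers, chunk_values, fan, max_strength):
    """Base activation plus the spreading activation summed over all buffers."""
    return base + sum(
        spreading_activation(values, chunk_values, fan, max_strength)
        for values in buffers
    )

# Hypothetical context: two buffers and a chunk holding three values.
fan = {"N": 4, "sg": 3, "car": 1}
chunk_values = {"book", "N", "sg"}
goal, imaginal = ["N", "sg"], ["car"]

activation = total_activation(0.5, [goal, imaginal], chunk_values, fan, max_strength=2.0)
print(activation)
```

Here only the goal buffer contributes, because the single value in the imaginal buffer is not stored in the chunk.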
To understand this a little better, think of the typical scenario in which spreading activations, i.e., source activations scaled by associative strengths, are used. The values j we typically consider are values stored in the slots of the imaginal or the goal chunk. These buffers drive the cognitive process, so they provide a crucial part of the cognitive context in which we might want to retrieve items from memory.
When we have a goal or an imaginal chunk, we associatively bring to salience, i.e., spread activation to, chunks in declarative memory that are associated with the current imaginal or goal chunk, since these declarative memory chunks might be needed. We operationalize this 'association' between a chunk in the cognitive context and a chunk i in declarative memory in terms of the chunks being content identical (consisting of the same set of slot-value pairs) or sharing some value j in some of their slots.
This essentially results in increasing the activation of those declarative memory chunks that are related to the current cognitive context, i.e., ultimately, that are potentially relevant to the current stage of the cognitive process we are involved in. The associative strength $S_{ji}$ can thus be read as a measure of how much attending to the value j in the cognitive context raises the probability that chunk i is relevant.
One intuitive way to think about the activation of chunks in declarative memory and the additive relation between base activation and spreading activation is to imagine that declarative memory is a sea of darkness with small rafts, i.e., chunks, floating everywhere on it. Each raft has a small light, and the brightness of that light indicates its total activation: the brighter that light is, the easier the raft is to find and grab, that is, the more accurately and quickly we can retrieve it.
The light on each raft is powered by two power sources. One of them is a rechargeable battery stored on the raft itself (well, it's more like a capacitor, but let's ignore this). This reflects base activation, i.e., the history of previous usages of a chunk. Every time we use a chunk (retrieve a raft), we plug its 'local battery' in for a quick charge. Immediately after that, the battery will have more power, so the light will be brighter.
The second source of power that can increase the brightness of the light on a raft is the current cognitive context, specifically the values held in the buffers. If these values are also stored on some of the rafts in declarative memory (that is, they are the values of some of the features of those chunks), they can act as wires delivering extra power to the lights on the rafts.
Let's focus on a specific chunk in some buffer in our cognitive context. Each value j in that chunk has a set amount of battery power (these are the source activations, i.e., the $W_j$ values), and that power gets distributed to all the rafts in declarative memory that also store that value. This immediately predicts that the more rafts a value in the cognitive context is connected with (in ACT-R parlance, the higher the 'fan' of a value), the less power it will transmit to each individual raft. This 'fan effect' is discussed in detail in Chap. 8.
The amount of power/activation that 'spreads' from the goal/imaginal buffer (or any buffer in the cognitive context that we decide to spread activation from) depends not only on the 'battery power' $W_j$ of each value j, but also on the specific 'wires' connecting the buffer and the rafts/chunks in declarative memory that share that value. Different wires have different 'resistance' characteristics $S_{ji}$, and the extra power boost $W_j$ is modulated by the 'resistance'/strength of the connection.
Let us go through an example. Suppose we have the word car in the imaginal buffer and our declarative memory consists of two chunks x and y that are singular nouns (say, book and pen) and one chunk z that is a plural noun (say, books). The singular nouns x, y have two values in common with the car chunk in the imaginal buffer, namely the singular number and the noun category. The plural noun z has only one value in common with the car chunk, namely, the noun category.
The activation of the plural noun z is calculated in (22) below. Recall that j ranges over the values of the car chunk in the imaginal buffer: j = 1 for the form slot, j = 2 for the meaning slot, j = 3 for the syntactic category slot and j = 4 for the number morphology slot. Since there are 4 total slots in the imaginal buffer chunk, the source activations are all set to $\frac{1}{4}$ (see (12) above, with W = 1, n = 4). Turning to association strengths, we note that only the value in the syntactic category slot (i.e., noun/N) spreads activation. This means that the association strengths for all the other values are zero, i.e., $S_{1z} = S_{2z} = S_{4z} = 0$. The fan of the value in the syntactic category slot, namely $\mathrm{fan}_3$, is 4, since this value is N and there are four nouns total in declarative memory: x, y, z and, we assume, also the car chunk currently in the imaginal buffer.
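The calculation, therefore, proceeds as follows:
(22) $A_z = B_z + \sum_{j=1}^{4} W_j S_{jz} = B_z + \frac{1}{4} \cdot 0 + \frac{1}{4} \cdot 0 + \frac{1}{4} \cdot (S - \log 4) + \frac{1}{4} \cdot 0 = B_z + \frac{1}{4} \cdot (S - \log 4)$
In contrast, the activation of the singular noun x proceeds as shown in (23) below. This time, the singular receives spreading activation both from the syntactic category value (N) and from the number specification (sg).
(23) $A_x = B_x + \sum_{j=1}^{4} W_j S_{jx} = B_x + \frac{1}{4} \cdot 0 + \frac{1}{4} \cdot 0 + \frac{1}{4} \cdot (S - \log 4) + \frac{1}{4} \cdot (S - \log 3) = B_x + \frac{1}{4} \cdot (S - \log 4) + \frac{1}{4} \cdot (S - \log 3)$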
The activation spreading from the number value is higher than the one spreading from the syntactic category value because there are 4 nouns total ($\mathrm{fan}_3 = 4$), but only 3 of them are singular ($\mathrm{fan}_4 = 3$). This makes intuitive sense: values that appear in only a handful of chunks are more predictive of these chunks and should boost their activation more than values that are more frequent and, therefore, less discriminatory.
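The car example can be reproduced numerically with a small sketch. The dictionary `dm` below is a hypothetical stand-in for declarative memory (it is not a pyactr object), the slot names are made up, and the value S = 2 is an arbitrary assumption chosen only so that S exceeds the log of every fan.

```python
import math

# Hypothetical declarative memory with the four nouns from the example.
dm = {
    "x": {"form": "book", "meaning": "BOOK", "cat": "N", "num": "sg"},
    "y": {"form": "pen", "meaning": "PEN", "cat": "N", "num": "sg"},
    "z": {"form": "books", "meaning": "BOOK", "cat": "N", "num": "pl"},
    "car": {"form": "car", "meaning": "CAR", "cat": "N", "num": "sg"},
}

def fan(value):
    """Number of chunks in dm that have this value in one of their slots."""
    return sum(value in chunk.values() for chunk in dm.values())

def spreading(chunk_name, context, max_strength):
    """Sum of W_j * S_ji over the values j of the context chunk, with W_j = 1/n."""
    w = 1.0 / len(context)
    target = dm[chunk_name].values()
    return sum(
        w * (max_strength - math.log(fan(v)))
        for v in context.values()
        if v in target  # only shared values spread activation
    )

S = 2.0  # maximum associative strength (assumed value)
context = dm["car"]  # the chunk in the imaginal buffer

spread_z = spreading("z", context, S)  # (1/4)(S - log 4)
spread_x = spreading("x", context, S)  # (1/4)(S - log 4) + (1/4)(S - log 3)
print(fan("N"), fan("sg"), spread_z, spread_x)
```

The singular noun x receives more spreading activation than the plural noun z, and the extra amount comes entirely from the number value, whose fan is smaller.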

Activation, Retrieval Probability and Retrieval Latency
Now that we have a formal model of activation with its two components, namely:
• base activation, which encodes the effects of prior use on memory, and
• spreading activation, which encodes contextual effects,
we can turn to how we can predict human performance based on activation.[5] Recall that the two behavioral measures we are trying to predict are (i) the probability of selecting a particular response, specifically of retrieving a chunk from memory, and (ii) the latency of selecting a response, specifically the time taken by the retrieval process. The relevant equations are provided in (24) and (25) below.
(24) Probability of retrieval equation: $P_i = \frac{1}{1 + e^{-\frac{A_i - \tau}{s}}}$, where:
• s is the noise parameter and is typically set to about 0.4
• τ is the retrieval threshold, i.e., the activation at which we have a chance-level (0.5) retrieval probability[6]
(25) Latency of retrieval equation: $T_i = F e^{-f A_i}$, where:
• F is the latency factor (basically, an intercept on log time scale)
• f is the latency exponent (a slope on log time scale)[7]
In addition to the activation $A_i$ of chunk i, these equations have a few parameters. The threshold parameter τ in the probability of retrieval equation (24) and the latency factor F in the latency of retrieval equation (25) vary from model to model, but there is a general relationship between them, provided in (26).
(26) $F \approx 0.35 e^{\tau}$, i.e., the retrieval latency at threshold is approximately 0.35 s/350 ms[8]
To understand the two equations in (24) and (25) a bit better, recall our discussion of base activation, and the important remark that activation $A_i$ is really the log of the need-odds of a particular chunk i. That is, $A_i$ is the logit (log odds) transformation of the need-probability of chunk i.
The need-probability of chunk i is just the probability that chunk i is the one needed to satisfy the current memory retrieval request. Talking about need-probability is interchangeable with talking about need-odds, or activation, which is just need-logits.
[5] Note that these two components of activation have roles similar to the model M and the variable assignment g parameters of the interpretation function $[\![\cdot]\!]^{M,g}$ in formal semantics. The model M is parallel to base activation as it encodes more permanent, context-invariant information, while the variable assignment g is parallel to spreading activation as it encodes contextually-sensitive information of a more transient nature.
[6] When $A_i = \tau$, we have: $P_i = \frac{1}{1 + e^{0}} = \frac{1}{2}$.
[7] On log time scale, we have: $\log T_i = \log(F e^{-f A_i}) = \log F - f A_i$.
[8] When $A_i = \tau$ and f is set to its default value of 1, we have: $T_i = F e^{-\tau} \approx 0.35 e^{\tau} e^{-\tau} = 0.35$ s.
In (27) below, we show how we can compute need-odds and activation (need-logits) if we are given the need-probability $p_i$ of chunk i. In (28), we show how we can compute need-odds and need-probability if we are given the activation $A_i$ (i.e., we are given the need-logits $A_i$) for chunk i.
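(27) Given the need-probability $p_i$ for chunk i, we have:
• need-odds: $\mathrm{odds}_i = \frac{p_i}{1 - p_i}$
• activation (need-logits): $A_i = \log \mathrm{odds}_i = \log \frac{p_i}{1 - p_i}$
(28) Given the activation / need-logits $A_i$ for chunk i, we have:
• need-odds: $\mathrm{odds}_i = e^{A_i}$
• need-probability: $p_i = \frac{\mathrm{odds}_i}{1 + \mathrm{odds}_i} = \frac{e^{A_i}}{1 + e^{A_i}} = \frac{1}{1 + e^{-A_i}}$
The very last equation in (28) shows how to compute need-probability $p_i$ when we are given activation (need-logits) $A_i$. But this is exactly the equation we used to obtain probability of retrieval in (24) above. The only difference is that, in (24), we add two parameters, the threshold τ and the noise s, which enable us to make the model more realistic/flexible so that we can fit different kinds of data well.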
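The conversions between need-probability, need-odds and activation (need-logits) are easy to verify with a short sketch. The function names and the probability 0.8 are chosen here purely for illustration.

```python
import math

def prob_to_odds(p):
    """Need-odds from need-probability."""
    return p / (1 - p)

def prob_to_activation(p):
    """Activation (need-logits) from need-probability."""
    return math.log(p / (1 - p))

def activation_to_odds(a):
    """Need-odds from activation (need-logits)."""
    return math.exp(a)

def activation_to_prob(a):
    """Need-probability from activation (need-logits)."""
    return 1 / (1 + math.exp(-a))

p = 0.8  # hypothetical need-probability
odds = prob_to_odds(p)
activation = prob_to_activation(p)

# Going back from activation recovers the odds and the probability.
print(odds, activation, activation_to_odds(activation), activation_to_prob(activation))
```

An activation of 0 corresponds to even odds, i.e., a need-probability of 0.5.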
Thus, the probability of retrieval equation immediately follows from the fact that we take activation to encode the log-odds (logit) transformation of need-probability.
The latency of retrieval equation in (25) is equally intuitive. Ignoring the parameters F and f, i.e., setting them to 1 (incidentally, 1 is the default value for f), we have that:
$T_i = e^{-A_i} = \frac{1}{e^{A_i}} = \frac{1}{\mathrm{odds}_i}$
That is, the retrieval latency for chunk i is inversely proportional to the need-odds of i. The higher the need-odds for chunk i, the less time it will take to retrieve it. The lower the need-odds for chunk i, the more time it will take to retrieve it.
The need odds for a chunk i are high if the chunk has been used a lot and/or recently (this comes from base activation), and the chunk is highly relevant given the current cognitive context (this comes from spreading activation). It therefore makes sense that such a chunk would be easy/fast to retrieve: it was retrieved a lot and/or recently, and it is strongly associated with what the cognitive process is currently attending to.
Finally, modeling retrieval times/latencies as inversely proportional to need-odds makes mathematical sense. While probabilities take values in the interval [0, 1], odds take values in the interval [0, ∞), i.e., in the set of positive real numbers (and 0), which is the same interval in which reaction times/latencies also take values.
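The two retrieval equations can be written as small Python functions to make these properties concrete. This is a minimal sketch, not the pyactr implementation: the function names are made up, and the threshold, noise and activation values are illustrative assumptions.

```python
import math

def retrieval_probability(activation, threshold, noise):
    """Equation (24): logistic function of activation relative to the threshold."""
    return 1.0 / (1.0 + math.exp(-(activation - threshold) / noise))

def retrieval_latency(activation, latency_factor=1.0, latency_exponent=1.0):
    """Equation (25): T = F * exp(-f * A), in seconds."""
    return latency_factor * math.exp(-latency_exponent * activation)

threshold, noise = 0.3, 0.4  # illustrative parameter values (assumed)

at_threshold = retrieval_probability(threshold, threshold, noise)  # chance level
above = retrieval_probability(1.0, threshold, noise)
below = retrieval_probability(-0.5, threshold, noise)

# With F = f = 1, latency is the reciprocal of the need-odds e^A.
odds = math.exp(1.2)
latency = retrieval_latency(1.2)

# Doubling the need-odds (adding log 2 to the activation) halves the latency.
latency_doubled_odds = retrieval_latency(1.2 + math.log(2))

print(at_threshold, above, below, latency, latency_doubled_odds)
```

Activations above the threshold give retrieval probabilities above 0.5 and activations below it give probabilities below 0.5, while latency falls as the need-odds rise.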
Let's now work through an example. We will plot the probability and latency of retrieval for the same hypothetical case as the one in Fig. 6.5 above, assuming the activation of the carLexeme chunk under consideration is just its base-level activation. We set the parameters to (more or less) default values; in particular, the threshold τ is 0.3 and the latency factor F is 0.47 s.
Note that according to the equation in (26), $F \approx 0.35 e^{0.3} \approx 0.35 \times 1.35 \approx 0.47$ s, which is what we set our F value to. Of course, we pulled the values for these parameters out of thin air for this particular example. In general, we want to use statistical inference (e.g., Bayesian methods with pymc3) to estimate these parameters from the data, and we will do exactly this when we model lexical decision tasks in the next chapter (Chap. 7). But we set the parameters to these (more or less default) values for the current example.
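A parameter setup along these lines might look as follows. The threshold 0.3 and the resulting latency factor come from the text; the noise value 0.4 and the latency exponent 1 are assumptions based on the typical and default values mentioned earlier, and the variable names are hypothetical.

```python
import math

threshold = 0.3         # tau, as in the text
noise = 0.4             # s, assumed: the typical value mentioned earlier
latency_exponent = 1.0  # f, assumed: the default value mentioned earlier

# Latency factor F derived from the threshold via (26), rounded to two decimals.
latency_factor = round(0.35 * math.exp(threshold), 2)

# Retrieval latency in ms when activation is 0 and when it is at threshold.
latency_at_zero_ms = round(latency_factor * math.exp(-latency_exponent * 0.0) * 1000)
latency_at_threshold_ms = round(
    latency_factor * math.exp(-latency_exponent * threshold) * 1000
)

print(latency_factor, latency_at_zero_ms, latency_at_threshold_ms)
```

With these values, the latency at threshold comes out slightly under 350 ms because F was rounded to 0.47 before being used.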
In [py23], we use the previously computed vector base_act of base activations to compute the probabilities of retrieval and the latencies of retrieval for the 10 s we are interested in. We then plot these three curves (activation, retrieval probability, retrieval latency). The code for the plot is provided in [py24], and the plots are provided in Fig. 6.6.
>>> # latency of retrieval in ms
>>> (latency_retrieval * 1000).astype("int")
array([470, 14, 21, ..., 251, 251, 251])
In Fig. 6.6, we plot the threshold τ as a dashed black line in every plot:
• in the top panel, we plot its raw value (0.3) on the activation (logit) scale;
• in the middle panel, the threshold is at 50% probability, as intended given that activation at threshold level should yield even odds of retrieval (1/1; chance level);
• in the bottom panel, the threshold is at about 350 ms, which is the time of retrieval when activation is at threshold level; this is actually determined by the constant 0.35 (350 ms) we used in (26).
Figure 6.6 shows that, as activation increases above threshold after the fourth and fifth presentation/rehearsal of the carLexeme chunk (see the top panel in the figure), retrieval accuracy increases above chance (see the plot in the middle panel) and the retrieval becomes faster and faster (retrieval time decreases below about 350 ms; see plot in bottom panel).
Remarkably, the ACT-R account of declarative memory unifies two separate measures (retrieval accuracy and retrieval latency) under one quantity: activation. Furthermore, activation can be independently derived if we know, or can reasonably conjecture:
• the pattern of previous use for a chunk, which will give us the base activation component;
• the cognitive context, which will give us the spreading activation component.
In the following chapter, we will apply this model to lexical access and we will evaluate how well the ACT-R model of memory accounts for both accuracy and latency in lexical decision tasks, as well as latencies in self-paced reading experiments.

Appendix
All the code discussed in this chapter is available on GitHub as part of the repository https://github.com/abrsvn/pyactr-book. If you want to examine it and run it, install pyactr (see Chap. 1), download the files and run them the same way as any other Python script.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.