Competence-Performance Models for Lexical Access and Syntactic Parsing

In Chap. 4, we introduced a simple lexical decision task and a simple left-corner parser. The models we introduced in that chapter might be sufficient with respect to the way they simulate interactions with the environment, but they are too simplistic in their assumptions about memory, since memory retrievals are not dependent on any parameters of the retrieved word. In this chapter, we will improve on both models by incorporating the ACT-R model of declarative memory we just introduced in the previous chapter.

Murray and Forster (2004) collected responses and response times in a lexical decision task using words from 16 frequency bands, summarized in Table 7.1. Using the RT latencies from Murray and Forster (2004), let us build a log-frequency model and evaluate the discrepancies between the predictions of the model and the data. We first store the data in two variables, freq (mean frequency) and rt (reaction time/latency, measured in s) (Fig. 7.1).

We can now plot the estimates of the log-frequency model (see [py27], where we extract the posterior samples with mu = trace["mu"]). The plots show that the log-frequency model gets the middle values right, but it tends to underestimate the amount of time needed to access words in the extreme frequency bands: both low frequency (associated with high RTs) and high frequency (associated with low RTs). Murray and Forster (2004) take this as an argument for a specific information retrieval mechanism, the Rank Hypothesis (see Forster 1976, 1992), but, as they note, other models of retrieval could similarly improve data fit. One such model treats frequency effects as practiced memory retrieval, which is commonly assumed to be a power function of time in the same way that memory performance is (Newell and Rosenbloom 1981; Anderson 1982; Logan 1990).
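To make the log-frequency model concrete, here is a minimal ordinary-least-squares sketch. The freq and rt values below are illustrative placeholders, not the actual Murray and Forster (2004) means:

```python
import numpy as np

# Illustrative placeholder data (NOT the actual Murray & Forster means):
# mean frequency per million for a few bands, and mean RTs in seconds.
freq = np.array([242.0, 40.5, 19.0, 10.0, 5.0, 1.0])
rt = np.array([0.54, 0.57, 0.59, 0.61, 0.63, 0.67])

# RT modeled as a linear function of log frequency: rt ~ a + b*log(freq).
b, a = np.polyfit(np.log(freq), rt, deg=1)
predicted = a + b * np.log(freq)
residuals = rt - predicted

print("slope:", b)   # negative: higher frequency -> faster RTs
print("max abs residual:", np.max(np.abs(residuals)))
```

With data that is close to log-linear, as above, the residuals are small in the middle and at the extremes; with the actual Murray and Forster means, the systematic misfit at the extreme bands emerges.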

The Simplest ACT-R Model of Lexical Decision
Practiced memory retrieval in ACT-R crucially relies on the power-function model of declarative memory. The power function is used to compute (base) activation based on the number of practice trials/'rehearsals' of a word (see (5) in Chap. 6), which in turn is used to compute latency and accuracy for retrieval processes (see (25) and (24) in Chap. 6).
For any word, the number of rehearsals that contributes to its base activation is crucially determined by its frequency. There are other factors that determine the number and timing of rehearsals, but we will assume a simple model here: the number of rehearsals is exclusively determined by frequency. We will also assume, for simplicity, that presentations of a word are linearly spaced in time.
To be specific, let's consider a 15-year-old speaker. How can we estimate the time points at which a word was used in language interactions that the speaker participated in? Once we know these time points, we can compute the base activation for that word, which in turn will make predictions about retrieval latency and retrieval accuracy that we can check against the Murray and Forster (2004) data in Table 7.1.
We know the lifetime of the speaker (15 years), so if we know the total number of words an average 15-year old speaker has been exposed to, we can easily calculate how many times a particular word was used on average, based on its frequency. Once we find out how many times a word with a specific frequency was presented to our speaker during their lifetime, we can then present the word at linearly spaced intervals during the life span of the speaker (we use linearly spaced intervals for simplicity).
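The arithmetic can be sketched as follows. The names presentation_times, TOTAL_WORDS_MILLIONS etc. are hypothetical, introduced for illustration only; the book's own code is organized differently (it builds a schedule for all 16 bands at once):

```python
# Hypothetical helper for illustration: a word with frequency f per million,
# for a speaker exposed to 112.5 million words total, was encountered about
# f * 112.5 times, at linearly spaced intervals over the life span.
TOTAL_WORDS_MILLIONS = 112.5             # total exposure of the speaker
LIFETIME_SECONDS = 15 * 365 * 24 * 3600  # roughly 473 million seconds

def presentation_times(freq_per_million):
    """Linearly spaced presentation times (in s) over the life span."""
    n_presentations = int(freq_per_million * TOTAL_WORDS_MILLIONS)
    interval = LIFETIME_SECONDS / n_presentations
    return [interval * (i + 1) for i in range(n_presentations)]

times = presentation_times(100)   # a word occurring 100 times per million
print(len(times))                 # 11250 presentations over 15 years
```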
A good approximation of the number of words a speaker is exposed to per year can be found in Hart and Risley (1995). Based on recordings of 42 families, Hart and Risley estimate that children comprehend between 10 million and 35 million words a year, depending to a large extent on the social class of the family. This amount increases linearly with age.
According to the Hart and Risley (1995) study, a 15-year-old has been exposed to anywhere between 50 and 175 million words in total. For simplicity, let's use the mean of 112.5 million words as the total number of words a 15-year-old speaker has been exposed to. This is a very conservative estimate, because we ignore production as well as the linguistic exposure associated with mass media.
The roughness of our estimate is not an issue for our purposes since we are interested in the relative effect of frequency, not its absolute effect. We do not want to predict how much time the retrieval of a word from one frequency band requires, but how much time a word requires compared to a word from another frequency band.
In [py28] below, we first compute the number of seconds in a year, and then the total number of seconds in the life span of the 15-year-old speaker we're modeling (lines 1-2). The function time_freq defined on lines 3-9 takes the mean frequency vector freq in [py25] above and generates a schedule of linearly spaced word rehearsals/presentations for words from the 16 frequency bands studied by Murray and Forster (2004).

On line 4 in [py28], we initialize our rehearsal schedule in the matrix rehearsals. This matrix has as many rows as the number of rehearsals for the most frequent word band: np.max(freq) gives us the maximum frequency in words per million, which we multiply by 112.5 million words (the total number of words our 15-year old speaker has been exposed to). The rehearsals matrix has 16 columns: as many columns as the frequency bands we are interested in.
The for loop on lines 6-9 in [py28] iterates over the 16 frequency bands and, for each frequency band, it does the following. On line 7, we identify the total number of rehearsals for frequency band i throughout the life span of the speaker (freq[i]*112.5) and generate a vector with as many positions as there are rehearsals. On line 8, at each position in that vector, we store the time of a rehearsal in seconds. The result is a vector temp of linearly spaced rehearsal times that we store in our full rehearsals matrix (line 9).
These rehearsal times can also be viewed as time periods since rehearsals if we reverse the vector (recall that we need time periods since rehearsals when we compute base activation). But we don't need to actually reverse the vector since we will have to sum the time periods to compute activation, and summation is commutative.
Finally, we return the full rehearsals matrix on line 10, in transposed form because of the way we will use it to compute base activation (see below).
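The time_freq function just described can be approximated in plain numpy along the following lines. This is a sketch under the assumptions above; the actual [py28] code may differ in detail:

```python
import numpy as np

SECS_IN_YEAR = 365 * 24 * 3600
SECS_IN_TIME = 15 * SECS_IN_YEAR        # life span of the 15-year-old, in s

def time_freq(freq):
    """Rehearsal schedule: one column per frequency band (before transpose).
    Bands with fewer rehearsals leave trailing 0 cells in their column."""
    max_rehearsals = int(np.max(freq) * 112.5)   # most frequent band
    rehearsals = np.zeros((max_rehearsals, len(freq)))
    for i in range(len(freq)):
        n = int(freq[i] * 112.5)                 # rehearsals for band i
        temp = np.arange(n)                      # one slot per rehearsal
        temp = temp * (SECS_IN_TIME / n)         # linearly spaced times in s
        rehearsals[:n, i] = temp
    return rehearsals.T                          # transposed for activation

freq = np.array([242.0, 40.5, 1.0])              # illustrative bands
schedule = time_freq(freq)
print(schedule.shape)                            # (3, 27225)
```

Note that the padding cells (and the first rehearsal) are 0; this is why the time matrix contains zeros, a point that matters when we compute activations below.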
With this function in hand, we compute a rehearsal schedule for all 16 frequency bands on line 3 of [py29] below. We store the matrix in a theano variable called time. The theano library, which we import on lines 1-2, enables us to do computations with multi-dimensional arrays efficiently, and provides the computational backbone for the Bayesian modeling library pymc3. We need to access it directly to be able to compute activations from the rehearsal schedule stored in the time variable.

Computing activations from the rehearsal schedule requires us to define a separate function compute_activation-see lines 13-17 in [py30]. This function assumes that the matrix scaled_time has been computed (line 12): to compute this matrix, we take our rehearsal schedule stored in the time matrix (time periods since rehearsals for all frequency bands) and raise these time periods to the -decay power. The result is a matrix that stores scaled times, i.e., the base activation boost contributed by each individual word rehearsal for all frequency bands.
Some of the values in the time matrix were 0. In the scaled_time matrix, they become infinity. When we compute final activations, we want to discard all these infinity values, which is what the compute_activation function does. It takes the 16 vectors of scaled times (for the 16 frequency bands) as inputs, one at a time. It first identifies all the infinity values (line 14), then extracts the subvector of the input vector that contains only non-infinity values (line 15). With this subvector in hand, we can sum the scaled times and take the log of the result to obtain our final activation value (line 16), which the function returns.
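Outside of theano, the same computation can be sketched in plain numpy. The decay value and the sample rehearsal times below are illustrative assumptions:

```python
import numpy as np

decay = 0.5   # illustrative decay value (the model estimates this parameter)

def compute_activation(scaled_time_vector):
    """Base activation: log of the summed t^(-decay) boosts, with the
    infinity entries (which come from 0s in the time matrix) discarded."""
    finite = scaled_time_vector[~np.isinf(scaled_time_vector)]
    return np.log(np.sum(finite))

# Time periods since rehearsal (in s) for one word; 0.0 is, e.g., a padding cell.
time_band = np.array([0.0, 100.0, 10000.0, 250000.0])
with np.errstate(divide="ignore"):
    scaled_time = time_band ** (-decay)   # 0.0 ** -0.5 -> inf
activation = compute_activation(scaled_time)
print(round(activation, 3))   # log(0.1 + 0.01 + 0.002) = -2.189
```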
With the function compute_activation in hand, we need to iterate over the 16 vectors of scaled times for our 16 frequency bands and compute the 16 resulting activations by applying the compute_activation function to each of these vectors. However, theano-based programming is purely functional, which means there are no for loops. We therefore use the theano.scan method on lines 18-19 of [py30] to iteratively apply the compute_activation function to each of the 16 vectors in the scaled_time matrix.

The likelihoods of the lexical decision model (lines 21-24 and 26-29 in [py30]) are direct implementations of the retrieval latency and retrieval probability equations in (25) and (24). We omit the latency exponent in the latency likelihood (see mu_rt on lines 21-22) because we assume it is set to its default value of 1. We will see that this value is not appropriate, so we will have to move to a model in which the latency exponent is also fully modeled.
Note that the dispersions around the mean RTs and mean probabilities are very small: we set the standard deviations on lines 23 and 28 to 0.01. The reason is that our observed values for both RTs and accuracy are not raw values; they are already means, namely the empirical means for the 16 frequency bands reported in Murray and Forster (2004). As such, we assume these means are very precise reflections of the underlying parameter values.
We could model these standard deviations explicitly, but we decided not to, since we have only 32 observations here (16 for RTs, 16 for accuracies) and we are already trying to estimate a fairly large number of parameters: decay, intercept, latency_factor, noise and threshold. Low information priors for these parameters are specified on lines 4, 6-7 and 9-10 in [py30].

The only new parameter in this model relative to the ACT-R probability and latency equations in (24) and (25) is the intercept parameter we use in the latency likelihood (line 21 in [py30]). The intercept is supposed to absorb the time in the lexical decision task associated with operations other than memory retrieval: focusing visual attention, motor planning etc.
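The two likelihood means just described can be sketched as plain functions. The parameter values below are illustrative assumptions, not the estimated posteriors:

```python
import numpy as np

def mean_rt(activation, intercept, latency_factor):
    """Mean retrieval latency, eq. (25), with the latency exponent fixed
    at 1, plus an intercept for non-retrieval operations (visual
    attention, motor planning etc.)."""
    return intercept + latency_factor * np.exp(-activation)

def retrieval_prob(activation, threshold, noise):
    """Probability of successful retrieval, eq. (24): logistic in activation."""
    return 1.0 / (1.0 + np.exp(-(activation - threshold) / noise))

# Illustrative parameter values (assumptions, not the estimated posteriors):
act = np.array([0.5, 1.5, 2.5])                  # activations for three bands
rts = mean_rt(act, intercept=0.4, latency_factor=0.2)
probs = retrieval_prob(act, threshold=0.3, noise=0.4)
print(rts)    # higher activation -> faster retrieval
print(probs)  # higher activation -> more reliable retrieval
```

The crucial point is that the same activation vector feeds both functions, which is what theoretically connects the latency and accuracy predictions.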
With the model fully specified, we sample from the posterior distributions of the parameters. Once we obtain the samples, we are ready to plot them to evaluate how well the model fits the data (we take the first 2000 samples to be the burn-in and drop them; see, for example, Kruschke 2011 for more discussion of burn-in, thinning etc.). The code for the plots is provided in [py31] and the resulting plots are shown in Fig. 7.2. (A note on theano's functional style: purely functional programming languages, of which Haskell is probably the most well-known example nowadays, should be easy to understand for formal semanticists, given their familiarity with λ-calculus. And a note on our priors: by low information priors, we mean priors that do not assign high probability to particular narrow regions in the parameter space. If some regions are less probable than others, that is because their lower probability can be determined from considerations that are independent of the experimental data we are trying to model.)

An important thing to note about the ACT-R lexical decision model is that its predictions about latencies and probabilities are theoretically connected: base activation is an essential ingredient in predicting both of them. Thus, we are not proceeding purely in an inductive fashion here by looking at the RT data on one hand and the accuracy data on the other, and then drawing theoretical conclusions from the data in an informal way, i.e., in a way that is suggestive and possibly productive, but ultimately vague and incapable of making precise predictions.
Instead, our mathematical model takes a core theoretical concept (base activation as a function of word frequency) and connects it in a mathematically explicit way to latency and accuracy. Furthermore, our computational model directly implements the mathematical model, and enables us to fit it to the experimentally obtained latency and accuracy data.
In addition to connecting distinct kinds of observable behavior via the same unobservable theoretical construct(s), a hallmark of a good scientific theory is that it is falsifiable. And the plots in Fig. 7.2 show that an ACT-R model of lexical decision that sets the latency exponent to its default value of 1 (in effect omitting it) is empirically inadequate.
The bottom plot in Fig. 7.2 shows that our lexical decision model does a good job of modeling retrieval probabilities. The predicted probabilities are very close to the observed ones, and they are precisely estimated (there are very few visible error bars protruding out of the plotted points).
In contrast, latencies are poorly modeled, as the top plot in Fig. 7.2 shows. The predicted RTs are not very close to the observed RTs, and our model is very confident in its incorrect predictions (error bars are barely visible for most predicted RTs).

The Second ACT-R Model of Lexical Decision: Adding the Latency Exponent
Our ACT-R lexical decision model without a latency exponent does not provide a satisfactory fit to the Murray and Forster (2004) latency data. In fact, the log-frequency model is both simpler (although less theoretically motivated) and empirically more adequate.
We therefore move to a model that is minimally enriched by explicitly modeling the latency exponent. The usefulness of the latency exponent in modeling reaction time data has been independently noted in the recent literature-see, for example, West et al. (2010).
The code for the model is provided in [py32] below. The only additions are (i) the half-normal prior for the latency exponent on line 10 and (ii) its corresponding addition to the latency likelihood on line 25. Note that we use a different method to sample the posterior for this model.

We see that the lexical decision model that explicitly models the latency exponent fits both latencies and probabilities very well. The latencies, in particular, are modeled better than in both the lexical decision model without a latency exponent and the log-frequency model, which did not have a very good fit to the data at the extreme frequency bands (low or high frequencies).
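The only change to the latency mean is the extra exponent. Schematically, with illustrative parameter values (not posterior estimates):

```python
import numpy as np

def mean_rt(activation, intercept, lf, le):
    """Mean retrieval latency with an explicit latency exponent le;
    le = 1 recovers the previous model."""
    return intercept + lf * np.exp(-le * activation)

act = np.array([0.0, 2.0, 4.0])   # illustrative activations
rt_le1 = mean_rt(act, intercept=0.4, lf=0.2, le=1.0)
rt_le_half = mean_rt(act, intercept=0.4, lf=0.2, le=0.5)

# A latency exponent below 1 compresses the predicted RT differences
# between low- and high-activation (i.e., low- and high-frequency) bands:
print(rt_le1[0] - rt_le1[-1], rt_le_half[0] - rt_le_half[-1])
```

This extra degree of freedom is what lets the model modulate how steeply latency decreases with activation, independently of the latency factor.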
We list below the estimated posterior mean and 95% credible interval (CRI) for the latency exponent: the posterior mean value and the CRI are pretty far away from the default value of 1 we assumed in the previous model.

Bayes+ACT-R: Quantitative Comparison for Qualitative Theories
In this subsection, we show how we can embed ACT-R models implemented in pyactr into Bayesian models implemented in pymc3. This embedding opens the way towards doing quantitative comparison based on experimental data for both subsymbolic and symbolic theories. That is, this Bayes+ACT-R combination enables us to do quantitative theory comparison. We will be able to take our symbolic theories that make claims about competence, for example, Discourse Representation Theory (DRT; Kamp 1981; Kamp and Reyle 1993), as we will see in Chaps. 8 and 9, embed them in larger performance theories that have a detailed processing component, and then compare different theories quantitatively based on behavioral data of the kind commonly collected in psycholinguistics.
In this section, we introduce the basics of our Bayes+ACT-R framework by considering and comparing three models for lexical decision tasks:

i. the first model builds on the simple ACT-R/pyactr lexical decision model introduced in Chap. 4; we will show how that model can be used as part of the likelihood function of a larger Bayesian model;
ii. the second model is cognitively more realistic than the first one: it makes use of the imaginal buffer as an intermediary between the visual module and the declarative memory module; we set the delay for imaginal-buffer encoding to its default value of 200 ms; once again, this ACT-R/pyactr model will provide part of the likelihood component of a larger Bayesian model;
iii. the third and final model is also cognitively more realistic in that it makes use of the imaginal buffer, just like the second model, but we set the encoding delay of the imaginal buffer to the non-default value of 0 ms; this is the imaginal-buffer delay we needed when we implemented our left-corner parser in Chap. 4; just as before, the ACT-R/pyactr model is embedded in a larger Bayesian model, for which it provides part of the likelihood function.
The first model (i) without the imaginal buffer and the other two models (ii) and (iii) with imaginal buffers differ with respect to a symbolic (qualitative) component. In this particular case, the symbolic component (imaginal-buffer usage or lack thereof) belongs to the processing part of the symbolic (non-quantitative) theory, but theoretical differences at the 'core' competence level can (and will) be similarly compared.
In contrast, the last two models (ii) and (iii) differ with respect to specific conjectures about a subsymbolic (quantitative) component, namely the average time to encode chunks in the imaginal buffer.
Our Bayes+ACT-R framework enables us to compare all these models. This comparison is not only qualitative. The models can be quantitatively evaluated and compared relative to specific experimental data. Since pyactr enables us to embed ACT-R models as the likelihood component of larger Bayesian models built with pymc3, we can do statistical inference over the subsymbolic parameters of our ACT-R model in the standard Bayesian way, rather than trying different values one at a time and manually identifying the best fitting ones.
We'll therefore be able to identify the standard measures of central tendency (posterior means, but also medians or modes if needed), as well as compute credible intervals for every parameter of interest. The Bayesian framework will furthermore enable us to conduct unrestricted model comparison (using Bayes factors or WAIC, for example), unlike maximum likelihood methods-see the discussion at the end of this section and in Sect. 7.5.
Throughout this book, whenever we embed an ACT-R model in a Bayesian model, we turn off all the non-deterministic (stochastic) components of the ACT-R model other than the ones for which we are estimating parameters. This effectively turns the ACT-R model into a complex, but deterministic, function of the parameters, which we can straightforwardly incorporate as a component of the likelihood function of the Bayesian model.
For more realistic simulations, we would have to turn on various non-deterministic components of the ACT-R model (e.g., noise associated with visual module), in which case we would have to resort to Approximate Bayesian Computation (ABC; see Sisson et al. (2019) and references therein) to incorporate an approximation of the ACT-R induced likelihood into our Bayesian model. ABC is beyond the scope of this book, but it is a very promising direction for future research, and a central issue to be addressed as more linguistically sophisticated cognitive models are developed.

The Bayes+ACT-R Lexical Decision Model Without the Imaginal Buffer
The link to the full code for this model is provided in the appendix to this chaptersee Sect. 7.7.1. We will only discuss here the most important and novel aspects of this Bayes+ACT-R model combination. We first initialize the model under the variable lex_decision and declare its goal buffer to be g.
The initialization code is provided in [py36]. We then set up the data: see the FREQ, RT and ACCURACY variables in [py37] below (recall that pyactr measures time in s, not ms, so we divide the RTs by 1000). Finally, we generate the presentation times for the 16 word-frequency bands considered in Murray and Forster (2004): see FREQ_DICT and the theano variable time.
The data setup code, including the FREQ array of mean frequencies for the 16 bands, is provided in [py37]. We are now ready to build the procedural core of our model. The production rules are the same as the ones we introduced and discussed in Chap. 4, listed for ease of reference in [py38] below:

• the "attend word" rule takes a visual location encoded in the visual where buffer and issues a command to the visual what buffer to move attention to that visual location;
• the "retrieving" rule takes the visual value/content discovered at that visual location, which is a potential word form, and places a declarative memory request to retrieve a word with that form;
• finally, the "lexeme retrieved" and "no lexeme found" rules take care of the two possible outcomes of the memory retrieval request: if a word with that form is retrieved from memory ("lexeme retrieved"), a command is issued to the motor module to press the 'J' key; if no word is retrieved ("no lexeme found"), a command is issued to the motor module to press the 'F' key.
With the production rules in place, we can start preparing the way towards embedding the ACT-R model into a Bayesian model. The main idea is that we will use the ACT-R model as the likelihood component of the Bayesian model for lexical-decision latencies/RTs. Specifically, we will feed parameter values for the latency factor lf, latency exponent le and decay into the ACT-R model, run the model with these parameters for words from all 16 frequency bands, and collect the resulting RTs. The Bayesian model will then use these RTs to sample new values for the lf, le and decay parameters in proportion to how well the RTs generated by the ACT-R model agree with the experimentally collected RTs (and with the diffuse priors over these parameters).
The first function we need is run_stimulus(word) in [py39] below. This function takes a word from one of the 16 frequency bands as its argument and runs one instance of the ACT-R lexical decision model for that word. To do that, we first reset the model to its initial state: we flush buffers without moving their contents to declarative memory (lines 2-9 in [py39]), we set the word argument as the new stimulus (line 10), we initialize the goal buffer g with the "start" chunk (lines 11-13), and we initialize the lexical decision simulation (lines 14-19).
At this point, we run a while loop that steps through the simulation until a lexical decision is made by pressing the 'J' or 'F' key, at which point we record the time of the decision in the variable estimated_time (set to −1 if the word was not retrieved), exit the while loop and return the estimated RT (lines 20-28).
The second function run_lex_decision_task() in [py39] runs a full lexical decision task by calling the run_stimulus(word) function for words from all 16 frequency bands (lines 31-33). The function returns the vector of estimated lexical decision RTs for all these words (line 34).
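The control flow of run_stimulus and run_lex_decision_task can be sketched with a schematic stand-in for the pyactr simulation. The event strings and timings below are illustrative assumptions; the real functions step a pyactr simulation object and watch its event stream:

```python
def run_stimulus(events):
    """Scan simulation events until a key press is found; return its time.
    Each event is a (time, action) pair; -1 signals no decision was made."""
    for time, action in events:
        if action in ("KEY PRESSED: J", "KEY PRESSED: F"):
            return time
    return -1.0

def run_lex_decision_task(words, simulate):
    """Run one lexical decision per word and collect the estimated RTs."""
    return [run_stimulus(simulate(word)) for word in words]

# Toy 'simulation' (illustrative): every word triggers a 'J' press at 0.4 s.
def fake_events(word):
    return [(0.05, "ENCODED"), (0.4, "KEY PRESSED: J")]

print(run_lex_decision_task(["elephant", "cat"], fake_events))  # [0.4, 0.4]
```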

With the run_lex_decision_task() function in hand, we only need to interface the ACT-R model implemented in pyactr with a Bayesian model implemented in pymc3 (and theano). This is what the function actrmodel_latency in [py40] below does. This function runs the entire lexical decision task for specific values of the latency factor lf, latency exponent le and decay parameters provided by the Bayesian model (to be discussed below). The activation computed by theano with the same value of the decay argument is also passed as a separate argument activation_from_time to save a (significant) amount of computation time.
The actrmodel_latency function takes these four parameter values as arguments (line 3 in [py40]), initializes the lexical decision model with them (lines 4-9), runs the lexical decision task with these model parameters (line 10) and returns the resulting vector of RTs (line 11). The entire function is wrapped inside the theano-provided decorator @as_op (lines 1-2), which enables theano and pymc3 to use the actrmodel_latency function as if it were a native theano/pymc3 function. The only thing the @as_op decorator needs is data-type declarations for the arguments of the actrmodel_latency function (lf, le and decay are scalars, while activation_from_time is a vector; line 1) and for its value (which is a vector; line 2).

We are now ready to use the actrmodel_latency function as the likelihood function for latencies in a Bayesian model very similar to the ones we already discussed in this chapter. The model is specified in [py41] below. The prior for the decay parameter is uniform (line 3), the priors for the lexical-decision accuracy parameters noise and threshold are uniform and normal (lines 4-5), and the priors for the lexical-decision latency parameters lf and le are both half-normal (lines 6-7).
We then compute activation based on word frequency in the same way we did before (lines 8-15), after which we specify the likelihood function for lexical-decision latency (lines 16-19), which crucially uses the ACT-R model via the actrmodel_latency function (line 16), and the likelihood function for lexical-decision accuracy (lines 20-23). Note that the accuracy is computed independently of the latency, which simplifies the workings of the pyactr model (as we already indicated, we can assume that the pyactr model recalls all words successfully).

The Bayesian model is schematically represented in Fig. 7.4 (following the type of figures introduced in Kruschke 2011).
The plots in Fig. 7.5 show that the Bayes+ACT-R model without any imaginal-buffer involvement has a very good fit to both the latency and the accuracy data.

For reference, we provide the Gelman-Rubin convergence diagnostic (a.k.a. Rhat) for this model in [py42] below. As Gelman and Hill (2007, 352) note, Rhat values below 1.1 indicate that the chains have converged to the posterior distribution.

However, this model oversimplifies the process of encoding visually retrieved data. We assume that the visual value found at a particular visual location is immediately shuttled to the retrieval buffer to place a declarative memory request; see the productions "attend word" and "retrieving" in [py38] above.
This disregards the cognitively-motivated ACT-R assumption that transfers between the visual what buffer and the retrieval buffer are mediated by the goal or imaginal buffers. Cognition in ACT-R is goal-driven, so any important step in a cognitive process should be crucially driven by the chunks stored in the goal and/or imaginal buffers.

Bayes+ACT-R Lexical Decision with Imaginal-Buffer Involvement and Default Encoding Delay for the Imaginal Buffer
We now turn to the first of two alternative Bayes+ACT-R models, both of which crucially involve the imaginal buffer as an intermediary between the visual what and retrieval buffers. The full code for the model discussed in this subsection is available at the link provided in the appendix to this chapter (see Sect. 7.7.1). The Bayesian model remains the same; the only part we change is the ACT-R likelihood for latencies. Specifically, we modify the procedural core of the ACT-R model as shown in [py43] below. We first add the imaginal buffer to the model (line 1 in [py43]), and then replace the "attend word" and "retrieving" rules with three rules: "attend word" (lines 4-21), "encoding word" (lines 28-42) and "retrieving" (lines 48-62).
The new rule "encoding word" mediates between "attend word" and "retrieving". The visual value retrieved by the "attend word" rule is shifted into the imaginal buffer by the "encoding word" rule. Then, the "retrieving" rule takes that value, i.e., word form, from the imaginal buffer and places it into the retrieval buffer.
The top plot in Fig. 7.6 shows that the model has a very poor fit to the latency data. Adding the imaginal-buffer mediated encoding step adds 200 ms to every lexical decision simulation, since 200 ms is the default ACT-R delay for chunk-encoding into the imaginal buffer.
We therefore see that the predicted latencies for all 16 word-frequency bands are greatly overestimated (they are far above the red diagonal line). The model with the imaginal buffer cannot run faster than about 725 ms, at least not when the imaginal-buffer encoding delay is left at its default 200 ms value.
We can think of the 200 ms imaginal delay as part of the baseline intercept for our ACT-R model. The intercept is simply too high to fit high-frequency words, for which the lexical decision task should take 100 to 200 ms less than this intercept.
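To see the intercept effect concretely, here is the arithmetic with purely illustrative fixed costs; the individual values below are assumptions for illustration, not the model's actual estimated components:

```python
# Illustrative fixed costs in seconds (assumptions): visual encoding,
# three 50 ms production-rule firings, motor preparation/execution,
# and the imaginal encoding delay.
visual_encoding = 0.085
rule_firings = 3 * 0.05
motor_cost = 0.3
imaginal_delay = 0.2

# The imaginal delay is purely additive: it raises the fastest possible
# simulated lexical decision time by a fixed 200 ms.
floor_with_delay = visual_encoding + rule_firings + motor_cost + imaginal_delay
floor_without = visual_encoding + rule_firings + motor_cost
print(floor_with_delay, floor_without)
```

Since retrieval time only ever adds to this floor, no parameter setting can bring the predicted RTs for high-frequency words below it.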
The Rhat values for this model are once again below 1.1 (in fact, they are very close to 1), as shown in (1).

We see here one of the main benefits of our Bayes+ACT-R framework. We are able to fit any model to experimental data, and we are able to compute quantitative predictions (means and credible intervals) for any model. We are therefore able to quantitatively compare our qualitative theories.
In this particular case, we see that a model that is cognitively more plausible fares more poorly than a simpler, less realistic model.

Bayes+ACT-R Lexical Decision with Imaginal Buffer and 0 Delay
We will now improve the imaginal-buffer model introduced in the previous subsection by setting the imaginal delay to 0 ms, instead of its default 200 ms value. When we built our left-corner parser in Chap. 4, we already saw that the imaginal delay might need to be decreased if we want to have empirically adequate models of linguistic phenomena. This is because natural language interpretation involves the incremental construction of rich hierarchical representations that seriously exceed in complexity the representations needed to model other high-level cognitive processes in ACT-R.

The full code for the model discussed in this subsection is once again available at the link provided in the appendix to this chapter (see Sect. 7.7.1). The only change relative to the model in the previous subsection is setting the delay for the imaginal buffer to 0 when the model is reset to its initial state in the run_stimulus(word) function.

The resulting predictions are plotted against the observed data in Fig. 7.7. We see that, once the latency 'intercept' of the ACT-R model is suitably lowered by removing the imaginal-encoding delay, a cognitively plausible model that makes crucial use of the imaginal buffer can fit the data very well.
The Rhat values for this model are also below 1.1, as shown in (2).

We now have a formally explicit way to connect competence-level theories to experimental data via explicit processing models. That is, we can formally and explicitly connect qualitative (symbolic, competence-level) theory construction, the main business of the generative grammarian, with quantitative (subsymbolic, performance-level) statistical inference and model comparison based on experimentally collected behavioral data, the main business of the experimental linguist.
Traditionally, these are separate activities that are only connected informally. The fundamental vagueness of this informal connection is intrinsically unsatisfactory. But, in addition to that, this vagueness encourages the generative grammarian and the experimental linguist to work in separate spheres, with the generative grammarian developing sophisticated theories with a relatively weak empirical basis, and the experimental linguist often using an informal, overly simplified theory that can fit in the Procrustean bed of a multi-way ANOVA (or similar linear models).
There are several reasons for embedding ACT-R models in Bayesian models for statistical inference, rather than just using maximum likelihood estimation. These reasons are not specific to ACT-R models, but are brought into sharper relief by the complexity of these models relative to the generalized linear models standardly used in (psycho)linguistics.
The first reason is that we can put information from previous ACT-R work into the priors. Most importantly, however, the Bayesian framework enables us to perform generalized model comparison (via Bayes factors, or using other criteria). In contrast, maximum likelihood model comparison fails for models for which we cannot estimate the number of parameters. Estimating the number of parameters is already difficult for models with random effects; for hybrid symbolic-subsymbolic models like the ACT-R ones we have been constructing, the question of identifying the "number" of parameters is not even well-formed.
For a distinct line of argumentation that the integration of ACT-R and Bayesian models is a worthwhile endeavor, see Weaver (2008).

Modeling Self-paced Reading with a Left-Corner Parser
Apart from the lexical decision model, Chap. 4 also showed how to implement a left-corner parser model in ACT-R/pyactr. We noted in that chapter that the parsing model was not realistic due to, among other things, its simplifying assumption that memory retrievals of lexical information always take a fixed amount of time, irrespective of the specific state and properties of the components of the recall process (the specific recall cue, the state of declarative memory, the contents of the other buffers etc.). Since we now have a more realistic model of lexical access at our disposal, we might want to investigate whether this model could also be used to improve our parsing model.
We can go even further than that: one interesting property of ACT-R is that it assumes one model of retrieval irrespective of the cognitive sub-domain under consideration. We can therefore ask how this model of retrieval fares with respect to language: can we model both syntactic and lexical retrieval using the same mechanisms and the same parameter values within one ACT-R model? ACT-R/pyactr left-corner parsing models addressing these questions were introduced and discussed in Brasoveanu and Dotlačil (2018). In this section, we will summarize the main points of that work.
Brasoveanu and Dotlačil (2018) studied the fit of the parser to human data by simulating Experiment 1 in Grodner and Gibson (2005). 9 This is a self-paced reading experiment (non-cumulative moving-window; Just et al. 1982): participants read sentences that do not appear as a whole on the screen. Rather, they have to press a key to reveal the first word, and with every key press, the previous word disappears and the next word appears. What is measured (and modeled) is how much time people spend on each word.
The modeled self-paced reading experiment has two conditions. In one condition, the subject noun phrase is modified by a subject-gap relative clause-see (3) below. In the second condition, the subject noun phrase is modified by an object-gap relative clause-see (4) below.
Using relative clauses is crucial, since this allows us to study the properties of syntactic retrieval. At the gap site, indicated as t i in (3/4) below, the parser has to retrieve the wh-word from declarative memory to correctly interpret the relative clause. Studying the reading-time profiles of these sentences can therefore help us understand the latencies of both lexical and syntactic recall.
(3) The reporter who i t i sent the photographer to the editor hoped for a story.
(4) The reporter who i the photographer sent t i to the editor hoped for a story.
Just as when we modeled lexical decision, Brasoveanu and Dotlačil (2018) built more than one model and quantitatively compared them. This comparison is a necessary part of developing good ACT-R models, and cognitive models in general. 10 But more importantly, it enables us to gain insight into underlying (unobservable) cognitive mechanisms by identifying the better fitting model(s) in specific ROIs.
In total, three models were created to simulate self-paced reading and parsing. All three models were extensions of the eager left-corner parser described in Sect. 4.4. The two main modifications were: (i) the parser was extended with a more realistic model of lexical access, the same as the one used in the second ACT-R model for lexical decision in this chapter (see Sect. 7.3), and (ii) the parser had to recall the wh-word in the relative clause to correctly parse it. The parser incorporated visual and motor modules, just like the one in Chap. 4.
The three models differ from each other in two respects. First, Models 1 and 2 assume a slightly different order of information processing than Model 3. Models 1 and 2 are designed in a strongly serial fashion:

• first, a word w is attended visually;
• after that, its lexical information is retrieved, and syntactic retrieval also takes place (if applicable, e.g., when we need to retrieve the relativizer who i);
• the parse tree is then created; and finally,
• visual attention is moved to the next word w + 1 at the same time as the motor module is instructed to press a key to reveal that word;
• then the whole process is repeated for word w + 1.
The processes in Model 3 were staged in a more parallel fashion: after lexical retrieval, syntactic retrieval (if applicable) and syntactic parsing happened at the same time as visual-attention and motor commands were prepared and executed. This difference is schematically shown in Figs. 7.8 and 7.9.
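The consequence of the two scheduling regimes for per-word reading times can be sketched in a few lines of Python. The stage durations below are purely illustrative values, not the parameters estimated in Brasoveanu and Dotlačil (2018): under serial scheduling, per-word time is the sum of all stages, while in the more parallel regime the syntactic step overlaps with visual-attention and motor preparation, so the slower of the two branches determines the remaining time.

```python
# Hypothetical stage durations (in ms); illustrative values only,
# not the parameters estimated in the models discussed here.
stages = {
    "visual_attention": 85,    # attend word w
    "lexical_retrieval": 150,  # retrieve the lexical entry for w
    "syntactic_step": 100,     # syntactic retrieval + tree building
    "motor_prep": 120,         # prepare/execute key press for word w+1
}

def serial_word_time(s):
    """Models 1 and 2 (sketch): every stage waits for the previous one."""
    return sum(s.values())

def parallel_word_time(s):
    """Model 3 (sketch): after lexical retrieval, the syntactic step runs
    concurrently with visual/motor preparation, so only the slower of the
    two branches contributes to the remaining per-word time."""
    return (s["visual_attention"] + s["lexical_retrieval"]
            + max(s["syntactic_step"], s["motor_prep"]))

print(serial_word_time(stages))    # 455
print(parallel_word_time(stages))  # 355
```

Under these toy values, the parallel regime saves the full duration of the shorter overlapped branch (100 ms here), which is the intuition behind Model 3's faster predicted reading times.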
The second way the models differ is with respect to the analysis of subject gaps. Models 1 and 3 assume that the parser predictively postulates the subject gap immediately after reading the wh-word (word 3 in (3) and (4)). This strategy should slow down the parser on the wh-word itself, since it has to postulate the upcoming gap when reading it. But the strategy predicts that the parser will speed up when reading the following word in the subject-relative clause sentence (3), since the parser has already postulated the gap and nothing further needs to be done to parse the gap.
In contrast, Model 2 assumes that no subject gap is predictively postulated when reading the wh-word: the gap is parsed bottom-up. This strategy predicts faster reading times on the wh-word compared to Models 1 and 3. But it also predicts a slowdown on the next word in the subject-relative clause sentence (3), since it is at this point that the subject gap is parsed/postulated.

Why would we compare these three models? The main reason is to test two distinct hypotheses (qualitative/symbolic theories) about the human processor. These hypotheses are commonly entertained in psycholinguistics, but are not usually fully formalized and computationally implemented.
It is important to realize that we can never establish at an informal level whether hypotheses embedded in complex competence-performance theories like the ones we're entertaining here make correct predictions. To really test hypotheses and theories at this level of complexity, we need to fully formalize and computationally implement them, and then attempt to fit them to experimental data. We don't really know what a complex model does until we run it.
The two hypotheses we test are the following. First, we implement and test the standard assumption that the parser is predictive and fills in gap positions before they appear (cf. Stowe 1986;Traxler and Pickering 1996). Given that hypothesis, we expect that Models 1 and 3 fit the reading data better than Model 2.
Second, it is commonly assumed that processing is to some degree parallel. In particular, a standard assumption of one of the leading models of eye movement (E-Z Reader, Warren et al. 2009) is that moving visual attention to word n + 1 happens alongside the syntactic integration of word n. Under this hypothesis, we expect Model 3 to have a better fit than Models 1 and 2.
Both predictions turn out to be correct, supporting previous claims and showing that these claims hold under the careful scrutiny of fully formalized and computationally implemented end-to-end models that are quantitatively fit to experimental data.
Equally importantly, this shows that our Bayes+ACT-R framework can be used to quantitatively test and compare qualitative (symbolic) hypotheses about cognitive processes underlying syntactic processing.
The code for Model 3 is linked to in the appendix to this chapter (see Sect. 7.7.2). The three models were fit to experimental data from Grodner and Gibson (2005) (their Experiment 1) using the Bayesian methods described previously in this chapter.
Four parameters were estimated: the k parameter, which scales the effect of visual distance; the rule firing parameter r; the latency factor lf; and the latency exponent le. Of these, only the first two have not been discussed in this chapter.
The rule firing parameter specifies how much time any rule should take to fire and has been (tacitly) used throughout the whole book. The default value of this parameter, which we always used up to this point, is 50 ms. The k parameter is used to modulate the amount of time that visual encoding (T_enc) takes, and it has been discussed in Chap. 4, Sect. 4.3.1. 11 We fit the k parameter mainly to show that parameters associated with peripherals (e.g., the visual and motor modules) can be jointly estimated with procedural and declarative memory parameters when fitting full models to data.
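As a concrete sketch, the visual encoding time T_enc can be written as a small Python function, following the formula spelled out in footnote 11 (with word length substituted for −log f). The function name and the particular values of k and d in the example are hypothetical, not the estimated posterior values:

```python
import math

def t_enc(word, d, k, K=0.01):
    """Sketch of visual encoding time: T_enc = K * len(word) * exp(k * d),
    where d is visual distance in degrees of visual angle, k scales the
    effect of that distance, and word length stands in for -log f
    (normalized word frequency). K is set to its default value of 0.01."""
    return K * len(word) * math.exp(k * d)

# Illustrative call: an 8-letter word at 2 degrees of visual angle,
# with a hypothetical k = 0.5.
print(t_enc("reporter", d=2.0, k=0.5))
```

Note how encoding time grows exponentially with visual distance d but only linearly with word length, so k has a much stronger effect on the model's predictions than the length of the word being encoded.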
After we fit (the parameters of) the three models to data, we collect the posterior predictions for the 9 ROIs in the two (subject-gap and object-gap) conditions, plotted in Figs.7.10, 7.11 and 7.12. The diamonds in these graphs indicate the actual, observed mean RTs for each word from Grodner and Gibson (2005). The bars provide the 95% CRIs for the posterior mean RTs, which are plotted as filled dots.
It is important to note that what we estimate here are parameters for the full process of reading the 9 ROIs. We do not estimate means and CRIs region by region (which is the current standard in the field), falsely assuming independence and leaving the underlying dependency structure, i.e., the parsing process, largely implicit and unexamined.

Figure 7.10 shows that Model 1 captures wh-gap retrieval well: the observed mean reading times on the 3rd word (sent) in the top panel (subj-gap) and the 5th word (also sent) in the bottom panel (obj-gap) fall within the CRI. However, the spillover effect on the word after the object gap, namely the 6th word (to) in the bottom panel, is not captured: the model underestimates it severely.

The posterior predictions of Model 2, provided in Fig. 7.11, are clearly worse: the 95% CRIs are completely below the observed mean RTs for the wh-word in both conditions, and also for the word immediately following the wh-word in the object-gap condition. This indicates that the model underestimates the parsing work triggered by the wh-word, and it also underestimates the reanalysis work that needs to be done on the word immediately following the wh-word in the object-gap condition.

11 Recall that T_enc is specified by the function K · (−log f) · e^(k·d), where k is the parameter that scales the effect of visual distance d, measured in degrees of visual angle, and f is the (normalized) frequency of the object (word) being encoded. However, since word frequency affects lexical retrieval, we do not need to use it in visual encoding, so we substitute word length (a straightforward visual property of a word) for −log f. K is another parameter, set to its default value of 0.01.
Finally, Model 3 is the best among these three models. It captures the spillover effect for object gaps and increases the precision of the estimates (note the smaller CRIs). At the same time, Model 3 maintains the good fit exhibited by Model 1 (but not Model 2) for the wh-word and the following word in both conditions. This is shown in Fig. 7.12. As we already mentioned, the code for this final and most successful model is linked to in the appendix to this chapter (Sect. 7.7.2).
This relatively informal quantitative comparison between models can be made more precise by using WAIC measures for model comparison. For example, if we use the variance-based WAIC2, 12 we can clearly see that Model 3 has the most precise posterior estimates for the Grodner and Gibson (2005) data; see Brasoveanu and Dotlačil (2018) for more details.
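The variance-based WAIC2 penalty can be computed directly from a matrix of pointwise log-likelihoods (posterior samples by observations), following the standard lppd/p_WAIC2 formulation. The sketch below uses synthetic log-likelihoods rather than the actual model traces:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic pointwise log-likelihoods: S posterior samples x N observations.
# In practice this matrix comes from the fitted model's trace.
S, N = 2000, 50
loglik = rng.normal(loc=-3.0, scale=0.2, size=(S, N))

# lppd: log pointwise predictive density (log of the posterior mean
# likelihood, summed over observations)
lppd = np.log(np.exp(loglik).mean(axis=0)).sum()
# variance-based penalty: sum over observations of the posterior
# variance of the log-likelihood (p_WAIC2)
p_waic2 = loglik.var(axis=0, ddof=1).sum()
waic = -2 * (lppd - p_waic2)
print(round(waic, 1))
```

Lower WAIC values indicate better expected predictive fit; the variance-based penalty is what makes this criterion usable for models whose "number of parameters" is not well-defined, like the hybrid symbolic-subsymbolic models considered here.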
Thus, we see that the left-corner parser, first introduced in Chap. 4, can be extended with a detailed, independently motivated theory of lexical and syntactic retrieval. The resulting model can successfully simulate reading time data from a self-paced reading experiment.
The fact that the three models we considered help us distinguish between several theoretical assumptions, and that the model with the best fit implements hypotheses that we expect to be correct for independent reasons, is encouraging for the whole enterprise of computational cognitive modeling pursued in this book. Finally, we see that Model 3 does not show any clear deficiencies in simulating the mean RTs of self-paced reading tasks, even though it presupposes one and the same framework for both lexical and syntactic retrieval. This supports the ACT-R position of general recall mechanisms across various cognitive sub-domains, including linguistic sub-domains such as lexical and syntactic memory.
Before concluding, we have to point out that, even though the investigation presented in Brasoveanu and Dotlačil (2018) and summarized in this section is very promising, it is rather preliminary, particularly when compared to the models in the rest of this chapter and the rest of this book. Furthermore, the estimates of the three models we just discussed were obtained using different sampling methods than the ones we use for the Bayes+ACT-R models throughout this book.
Improving on these preliminary results and models, and investigating if the sampling methods used in this book would substantially benefit the ACT-R models in Brasoveanu and Dotlačil (2018) is left for a future occasion.

Conclusion
The models discussed in this chapter show that the present computational implementation of ACT-R can be used to successfully fit data from various linguistic experiments, as well as compare and evaluate assumptions about underlying linguistic representations and parsing processes. While one could investigate many experiments using the presented methodology, we opted for a different approach here, focusing only on a handful of studies and dissecting modeling assumptions and the way computational modeling can be done in our Bayes+ACT-R/pymc3+pyactr framework.
We take the results to be encouraging. We believe they provide clear evidence that developing precise and quantitatively accurate computational models of lexical access and syntactic processing is possible in the proposed Bayes+ACT-R framework, and a fruitful way to pursue linguistic theory development.
Unfortunately, the models are clearly incomplete with respect to many aspects of natural language use in daily conversation. One important aspect stands out as completely missing from these models: natural language meaning and interpretation.
Our goal in conversation is to understand what others tell us, not (just) recall lexical information and meticulously parse others' messages into syntactic trees. Ultimately, we construct meaning representations, and computational cognitive models should have something to say about that. This is precisely what the next two chapters of this book will address.

Appendix: The Bayes and Bayes+ACT-R Models
All the code discussed in this chapter is available on GitHub as part of the repository https://github.com/abrsvn/pyactr-book. If you want to examine it and run it, install pyactr (see Chap. 1), download the files and run them the same way as any other Python script.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.