Introduction

In recent years, models from the field of computational linguistics have increasingly been used for explaining psycholinguistic data, and conversely, psycholinguistic data have increasingly been used to evaluate computational models. In contrast to typical psychological models, the algorithms developed by computational linguists are rarely intended to explain specific phenomena. Rather, they have broad coverage, being able to handle sentences in natural language. As such, it has become common practice to evaluate these models by comparing their word-level predictions with human word-reading times collected over general texts. The data set most often used in this context is the Dundee corpus (Kennedy & Pynte, 2005), comprising eye-tracking data of 10 subjects reading newspaper editorials in English. These data have been used for model evaluation by, among others, Demberg and Keller (2008); Fossum and Levy (2012); Frank and Bod (2011); and Mitchell, Lapata, Demberg, and Keller (2010). Alternatively, model predictions have also been compared with self-paced reading data over English narrative texts (Roark, Bachrach, Cardenas, & Pallier, 2009; Wu, Bachrach, Cardenas, & Schuler, 2010).

Two potential problems arise when using newspaper or narrative texts for model evaluation. First, the sentence-processing models invariably treat sentences as independent entities, whereas the interpretation of a sentence in text depends on the previous sentences. Second, and perhaps more important, understanding newspaper or narrative texts requires vast amounts of extra-linguistic knowledge to which the models have no access. For these reasons, a more appropriate data set for model evaluation would consist of independent sentences that can be understood out of context. One such data set is the Potsdam Corpus of 144 German sentences with eye-tracking data of 222 subjects (Kliegl, Nuthmann, & Engbert, 2006), used for model evaluation by Boston, Hale, Patil, Kliegl, and Vasishth (2008) and Boston, Hale, Vasishth, and Kliegl (2011). However, the Potsdam Corpus sentences are unlikely to form a representative sample of the language, because they were manually constructed as experimental stimuli.

Here, we present a collection of reading time data that is intended to serve as a gold standard for evaluating computational psycholinguistic models of English sentence comprehension. The word-reading times were obtained using two different paradigms: self-paced reading and eye tracking. These data were collected over independent English sentences that (unlike those of the Potsdam Corpus) were not constructed for experimental purposes. Instead, they were selected from different narrative sources and, as such, can be considered a “sample” of written narrative English, comprising a wide range of structures.

The following sections discuss the selection of stimulus material, provide details of how the reading times were collected, and present descriptive statistics of the data, as well as results of comparisons between the two paradigms. All materials and data are available as online supplementary material. Appendix 1 lists the available data files and describes their content.

Method

Stimuli selection

On the Web site www.free-online-novels.com, aspiring authors can upload their (otherwise unpublished) work. We selected three novels that were categorized under different genres, all of which used British English spelling. These were Aercu by Elisabeth Kershaw, The Unlikely Hero by Jocelyn Shanks, and Hamsters! (or: What I Did On My Holidays by Emily Murray) by Daniel Derrett.

Next, a list of 7,754 words was constructed by merging the list of high-frequency content words used in Andrews, Vigliocco, and Vinson (2009) with the 200 most frequent English words (mostly function words). The list included two punctuation marks: the comma and the period. From the three novels, we selected all sentences that contained only words from this list, were at least five words long, and included at least two content words. Of the resulting sentences, we hand-picked the 361 that could be interpreted out of context with only minimal involvement of extra-linguistic knowledge. The average sentence length was 13.7 words (SD = 6.36; median = 12; maximum = 38). For 166 of the sentences, yes/no comprehension questions were constructed.
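
Because the selection criteria are mechanical, they can be expressed as a short filter. The following Python sketch illustrates this; the input file names are hypothetical, since the original selection scripts are not part of the materials:

```python
# Hypothetical inputs: the merged 7,754-word list, the content-word subset
# (Andrews et al., 2009), and candidate sentences from the three novels.
allowed = set(open("word_list.txt").read().split())
content = set(open("content_words.txt").read().split())
sentences = open("novel_sentences.txt").read().splitlines()

def qualifies(sentence: str) -> bool:
    """At least five words, all on the word list, at least two content words."""
    words = [w.strip(",.").lower() for w in sentence.split()]
    return (len(words) >= 5
            and all(w in allowed for w in words)
            and sum(w in content for w in words) >= 2)

selected = [s for s in sentences if qualifies(s)]
```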

Obvious typos and grammatical errors were fixed. To prevent subjects from connecting the sentence stimuli into a story, proper names were replaced by same-gender names (from the high-frequency word list) such that no name appeared more than twice across the stimuli. A selection of the resulting sentences is presented in Appendix 2.

Fernandez Monsalve, Frank, and Vigliocco (2012) annotated the sentences with part-of-speech tags—that is, labels indicating the words’ syntactic categories. Labels were first generated automatically by a part-of-speech tagging algorithm (Tsuruoka & Tsujii, 2005), after which they were manually corrected to comply with the Penn Treebank tagging guidelines (Santorini, 1991). This annotation is also available as online supplementary material but is not relevant to the collection of reading time data.
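
For readers unfamiliar with the format, Penn Treebank-style tags can be illustrated with NLTK’s off-the-shelf tagger; this is merely an illustration of the tag set, not the tagger used by Fernandez Monsalve et al.:

```python
import nltk  # setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The dog chased the cat.")
print(nltk.pos_tag(tokens))
# e.g., [('The', 'DT'), ('dog', 'NN'), ('chased', 'VBD'),
#        ('the', 'DT'), ('cat', 'NN'), ('.', '.')]
```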

Self-paced reading

Subjects

One hundred seventeen first-year psychology students at University College London (92 females, 70 native English speakers, mean age = 18.9 years) participated in the self-paced reading study as part of their undergraduate degree.

Procedure

The study was preceded by an unrelated lexical decision experiment that took approximately 10 min, followed by six self-paced reading practice trials. Sentence stimuli were repeatedly selected at random from the 361 experimental sentences until 40 min had elapsed (including the time spent on the lexical decision and practice trials). Each sentence was preceded by a fixation cross, presented centrally on a computer monitor. As soon as the subject pressed the space bar, the fixation cross was replaced by the sentence’s first word in 40-point Courier New font. At each subsequent keypress, the current word was replaced by the next, always at the same central location on the display. The time between word presentation and keypress was recorded as the reading time on that word. Punctuation marks were presented together with the directly preceding word. After completion of a sentence, the corresponding comprehension question (if any) appeared, and subjects responded yes or no by pressing the J or F key, respectively.
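
Schematically, each trial amounts to the following loop. This is a sketch only: show_word, wait_for_key, and ask_question are hypothetical stand-ins for the presentation software, which is not specified above:

```python
import time

def run_trial(sentence, question=None):
    """Schematic self-paced reading trial (hypothetical helper functions)."""
    wait_for_key("space")                   # dismiss the fixation cross
    reading_times = []
    for word in sentence.split():           # punctuation stays attached to its word
        show_word(word)                     # always at the same central location
        t0 = time.perf_counter()
        wait_for_key("space")               # keypress reveals the next word
        reading_times.append(time.perf_counter() - t0)
    if question is not None:
        ask_question(question)              # J = yes, F = no
    return reading_times
```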

Eye tracking

Materials and subjects

Of the original 361 sentence stimuli, the 205 that fit on a single line of the display were used in the eye-tracking study. These selected sentences are listed in Appendix 2.

Forty-eight subjects, recruited from the University College London subject pool, took part and were paid £7. Four subjects were excluded because of technical issues, and one because he had already taken part in the self-paced reading study, leaving 43 subjects (27 females, 37 native English speakers, mean age = 25.8 years) whose data were analyzed.

Procedure

Subjects were seated 50 cm from the monitor with their chin on a chinrest. Both eyes were tracked using a head-mounted eye tracker (SR Research EyeLink II). Individual sentences were presented in 18-point Courier font, left-aligned on the display. Punctuation marks were attached to the directly preceding word. Each sentence was preceded by a left-aligned fixation cross that was presented for 800 ms. Gaze direction was sampled at a rate of 500 Hz.

After initial calibration (nine fixation points) and five practice trials, subjects were invited to ask clarification questions, and the experiment began. Calibration was checked again after the practice items and then after every 35 trials, at which point subjects took a self-paced break (the final set had only 30 trials, for a total of 205 trials over six sets). Additionally, drift correction on a single, centrally located fixation point was performed before the start of each trial. Responses were recorded using a mouse (center button to continue after finishing a sentence; right and left buttons to respond yes or no, respectively, to comprehension questions). The entire experiment, including instructions and calibration, took approximately 60 min to complete. The order of sentence presentation was randomized throughout.

Word-reading times

A word was considered not fixated if the first fixation on that word occurred after any fixation on a word further to the right. Four measures of reading time were extracted from the eye-tracking data (a code sketch operationalizing these definitions follows the list):

  • First-fixation time: duration of only the first fixation on the current word.

  • First-pass time (also known as gaze duration): summed duration of all fixations on the current word before the first fixation on any other word.

  • Right-bounded time: summed duration of all fixations on the current word before the first fixation on a word further to the right.

  • Go-past time (also known as regression-path time): summed duration of all fixations from the first fixation on the current word up to (but not including) the first fixation on a word further to the right. Note that this often includes fixations on words to the left of the current word.
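
These definitions can be operationalized as follows; the sketch assumes fixations are given as a chronological list of (word position, duration) pairs, a representation we introduce here purely for illustration:

```python
def rt_measures(fixations, w):
    """Reading time measures for the word at position w, given `fixations`
    as a chronological list of (word_position, duration_ms) pairs.
    Returns None if the word counts as not fixated (its first fixation
    came only after a fixation on some word further to the right)."""
    first = next((i for i, (pos, _) in enumerate(fixations) if pos == w), None)
    if first is None or any(pos > w for pos, _ in fixations[:first]):
        return None

    first_fixation = fixations[first][1]

    # First-pass time: consecutive fixations on w, starting from the first
    first_pass = 0
    for pos, dur in fixations[first:]:
        if pos != w:
            break
        first_pass += dur

    # Index of the first fixation on a word further to the right of w
    past = next((i for i, (pos, _) in enumerate(fixations)
                 if i > first and pos > w), len(fixations))

    right_bounded = sum(dur for pos, dur in fixations[first:past] if pos == w)
    go_past = sum(dur for _, dur in fixations[first:past])  # includes regressions

    return {"first_fixation": first_fixation, "first_pass": first_pass,
            "right_bounded": right_bounded, "go_past": go_past}

fix = [(0, 210), (1, 180), (0, 95), (1, 140), (2, 200)]
rt_measures(fix, 1)  # {'first_fixation': 180, 'first_pass': 180,
                     #  'right_bounded': 320, 'go_past': 415}
```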

Results and discussion

Error rates

The percentage of response errors to comprehension questions was significantly larger in the self-paced reading study than in the eye-tracking study (15.8% vs. 11.5%), z = 6.91, p < .0001. As is apparent from the histograms in Fig. 1, this difference was mainly caused by a number of self-paced reading subjects who scored very badly, which we ascribe to a lack of motivation due to their compulsory participation. From here on, we discard data from subjects whose error rate exceeded 25%, which leaves 104 self-paced reading and 42 eye-tracking subjects and decreases the respective average error rates to 13.0% and 11.2%. Although the difference in error rate between the two paradigms is thereby reduced, it remains statistically significant, z = 2.99, p < .003.
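
The text does not specify which test produced these z values; a pooled two-proportion z-test is one standard choice and is sketched below (the per-paradigm question counts are not reported here, so the function takes them as arguments):

```python
import math

def two_proportion_z(errors1, n1, errors2, n2):
    """Pooled two-proportion z-test for a difference in error rates;
    n1 and n2 are the numbers of comprehension questions answered."""
    p1, p2 = errors1 / n1, errors2 / n2
    pooled = (errors1 + errors2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```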

Fig. 1 Distribution of subjects over error rates

Reading times

Predictions by different broad-coverage statistical language models were evaluated on the native speakers’ self-paced reading (Fernandez Monsalve et al., 2012; Frank, in press) and eye-tracking (Frank & Thompson, 2012) data. All three studies showed that a word takes longer to read if the models predict that it will lead to increased cognitive-processing load. This establishes the validity not only of the models, but also of the data.

Here, we focus on descriptive statistics of the reading time data and on comparing the two experimental paradigms. Words to which punctuation was attached (including all sentence-final words) and nonfixated words were excluded from the analysis (the percentage of nonfixated words was 34.0% overall but varied widely among subjects, from 5.5% to 60.4%). Also, three sentences that contained a typo in the self-paced reading study were removed, leaving 274,893 and 44,371 data points from self-paced reading and eye tracking, respectively.

Reading time distributions

The distribution of each set of reading time data (aggregated over all subjects and after log-transformation) can be seen in the boxplots of Fig. 2. It is clear that the distributions have “fat tails”; that is, they are not normally distributed but have excess kurtosis. This is also evident from the quantile–quantile plots of Fig. 3, where log-transformed reading times are standardized per subject. None of the distributions are normal, as is confirmed by Lilliefors tests for normality (Lilliefors, 1967), the test statistics of which are also plotted in the figure (all ps < .001).
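
The per-subject standardization and normality test can be reproduced along the following lines, assuming the raw reading times are available per subject in a dict rt (a name we introduce here):

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

# Assumed input: `rt` maps each subject to a NumPy array of raw reading times (ms)
log_rt = {s: np.log(x) for s, x in rt.items()}

# Standardize within subjects, then pool across subjects (as in Fig. 3)
z = np.concatenate([(x - x.mean()) / x.std() for x in log_rt.values()])

L, p = lilliefors(z, dist="norm")  # larger L = stronger deviation from normality
```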

Fig. 2 Boxplots of log-transformed reading time data. Boxes denote the lower quartile, median, and upper quartile; whiskers extend to 1.5 times the interquartile range. Note that labels on the vertical axis are nontransformed reading times

Fig. 3 Quantile–quantile plots of standardized log-transformed reading times against a standard normal distribution, and L-statistics of Lilliefors tests of normality (larger L corresponds to stronger deviation from a normal distribution)

Comparison between paradigms

Table 1 shows the Spearman rank correlations between reading times (averaged over subjects) from the self-paced reading and eye-tracking studies. In order to detect possible spillover and parafoveal preview effects, correlations are also presented for “misaligned” reading times; that is, reading times from eye tracking are correlated to the self-paced reading time of the previous or next word.

Table 1 Spearman’s rank correlation between eye-tracking reading times on words at position t and self-paced reading times of words at t − 1, t, and t + 1
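
A sketch of how such lagged correlations can be computed, assuming the aligned word-level averages are held in a pandas DataFrame df with columns sentence, et, and spr (the column names are ours):

```python
import pandas as pd
from scipy.stats import spearmanr

# Shift self-paced reading times within sentences, so that misaligned
# pairs never cross a sentence boundary.
spr_by_sentence = df.groupby("sentence")["spr"]
for lag, label in [(1, "t - 1"), (0, "t"), (-1, "t + 1")]:
    aligned = df.assign(spr_lag=spr_by_sentence.shift(lag))
    aligned = aligned.dropna(subset=["et", "spr_lag"])
    rho, p = spearmanr(aligned["et"], aligned["spr_lag"])
    print(f"SPR at {label}: rho = {rho:.2f}")
```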

The correlations between eye tracking and self-paced reading times are much larger when looking at self-paced times on the current or next word (second and third rows of Table 1) than on the previous word (first row). This means that the keypresses in the self-paced reading study lag behind eye fixations, whereas the reverse is less evident. Apparently, there is a stronger spillover effect in self-paced reading than in eye tracking, as is consistently found when comparing the two paradigms (e.g., Just, Carpenter, & Woolley, 1982; Witzel, Witzel, & Forster, 2012). In addition, the positive correlation between fixation durations at t and self-paced reading times at t + 1 could be due to parafoveal preview in the eye-tracking study. In that case, processing difficulty caused by the upcoming word is already apparent while the eye is still fixated on the current word. However, whether such parafoveal-on-foveal effects indeed occur is controversial (for a review, see Schotter, Angele, & Rayner, 2012).

Conclusion

The data we have presented can further the field of computational psycholinguistics by allowing for a comparison of different models’ psychological adequacy on one and the same set of data. As was argued in the Introduction, these data may be more appropriate for such evaluation than the reading times over narrative or expository texts that have been used so far. Moreover, our sample of sentences is more representative of the language than are sentences constructed to serve as experimental stimuli.

Although we have only discussed reading times here, it is also possible to look at other aspects of the eye-tracking data—in particular, regression probabilities (the supplementary material includes fixation-level data). Moreover, a sophisticated cognitive model of sentence comprehension and question answering may even be validated against the error rates for comprehension questions; that is, a question that leads to more errors by humans should also be more difficult to answer for the model.

The present data set can easily be extended by increasing the number of subjects, but more interesting future additions could include sentences from different source types (e.g., expository texts), additional data types (e.g., EEG), and material in other languages.