Reading time data for evaluating broad-coverage models of English sentence processing
We make available word-by-word self-paced reading times and eye-tracking data over a sample of English sentences from narrative sources. These data are intended to form a gold standard for the evaluation of computational psycholinguistic models of sentence comprehension in English. We describe stimuli selection and data collection and present descriptive statistics, as well as comparisons between the two sets of reading times.
Keywords: Word-reading time, Self-paced reading, Eye tracking, Sentence comprehension, Model evaluation
In recent years, models from the field of computational linguistics have increasingly been used for explaining psycholinguistic data, and conversely, psycholinguistic data have increasingly been used to evaluate computational models. In contrast to typical psychological models, the algorithms developed by computational linguists are rarely intended to explain specific phenomena. Rather, they have broad coverage, being able to handle sentences in natural language. As such, it has become common practice to evaluate these models by comparing their word-level predictions with human word-reading times collected over general texts. The data set most often used in this context is the Dundee corpus (Kennedy & Pynte, 2005), comprising eye-tracking data from 10 subjects reading newspaper editorials in English. These data have been used for model evaluation by, among others, Demberg and Keller (2008); Fossum and Levy (2012); Frank and Bod (2011); and Mitchell, Lapata, Demberg, and Keller (2010). Alternatively, model predictions have also been compared with self-paced reading data over English narrative texts (Roark, Bachrach, Cardenas, & Pallier, 2009; Wu, Bachrach, Cardenas, & Schuler, 2010).
Two potential problems arise when using newspaper or narrative texts for model evaluation. First, the sentence-processing models invariably treat sentences as independent entities, whereas the interpretation of a sentence in text depends on the previous sentences. Second, and perhaps more important, understanding newspaper or narrative texts requires vast amounts of extra-linguistic knowledge to which the models have no access. For these reasons, a more appropriate data set for model evaluation would consist of independent sentences that can be understood out of context. One such data set is the Potsdam Corpus of 144 German sentences with eye-tracking data of 222 subjects (Kliegl, Nuthmann, & Engbert, 2006), used for model evaluation by Boston, Hale, Patil, Kliegl, and Vasishth (2008) and Boston, Hale, Vasishth, and Kliegl (2011). However, the Potsdam Corpus sentences are unlikely to form a representative sample of the language, because they were manually constructed as experimental stimuli.
Here, we present a collection of reading time data that is intended to serve as a gold standard for evaluating computational psycholinguistic models of English sentence comprehension. The word-reading times were obtained using two different paradigms: self-paced reading and eye tracking. These data were collected over independent English sentences that (unlike those of the Potsdam Corpus) were not constructed for experimental purposes. Instead, they were selected from different narrative sources and, as such, can be considered a “sample” of written narrative English, comprising a wide range of structures.
The following sections discuss the selection of stimulus material, provide details of how the reading times were collected, and present descriptive statistics of the data, as well as results of comparisons between the two paradigms. All materials and data are available as online supplementary material. Appendix 1 lists the available data files and describes their content.
On the Web site www.free-online-novels.com, aspiring authors can upload their (otherwise unpublished) work. We selected three novels that were categorized under different genres, all of which used British English spelling. These were Aercu by Elisabeth Kershaw, The Unlikely Hero by Jocelyn Shanks, and Hamsters! (or: What I Did On My Holidays by Emily Murray) by Daniel Derrett.
Next, a list of 7,754 words was constructed by merging the list of high-frequency content words used in Andrews, Vigliocco, and Vinson (2009) with the 200 most frequent English words (mostly function words). The list included two punctuation marks: the comma and the period. From the three novels, all sentences were selected that contained only words from the word list, were at least five words long, and included at least two content words (Footnote 1). Of the resulting sentences, we hand-picked the 361 that could be interpreted out of context with only minimal involvement of extra-linguistic knowledge. The average sentence length was 13.7 words (SD, 6.36; median, 12; maximum, 38). For 166 of the sentences, yes/no comprehension questions were constructed.
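The automatic part of this selection step can be illustrated with a minimal sketch. The function name, the toy word lists, and the decision not to count the comma and period toward sentence length are assumptions made for illustration only; the text does not specify how the filter was implemented, and the subsequent hand-picking step is not modeled.

```python
# Hypothetical sketch of the automatic sentence filter described above.
# Assumption: punctuation tokens do not count toward the five-word minimum.
PUNCTUATION = {",", "."}


def passes_selection(tokens, word_list, content_words,
                     min_words=5, min_content=2):
    """Return True if a tokenized sentence meets the three stated criteria."""
    tokens = [t.lower() for t in tokens]
    # Criterion 1: every token (words and punctuation) is in the word list.
    if any(t not in word_list for t in tokens):
        return False
    # Criterion 2: the sentence is at least `min_words` words long.
    words = [t for t in tokens if t not in PUNCTUATION]
    if len(words) < min_words:
        return False
    # Criterion 3: the sentence has at least `min_content` content words.
    n_content = sum(1 for w in words if w in content_words)
    return n_content >= min_content


# Toy lists standing in for the 7,754-item list used in the study.
content_words = {"dog", "cat", "chased", "old", "garden"}
function_words = {"the", "a", "in", "and"}
word_list = content_words | function_words | PUNCTUATION

print(passes_selection("the old dog chased a cat .".split(),
                       word_list, content_words))
```

Sentences that pass such a filter would still need manual screening for interpretability out of context, as described above.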
Obvious typos and grammatical errors were fixed. To prevent subjects from connecting the sentence stimuli into a story, proper names were replaced by same-gender names (from the high-frequency word list) such that no name appeared more than twice across the stimuli. A selection of the resulting sentences is presented in Appendix 2.
Fernandez Monsalve, Frank, and Vigliocco (2012) annotated the sentences with part-of-speech tags, that is, labels indicating the words’ syntactic categories. Labels were first generated automatically by a part-of-speech tagging algorithm (Tsuruoka & Tsujii, 2005), after which they were manually corrected to comply with the Penn Treebank tagging guidelines (Santorini, 1991). This annotation is also available as online supplementary material but is not relevant to the collection of reading time data.
One hundred seventeen first-year students (92 females, 70 native English speakers, mean age = 18.9 years) of psychology at University College London participated in the self-paced reading study as part of their undergraduate degree.
The study was preceded by an unrelated lexical decision experiment that took approximately 10 min, after which followed six self-paced reading practice trials. Sentence stimuli were repeatedly selected at random from the 361 experimental sentences until 40 min had elapsed (including time spent on the lexical decision and practice trials). Each sentence was preceded by a fixation cross, presented centrally on a computer monitor. As soon as the subject pressed the space bar, the fixation cross was replaced by the sentence’s first word in 40-point Courier New font. At each subsequent keypress, the current word was replaced by the next, always at the same central location on the display. The time between word presentation and keypress was recorded as the reading time on that word. Punctuation marks were presented together with the directly preceding word. After completion of a sentence, the corresponding comprehension question (if any) appeared, and subjects responded yes or no by pressing the J or F key, respectively.
Materials and subjects
Of the original 361 sentence stimuli, the 205 that fit on a single line of the display were used in the eye-tracking study. These selected sentences are listed in Appendix 2.
Forty-eight subjects, recruited from the University College London subject pool, took part, for which they were paid £7. Four subjects were excluded due to technical issues, and 1 because he had already taken part in the self-paced reading study, leaving 43 subjects (27 females, 37 native English speakers, mean age = 25.8 years) with analyzed data.
Subjects were seated 50 cm from the monitor with their chin on a chinrest. Both eyes were tracked using a head-mounted eyetracker (SR Research, EyeLink II). Individual sentences were presented in 18-point Courier font, left-aligned on the display. Punctuation marks were attached to the directly preceding word. Each sentence was preceded by a left-aligned fixation cross that was presented for 800 ms. Gaze direction was sampled at a rate of 500 Hz.
After initial calibration (nine fixation points) and 5 practice trials, subjects were invited to ask clarification questions, and the experiment began. Another calibration check was performed after the practice items and then again after every 35 trials, at which time subjects took a self-paced break (the final set had only 30 trials, for a total of 205 trials over six sets). Additionally, drift correction on a single centrally located fixation point was performed before the start of each trial. Responses were recorded using a mouse (center button to continue after finishing a sentence; right and left buttons to respond yes or no, respectively, to comprehension questions). The entire experiment, including instructions and calibration, took approximately 60 min to complete. The order of sentence presentation was randomized throughout.
First-fixation time: duration of only the first fixation on the current word.
First-pass time (also known as gaze duration): summed duration of all fixations on the current word before the first fixation on any other word.
Right-bounded time: summed duration of all fixations on the current word before the first fixation on a word further to the right.
Go-past time (also known as regression-path time): summed duration of all fixations from the first fixation on the current word up to (but not including) the first fixation on a word further to the right. Note that this often includes fixations on words to the left of the current word.
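The four measures defined above can be made concrete with a short sketch that computes them from one trial's fixation sequence. The input format (a chronological list of word index and duration pairs) and the function name are assumptions for illustration, not the format of the supplementary data files. The sketch follows the definitions literally and does not treat words that were first skipped and only fixated later in any special way.

```python
def reading_time_measures(fixations, word):
    """Compute four reading time measures for one word in one trial.

    fixations: chronological list of (word_index, duration_ms) tuples
               (a hypothetical format chosen for this sketch).
    word:      index of the current word.
    Returns None if the word was never fixated.
    """
    # Locate the first fixation on the current word.
    start = next((i for i, (w, _) in enumerate(fixations) if w == word), None)
    if start is None:
        return None

    first_fixation = fixations[start][1]

    # First-pass: fixations on the word until any other word is fixated.
    first_pass = 0
    for w, dur in fixations[start:]:
        if w != word:
            break
        first_pass += dur

    # Right-bounded and go-past: both stop at the first fixation on a word
    # further to the right; go-past also counts fixations on earlier words.
    right_bounded = 0
    go_past = 0
    for w, dur in fixations[start:]:
        if w > word:
            break
        go_past += dur
        if w == word:
            right_bounded += dur

    return {
        "first_fixation": first_fixation,
        "first_pass": first_pass,
        "right_bounded": right_bounded,
        "go_past": go_past,
    }


# Toy trial: the reader regresses from word 2 to word 1 and then returns.
trial = [(0, 200), (1, 180), (2, 220), (2, 100), (1, 150), (2, 120), (3, 210)]
print(reading_time_measures(trial, 2))
```

In the toy trial, the regression from word 2 back to word 1 adds to the go-past time of word 2 but not to its right-bounded time, which is the difference between the last two measures.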
Results and discussion
Predictions by different broad-coverage statistical language models were evaluated on the native speakers’ self-paced reading (Fernandez Monsalve et al., 2012; Frank, in press) and eye-tracking (Frank & Thompson, 2012) data (Footnote 2). All three studies showed that a word takes longer to read if the models predict that it will lead to increased cognitive-processing load. This supports the validity of the models and also of the data.
Here, we focus on descriptive statistics of the reading time data and on comparing the two experimental paradigms. Words attached to punctuation (including all sentence-final words) and nonfixated words were not included in the analysis (the percentage of nonfixated words was 34.0% overall but varied widely among subjects, from 5.5% to 60.4%). Also, three sentences that contained a typo in the self-paced reading study were removed, leaving 274,893 and 44,371 data points from self-paced reading and eye tracking, respectively.
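The exclusion criteria in this paragraph amount to a simple filter over word-level data points, sketched below. The record layout (keys `sentence`, `word`, and `rt`), the use of `None` to mark a nonfixated word, and the function names are assumptions for illustration and do not reflect the actual layout of the data files.

```python
# Hypothetical record layout: one dict per subject-by-word data point.
# Assumption: a nonfixated word has rt set to None.
PUNCTUATION = (",", ".")


def is_analyzable(record, excluded_sentences):
    """Apply the three exclusion criteria to a single data point."""
    # Sentences removed from the analysis (e.g., those containing a typo).
    if record["sentence"] in excluded_sentences:
        return False
    # Words attached to punctuation, which covers sentence-final words.
    if record["word"].endswith(PUNCTUATION):
        return False
    # Nonfixated words carry no reading time.
    if record["rt"] is None:
        return False
    return True


def filter_records(records, excluded_sentences):
    """Keep only the data points that enter the analysis."""
    return [r for r in records if is_analyzable(r, excluded_sentences)]


records = [
    {"sentence": 1, "word": "the", "rt": 210},
    {"sentence": 1, "word": "dog,", "rt": 250},
    {"sentence": 1, "word": "barked", "rt": None},
    {"sentence": 1, "word": "loudly.", "rt": 300},
    {"sentence": 2, "word": "cats", "rt": 230},
]
print(filter_records(records, excluded_sentences={2}))
```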
Reading time distributions
Comparison between paradigms
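[Table 1: Spearman’s rank correlation between eye-tracking reading times on words at position t and self-paced reading times on words at t − 1, t, and t + 1. Rows: self-paced reading time at t − 1, t, and t + 1; columns: eye-tracking reading time (RT) measures at word t. Cell values not recoverable.]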
The correlations between eye tracking and self-paced reading times are much larger when looking at self-paced times on the current or next word (second and third rows of Table 1) than on the previous word (first row). This means that the keypresses in the self-paced reading study lag behind eye fixations, whereas the reverse is less evident. Apparently, there is a stronger spillover effect in self-paced reading than in eye tracking, as is consistently found when comparing the two paradigms (e.g., Just, Carpenter, & Woolley, 1982; Witzel, Witzel, & Forster, 2012). In addition, the positive correlation between fixation durations at t and self-paced reading times at t + 1 could be due to parafoveal preview in the eye-tracking study. In that case, processing difficulty caused by the upcoming word is already apparent while the eye is still fixated on the current word. However, whether such parafoveal-on-foveal effects indeed occur is controversial (for a review, see Schotter, Angele, & Rayner, 2012).
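The lagged comparison discussed here can be sketched as follows. This is an illustration under stated assumptions rather than the analysis actually performed: it assumes one reading time per word for each paradigm (for example, averaged over subjects), stored as one list per sentence with `None` for excluded words, and it implements Spearman's coefficient directly as the Pearson correlation of tie-averaged ranks. All function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def lagged_spearman(et, spr, lag):
    """Correlate eye-tracking time at word t with self-paced time at t + lag.

    et, spr: lists of per-sentence lists of reading times (None = excluded).
    lag:     -1, 0, or +1 for the previous, current, or next word.
    """
    pairs = []
    for et_sent, spr_sent in zip(et, spr):
        for t, et_rt in enumerate(et_sent):
            u = t + lag
            if et_rt is None or not 0 <= u < len(spr_sent):
                continue
            if spr_sent[u] is not None:
                pairs.append((et_rt, spr_sent[u]))
    x, y = zip(*pairs)
    return spearman(list(x), list(y))


# Toy data in which self-paced times track eye-tracking times one word later.
et = [[200, 250, 300, 220, 280]]
spr = [[410, 300, 350, 400, 320]]
for lag in (-1, 0, 1):
    print(lag, round(lagged_spearman(et, spr, lag), 3))
```

In the toy data the correlation is perfect at lag +1 by construction, mimicking a self-paced response that lags one word behind the eye.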
The data we have presented can further the field of computational psycholinguistics by allowing for a comparison of different models’ psychological adequacy on one and the same set of data. As was argued in the Introduction, these data may be more appropriate for such evaluation than the reading times over narrative or expository texts that have been used so far. Also, our sample of sentences is more representative of the language than are sentences that are constructed to serve as experimental stimuli.
Although we have only discussed reading times here, it is also possible to look at other aspects of the eye-tracking data—in particular, regression probabilities (the supplementary material includes fixation-level data). Moreover, a sophisticated cognitive model of sentence comprehension and question answering may even be validated against the error rates for comprehension questions; that is, a question that leads to more errors by humans should also be more difficult to answer for the model.
The present data set can easily be extended by increasing the number of subjects, but more interesting future additions could include sentences from different source types (e.g., expository texts), additional data types (e.g., EEG), and material in other languages.
Footnote 1: As an additional constraint, at least one of the words had to be one for which Andrews et al. (2009) provided human-generated semantic features. This was to allow for investigating possible within-sentence semantic priming effects.
Footnote 2: Frank and Thompson (2012) included only the 17 monolingual subjects who were tested at the time.
The research presented here was funded by the European Union Seventh Framework Programme (FP7/2007-2013) under Grant 253803 awarded to the first author, and by a grant from the Economic and Social Research Council of Great Britain (RES-620-28-6001) awarded to the Deafness Cognition and Language Research Centre.
- Boston, M. F., Hale, J., Patil, U., Kliegl, R., & Vasishth, S. (2008). Parsing costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence Corpus. Journal of Eye Movement Research, 2, 1–12.
- Fernandez Monsalve, I., Frank, S. L., & Vigliocco, G. (2012). Lexical surprisal as a general predictor of reading time. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (pp. 398–408). Avignon, France: Association for Computational Linguistics.
- Fossum, V., & Levy, R. (2012). Sequential vs. hierarchical syntactic models of human incremental sentence processing. In Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2012) (pp. 61–69). Montréal, Canada: Association for Computational Linguistics.
- Frank, S. L. (in press). Uncertainty reduction as a measure of cognitive processing load in sentence comprehension. Topics in Cognitive Science.
- Frank, S. L., & Thompson, R. L. (2012). Early effects of word surprisal on pupil size during reading. In N. Miyake, D. Peebles, & R. P. Cooper (Eds.), Proceedings of the 34th Annual Conference of the Cognitive Science Society (pp. 1554–1559). Austin: Cognitive Science Society.
- Mitchell, J., Lapata, M., Demberg, V., & Keller, F. (2010). Syntactic and semantic factors in processing difficulty: An integrated measure. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 196–206). Uppsala, Sweden: Association for Computational Linguistics.
- Roark, B., Bachrach, A., Cardenas, C., & Pallier, C. (2009). Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (pp. 324–333). Association for Computational Linguistics.
- Santorini, B. (1991). Part-of-speech tagging guidelines for the Penn Treebank project. Philadelphia, PA: University of Pennsylvania.
- Tsuruoka, Y., & Tsujii, J. (2005). Bidirectional inference with the easiest-first strategy for tagging sequence data. In Proceedings of the conference on human language technology and empirical methods in natural language processing (pp. 467–474). Morristown, NJ: Association for Computational Linguistics.
- Wu, S., Bachrach, A., Cardenas, C., & Schuler, W. (2010). Complexity metrics in an incremental right-corner parser. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 1189–1198). Uppsala, Sweden: Association for Computational Linguistics.