Behavior Research Methods, Volume 45, Issue 4, pp 1182–1190

Reading time data for evaluating broad-coverage models of English sentence processing

  • Stefan L. Frank
  • Irene Fernandez Monsalve
  • Robin L. Thompson
  • Gabriella Vigliocco


We make available word-by-word self-paced reading times and eye-tracking data over a sample of English sentences from narrative sources. These data are intended to form a gold standard for the evaluation of computational psycholinguistic models of sentence comprehension in English. We describe stimuli selection and data collection and present descriptive statistics, as well as comparisons between the two sets of reading times.


Keywords: Word-reading time, Self-paced reading, Eye tracking, Sentence comprehension, Model evaluation


In recent years, models from the field of computational linguistics have increasingly been used for explaining psycholinguistic data, and conversely, psycholinguistic data have increasingly been used to evaluate computational models. In contrast to typical psychological models, the algorithms developed by computational linguists are rarely intended to explain specific phenomena. Rather, they have broad coverage, being able to handle sentences in natural language. As such, it has become common practice to evaluate these models by comparing their word-level predictions with human word-reading times collected over general texts. The data set most often used in this context is the Dundee corpus (Kennedy & Pynte, 2005), comprising eye-tracking data of 10 subjects reading newspaper editorials in English. These data have been used for model evaluation by, among others, Demberg and Keller (2008); Fossum and Levy (2012); Frank and Bod (2011), and Mitchell, Lapata, Demberg, and Keller (2010). Alternatively, model predictions have also been compared with self-paced reading data over English narrative texts (Roark, Bachrach, Cardenas, & Pallier, 2009; Wu, Bachrach, Cardenas, & Schuler, 2010).

Two potential problems arise when using newspaper or narrative texts for model evaluation. First, the sentence-processing models invariably treat sentences as independent entities, whereas the interpretation of a sentence in text depends on the previous sentences. Second, and perhaps more important, understanding newspaper or narrative texts requires vast amounts of extra-linguistic knowledge to which the models have no access. For these reasons, a more appropriate data set for model evaluation would consist of independent sentences that can be understood out of context. One such data set is the Potsdam Corpus of 144 German sentences with eye-tracking data of 222 subjects (Kliegl, Nuthmann, & Engbert, 2006), used for model evaluation by Boston, Hale, Patil, Kliegl, and Vasishth (2008) and Boston, Hale, Vasishth, and Kliegl (2011). However, the Potsdam Corpus sentences are unlikely to form a representative sample of the language, because they were manually constructed as experimental stimuli.

Here, we present a collection of reading time data that is intended to serve as a gold standard for evaluating computational psycholinguistic models of English sentence comprehension. The word-reading times were obtained using two different paradigms: self-paced reading and eye tracking. These data were collected over independent English sentences that (unlike those of the Potsdam Corpus) were not constructed for experimental purposes. Instead, they were selected from different narrative sources and, as such, can be considered a “sample” of written narrative English, comprising a wide range of structures.

The following sections discuss the selection of stimulus material, provide details of how the reading times were collected, and present descriptive statistics of the data, as well as results of comparisons between the two paradigms. All materials and data are available as online supplementary material. Appendix 1 lists the available data files and describes their content.


Stimuli selection

On the Web site, aspiring authors can upload their (otherwise unpublished) work. We selected three novels that were categorized under different genres, all of which used British English spelling. These were Aercu by Elisabeth Kershaw, The Unlikely Hero by Jocelyn Shanks, and Hamsters! (or: What I Did On My Holidays by Emily Murray) by Daniel Derrett.

Next, a list of 7,754 words was constructed by merging the list of high-frequency content words used in Andrews, Vigliocco, and Vinson (2009) with the 200 most frequent English words (mostly function words). The list included two punctuation marks: the comma and the period. From the three novels, all sentences were selected that contained only words from the word list, were at least five words long, and included at least two content words.1 Of the resulting sentences, we hand-picked the 361 that could be interpreted out of context with only minimal involvement of extra-linguistic knowledge. The average sentence length was 13.7 words (SD, 6.36; median, 12; maximum, 38). For 166 of the sentences, yes/no comprehension questions were constructed.
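The three automatic selection criteria described above (before hand-picking) can be sketched as a simple filter. This is an illustration only, not the authors' script: the tokenizer, the function names, and the toy word lists are assumptions, as is the choice to count only words (not punctuation marks) toward sentence length.

```python
import re

# The two punctuation marks that the word list is said to include.
PUNCTUATION = {",", "."}

def tokenize(sentence):
    # Hypothetical tokenizer: lowercased words (letters and apostrophes),
    # plus every other non-space symbol as its own token.
    return re.findall(r"[a-z']+|[^a-z'\s]", sentence.lower())

def is_candidate(sentence, word_list, content_words, min_words=5, min_content=2):
    """Apply the three automatic criteria to one sentence."""
    tokens = tokenize(sentence)
    words = [t for t in tokens if t not in PUNCTUATION]
    # Criterion 1: every token (word or punctuation mark) is on the word list.
    if any(t not in word_list for t in tokens):
        return False
    # Criterion 2: at least five words (assumption: punctuation is not counted).
    if len(words) < min_words:
        return False
    # Criterion 3: at least two content words.
    return sum(w in content_words for w in words) >= min_content

# Toy word lists, for illustration only.
content_words = {"dog", "chased", "cat", "garden"}
word_list = content_words | {"the", "a", "in", ",", "."}

print(is_candidate("The dog chased the cat in the garden.", word_list, content_words))
print(is_candidate("The dog barked.", word_list, content_words))
```

The additional semantic-feature constraint mentioned in the footnote and the manual selection step are not modeled here.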

Obvious typos and grammatical errors were fixed. To prevent subjects from connecting the sentence stimuli into a story, proper names were replaced by same-gender names (from the high-frequency word list) such that no name appeared more than twice across the stimuli. A selection of the resulting sentences is presented in Appendix 2.

Fernandez Monsalve, Frank, and Vigliocco (2012) annotated the sentences with part-of-speech tags, that is, labels indicating the words’ syntactic categories. Labels were first generated automatically by a part-of-speech tagging algorithm (Tsuruoka & Tsujii, 2005), after which they were manually corrected to comply with the Penn Treebank tagging guidelines (Santorini, 1991). This annotation is also available as online supplementary material but is not relevant to the collection of reading time data.

Self-paced reading


One hundred seventeen first-year students (92 females, 70 native English speakers, mean age = 18.9 years) of psychology at University College London participated in the self-paced reading study as part of their undergraduate degree.


The study was preceded by an unrelated lexical decision experiment that took approximately 10 min, after which followed six self-paced reading practice trials. Sentence stimuli were repeatedly selected at random from the 361 experimental sentences until 40 min had elapsed (including time spent on the lexical decision and practice trials). Each sentence was preceded by a fixation cross, presented centrally on a computer monitor. As soon as the subject pressed the space bar, the fixation cross was replaced by the sentence’s first word in 40-point Courier New font. At each subsequent keypress, the current word was replaced by the next, always at the same central location on the display. The time between word presentation and keypress was recorded as the reading time on that word. Punctuation marks were presented together with the directly preceding word. After completion of a sentence, the corresponding comprehension question (if any) appeared, and subjects responded yes or no by pressing the J or F key, respectively.

Eye tracking

Materials and subjects

Of the original 361 sentence stimuli, the 205 that fit on a single line of the display were used in the eye-tracking study. These selected sentences are listed in Appendix 2.

Forty-eight subjects, recruited from the University College London subject pool, took part, for which they were paid £7. Four subjects were excluded due to technical issues, and 1 because he had already taken part in the self-paced reading study, leaving 43 subjects (27 females, 37 native English speakers, mean age = 25.8 years) with analyzed data.


Subjects were seated 50 cm from the monitor with their chin on a chinrest. Both eyes were tracked using a head-mounted eyetracker (SR Research, EyeLink II). Individual sentences were presented in 18-point Courier font, left-aligned on the display. Punctuation marks were attached to the directly preceding word. Each sentence was preceded by a left-aligned fixation cross that was presented for 800 ms. Gaze direction was sampled at a rate of 500 Hz.

After initial calibration (nine fixation points) and 5 practice trials, subjects were invited to ask clarification questions, and the experiment began. Another calibration check was performed after the practice items and then again after every 35 trials, at which time subjects took a self-paced break (the final set had only 30 trials, for a total of 205 trials over six sets). Additionally, drift correction on a single centrally located fixation point was performed before the start of each trial. Responses were recorded using a mouse (center button to continue after finishing a sentence; right and left buttons to respond yes or no, respectively, to comprehension questions). The entire experiment, including instructions and calibration, took approximately 60 min to complete. The order of sentence presentation was randomized throughout.
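The division of trials into sets follows directly from the numbers given: 205 sentences, with a calibration check and break after every 35 trials. A minimal sketch of the arithmetic (the function name is hypothetical):

```python
def split_into_sets(n_trials, set_size):
    """Sizes of consecutive trial sets when a break follows every `set_size` trials."""
    return [min(set_size, n_trials - start) for start in range(0, n_trials, set_size)]

set_sizes = split_into_sets(205, 35)
print(set_sizes)       # [35, 35, 35, 35, 35, 30]
print(len(set_sizes))  # 6
```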

Word-reading times

A word was considered as not fixated if the first fixation on that word occurred after any fixation on a word further to the right. Four measures of reading time were extracted from the eye-tracking data:
  • First-fixation time: duration of only the first fixation on the current word.

  • First-pass time (also known as gaze duration): summed duration of all fixations on the current word before the first fixation on any other word.

  • Right-bounded time: summed duration of all fixations on the current word before the first fixation on a word further to the right.

  • Go-past time (also known as regression-path time): summed duration of all fixations from the first fixation on the current word up to (but not including) the first fixation on a word further to the right. Note that this often includes fixations on words to the left of the current word.
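The four measures, together with the criterion for treating a word as not fixated, can be sketched as follows. The input format is an assumption made for illustration: one trial is a time-ordered list of (word index, fixation duration in ms) pairs, with words indexed from left to right. The supplementary fixation-level files may be organized differently.

```python
def reading_time_measures(fixations, n_words):
    """Compute the four reading time measures for each word of one sentence.

    `fixations` is a time-ordered list of (word_index, duration_ms) pairs.
    Returns a dict mapping word index to a dict of measures, or to None
    when the word counts as not fixated.
    """
    measures = {}
    for word in range(n_words):
        # Position of the first fixation on this word, if any.
        first = next((i for i, (w, _) in enumerate(fixations) if w == word), None)
        # Not fixated: never fixated, or first fixated only after a fixation
        # on a word further to the right.
        if first is None or any(w > word for w, _ in fixations[:first]):
            measures[word] = None
            continue

        first_fixation = fixations[first][1]

        # First-pass: fixations on the word before any other word is fixated.
        first_pass = 0
        for w, d in fixations[first:]:
            if w != word:
                break
            first_pass += d

        # Right-bounded and go-past: both stop at the first fixation on a word
        # further to the right; go-past also counts the regressive fixations.
        right_bounded = go_past = 0
        for w, d in fixations[first:]:
            if w > word:
                break
            go_past += d
            if w == word:
                right_bounded += d

        measures[word] = {
            "first_fixation": first_fixation,
            "first_pass": first_pass,
            "right_bounded": right_bounded,
            "go_past": go_past,
        }
    return measures

# Made-up trial: word 1 is refixated after a regression to word 0,
# and word 2 is first fixated only after word 3.
trial = [(0, 200), (1, 180), (1, 120), (0, 150), (1, 100), (3, 220), (2, 160)]
for word, m in reading_time_measures(trial, 4).items():
    print(word, m)
```

In this made-up trial, word 1 has a first-pass time of 300 ms, a right-bounded time of 400 ms, and a go-past time of 550 ms, which shows how the three summed measures diverge once a regression occurs.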

Results and discussion

Error rates

The percentage of response errors to comprehension questions was significantly larger in the self-paced reading study, as compared with eye tracking (15.8% and 11.5%, respectively), z = 6.91, p < .0001. As is apparent from the histograms in Fig. 1, this difference was mainly caused by a number of self-paced reading subjects who scored very badly, which we ascribe to a lack of motivation due to their compulsory participation. From here on, we discard data by subjects who had an error rate above 25%, which leaves 104 self-paced reading and 42 eye-tracking subjects and decreases the respective average error rates to 13.0% and 11.2%. Although the difference in error rate between the two paradigms has now been reduced, it remains statistically significant, z = 2.99, p < .003.
Fig. 1

Distribution of subjects over error rates
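The subject exclusion rule and a comparison of error rates between paradigms can be sketched as below. The text does not state which test produced the reported z values; the pooled two-proportion z-test used here is an assumption, and the subject labels and counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def exclude_high_error(error_rates, threshold=0.25):
    """Keep subjects whose error rate is not above the threshold."""
    return {subject: rate for subject, rate in error_rates.items() if rate <= threshold}

def two_proportion_z(errors_a, n_a, errors_b, n_b):
    """Pooled two-proportion z-test (assumed test, not confirmed by the text).

    Returns the z statistic and a two-sided p-value.
    """
    p_a, p_b = errors_a / n_a, errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical subjects and response counts.
kept = exclude_high_error({"s1": 0.10, "s2": 0.25, "s3": 0.40})
print(sorted(kept))
z, p = two_proportion_z(errors_a=160, n_a=1000, errors_b=115, n_b=1000)
print(round(z, 2), p < 0.05)
```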

Reading times

Predictions by different broad-coverage statistical language models were evaluated on the native speakers’ self-paced reading (Fernandez Monsalve et al., 2012; Frank, in press) and eye tracking (Frank & Thompson, 2012) data.2 All three studies showed that a word takes longer to read if the models predict that it will lead to increased cognitive-processing load. This establishes the validity of the models, but also of the data.

Here, we focus on descriptive statistics of the reading time data and on comparing the two experimental paradigms. Words attached to punctuation (including all sentence-final words) and nonfixated words were not included in the analysis (the percentage of nonfixated words was 34.0% overall but varied widely among subjects, from 5.5% to 60.4%). Also, three sentences that contained a typo in the self-paced reading study were removed, leaving 274,893 and 44,371 data points from self-paced reading and eye tracking, respectively.

Reading time distributions

The distribution of each set of reading time data (aggregated over all subjects and after log-transformation) can be seen in the boxplots of Fig. 2. It is clear that the distributions have “fat tails”; that is, they are not normally distributed but have excess kurtosis. This is also evident from the quantile–quantile plots of Fig. 3, where log-transformed reading times are standardized per subject. None of the distributions are normal, as is confirmed by Lilliefors tests for normality (Lilliefors, 1967), the test statistics of which are also plotted in the figure (all ps < .001).
Fig. 2

Boxplots of log-transformed reading time data. Boxes denote the lower quartile, median, and upper quartile; whiskers extend to 1.5 times the interquartile range. Note that labels on the vertical axis are nontransformed reading times

Fig. 3

Quantile–quantile plots of standardized log-transformed reading times against a standard normal distribution and L-statistics of Lilliefors tests of normality (larger L corresponds to stronger deviance from a normal distribution)
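The per-subject standardization of log-transformed reading times and the L statistic can be sketched with the standard library alone. The L statistic is the largest distance between the empirical distribution function and a normal distribution whose mean and standard deviation are estimated from the sample. Using the sample standard deviation is an assumption, p-values (which require the Lilliefors tables or simulation) are not computed, and the reading times below are made up.

```python
from math import log
from statistics import NormalDist, mean, stdev

def standardize_log_rts(rts_by_subject):
    """Log-transform reading times and z-score them within each subject."""
    standardized = []
    for rts in rts_by_subject.values():
        logs = [log(rt) for rt in rts]
        m, s = mean(logs), stdev(logs)
        standardized.extend((x - m) / s for x in logs)
    return standardized

def lilliefors_statistic(values):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    largest = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        largest = max(largest, (i + 1) / n - cdf, cdf - i / n)
    return largest

# Made-up reading times in ms for two hypothetical subjects.
rts_by_subject = {
    "s1": [250, 300, 410, 280, 900],
    "s2": [400, 520, 380, 1500, 450],
}
z_scores = standardize_log_rts(rts_by_subject)
print(round(lilliefors_statistic(z_scores), 3))
```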

Comparison between paradigms

Table 1 shows the Spearman rank correlations between reading times (averaged over subjects) from the self-paced reading and eye-tracking studies. In order to detect possible spillover and parafoveal preview effects, correlations are also presented for “misaligned” reading times; that is, reading times from eye tracking are correlated to the self-paced reading time of the previous or next word.
Table 1

Spearman’s rank correlation between eye-tracking reading times on words at position t and self-paced reading times of words at t − 1, t, and t + 1

Columns (eye-tracking RT at word t): First fix, First pass, Right bound, Go past
Rows (self-paced RT at word): t − 1, t, t + 1

*p < .05

**p < .0001

The correlations between eye tracking and self-paced reading times are much larger when looking at self-paced times on the current or next word (second and third rows of Table 1) than on the previous word (first row). This means that the keypresses in the self-paced reading study lag behind eye fixations, whereas the reverse is less evident. Apparently, there is a stronger spillover effect in self-paced reading than in eye tracking, as is consistently found when comparing the two paradigms (e.g., Just, Carpenter, & Woolley, 1982; Witzel, Witzel, & Forster, 2012). In addition, the positive correlation between fixation durations at t and self-paced reading times at t + 1 could be due to parafoveal preview in the eye-tracking study. In that case, processing difficulty caused by the upcoming word is already apparent while the eye is still fixated on the current word. However, whether such parafoveal-on-foveal effects indeed occur is controversial (for a review, see Schotter, Angele, & Rayner, 2012).
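The aligned and misaligned correlations can be sketched as follows: eye-tracking times at word t are paired, within each sentence, with self-paced times at word t + offset, and Spearman's coefficient is computed as the Pearson correlation of tie-averaged ranks. The data layout (one list of subject-averaged reading times per sentence) and all names are assumptions. The exclusion of words attached to punctuation and of nonfixated words is not modeled.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def misaligned_spearman(et_sentences, spr_sentences, offset):
    """Correlate eye-tracking RT at word t with self-paced RT at word t + offset."""
    et, spr = [], []
    for et_rts, spr_rts in zip(et_sentences, spr_sentences):
        for t, rt in enumerate(et_rts):
            if 0 <= t + offset < len(spr_rts):
                et.append(rt)
                spr.append(spr_rts[t + offset])
    return spearman(et, spr)

# Made-up sentence in which the self-paced time on each word tracks the
# eye-tracking time on the word before it (a pure spillover pattern).
et_sentences = [[200, 300, 250, 400, 220]]
spr_sentences = [[330, 300, 400, 350, 500]]
for offset in (-1, 0, 1):
    print(offset, round(misaligned_spearman(et_sentences, spr_sentences, offset), 2))
```

Pairing within sentences keeps a word from being matched with a word of a neighboring sentence, so the first or last word of each sentence drops out of the misaligned comparisons.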


The data we have presented can further the field of computational psycholinguistics by allowing for a comparison among different models’ psychological adequacy on one and the same set of data. As was argued in the Introduction, these data may be more appropriate for such evaluation than the reading times over narrative or expository texts that have been used so far. Also, our sample of sentences is more representative of the language than are sentences that are constructed to serve as experimental stimuli.

Although we have only discussed reading times here, it is also possible to look at other aspects of the eye-tracking data—in particular, regression probabilities (the supplementary material includes fixation-level data). Moreover, a sophisticated cognitive model of sentence comprehension and question answering may even be validated against the error rates for comprehension questions; that is, a question that leads to more errors by humans should also be more difficult to answer for the model.

The present data set can easily be extended by increasing the number of subjects, but more interesting future additions could include sentences from different source types (e.g., expository texts), additional data types (e.g., EEG), and material in other languages.


  1. As an additional constraint, at least one of the words had to be one for which Andrews et al. (2009) provided human-generated semantic features. This was to allow for investigating possible within-sentence semantic priming effects.

  2. Frank and Thompson (2012) included only the 17 monolingual subjects who were tested at the time.


Author note

The research presented here was funded by the European Union Seventh Framework Programme (FP7/2007-2013) under Grant 253803 awarded to the first author, and by a grant from the Economic and Social Research Council of Great Britain (RES-620-28-6001) awarded to the Deafness Cognition and Language Research Centre.



  1. Andrews, M., Vigliocco, G., & Vinson, D. (2009). Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116, 463–498.
  2. Boston, M. F., Hale, J., Patil, U., Kliegl, R., & Vasishth, S. (2008). Parsing costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence Corpus. Journal of Eye Movement Research, 2, 1–12.
  3. Boston, M. F., Hale, J. T., Vasishth, S., & Kliegl, R. (2011). Parallel processing and sentence comprehension difficulty. Language & Cognitive Processes, 26, 301–349.
  4. Demberg, V., & Keller, F. (2008). Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109, 193–210.
  5. Fernandez Monsalve, I., Frank, S. L., & Vigliocco, G. (2012). Lexical surprisal as a general predictor of reading time. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (pp. 398–408). Avignon, France: Association for Computational Linguistics.
  6. Fossum, V., & Levy, R. (2012). Sequential vs. hierarchical syntactic models of human incremental sentence processing. In Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2012) (pp. 61–69). Montréal, Canada: Association for Computational Linguistics.
  7. Frank, S. L., & Bod, R. (2011). Insensitivity of the human sentence-processing system to hierarchical structure. Psychological Science, 22, 829–834.
  8. Frank, S. L. (in press). Uncertainty reduction as a measure of cognitive processing load in sentence comprehension. Topics in Cognitive Science.
  9. Frank, S. L., & Thompson, R. L. (2012). Early effects of word surprisal on pupil size during reading. In N. Miyake, D. Peebles, & R. P. Cooper (Eds.), Proceedings of the 34th Annual Conference of the Cognitive Science Society (pp. 1554–1559). Austin: Cognitive Science Society.
  10. Just, M. A., Carpenter, P. A., & Woolley, J. D. (1982). Paradigms and processes in reading comprehension. Journal of Experimental Psychology: General, 111, 228–238.
  11. Kennedy, A., & Pynte, J. (2005). Parafoveal-on-foveal effects in normal reading. Vision Research, 45, 153–168.
  12. Kliegl, R., Nuthmann, A., & Engbert, R. (2006). Tracking the mind during reading: The influence of past, present, and future words on fixation durations. Journal of Experimental Psychology: General, 135, 12–35.
  13. Lilliefors, H. W. (1967). On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62, 399–402.
  14. Mitchell, J., Lapata, M., Demberg, V., & Keller, F. (2010). Syntactic and semantic factors in processing difficulty: An integrated measure. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 196–206). Uppsala, Sweden: Association for Computational Linguistics.
  15. Roark, B., Bachrach, A., Cardenas, C., & Pallier, C. (2009). Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (pp. 324–333). Association for Computational Linguistics.
  16. Santorini, B. (1991). Part-of-speech tagging guidelines for the Penn Treebank project. Philadelphia, PA: University of Pennsylvania.
  17. Schotter, E. R., Angele, B., & Rayner, K. (2012). Parafoveal processing in reading. Attention, Perception, & Psychophysics, 74, 5–35.
  18. Tsuruoka, Y., & Tsujii, J. (2005). Bidirectional inference with the easiest-first strategy for tagging sequence data. In Proceedings of the conference on human language technology and empirical methods in natural language processing (pp. 467–474). Morristown, NJ: Association for Computational Linguistics.
  19. Witzel, N., Witzel, J., & Forster, K. (2012). Comparisons on online reading paradigms: Eye tracking, moving-window, and maze. Journal of Psycholinguistic Research, 41, 105–128.
  20. Wu, S., Bachrach, A., Cardenas, C., & Schuler, W. (2010). Complexity metrics in an incremental right-corner parser. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 1189–1198). Uppsala, Sweden: Association for Computational Linguistics.

Copyright information

© Psychonomic Society, Inc. 2013

Authors and Affiliations

  • Stefan L. Frank (1)
  • Irene Fernandez Monsalve (1)
  • Robin L. Thompson (1, 2)
  • Gabriella Vigliocco (1)

  1. Department of Cognitive, Perceptual and Brain Sciences, University College London, London, UK
  2. Deafness, Cognition and Language Research Centre, London, UK
