Background & Summary

Skilled readers move their eyes rapidly through text, approximately four to five times per second, and can achieve a reading speed of approximately 250 words per minute1,2. When and where the eyes move are influenced by cognitive processes during reading; thus, eye movements provide rich information for studying the underlying cognitive mechanisms of reading3,4. Eye movements have been used extensively to study the cognitive mechanisms of alphabetic reading, particularly in English. A growing number of studies have used eye-tracking techniques to study Chinese reading in the last two decades. These studies have found many similarities between Chinese and alphabetic reading. For example, the fixation time and fixated probability on Chinese and alphabetic words are modulated by word frequency and word length3,5. Additionally, the script-specific mechanisms of Chinese reading, such as how Chinese readers segment words and program their eye movements without the aid of inter-word spaces, have been studied extensively6,7.

Traditional factor-designed experiments have been fruitful in revealing cognitive mechanisms in Chinese reading. However, a large-scale eye movement database can provide valuable information not available in small-scale experimental studies. Multiple complex variables affect eye movements during reading and it is challenging to manipulate or control all of them simultaneously in controlled experiments. It is also often questioned whether conclusions based on dozens of words or sentences can be generalized to unexamined linguistic materials8. A large-scale eye-movement database can overcome these problems, allowing researchers to simultaneously examine the effects of multiple factors on reading behaviors and ensure the generalizability of the conclusions. Furthermore, researchers can generate and examine new hypotheses using big data, making data usage wider than the original experiments.

Several eye-tracking databases of alphabetic reading have been established, such as the Potsdam corpus9,10, the Provo corpus11, and the Ghent Eye Movement Corpus (GECO)12. These databases have been used in many aspects of reading research, such as examining the impacts of linguistic and other variables on text reading9,10, improving the computational models for alphabetic text reading13,14, and investigating the relationship between first- and second-language reading15,16,17. Recently, corpus analysis has also been used to investigate the mechanisms of Chinese reading5,18,19. However, the existing eye-tracking databases of Chinese reading are relatively small. A larger database is strongly needed, which can be used to investigate the complex cognitive mechanisms underlying Chinese reading and can be more easily compared with eye-tracking databases of alphabetic reading to reveal the similarity and difference between Chinese and alphabetic reading20.

Here we report a sizeable eye-tracking database, the Chinese Eye-Movement Database, which summarizes nine eye-movement measures for over 8,000 different Chinese words. Our database was based on data collected from 57 eye-movement experiments using a sentence-reading task and totally 1,718 participants. Figure 1 presents a schematic of the procedure used to construct the database.

Fig. 1
figure 1

Schematic visualization of word segmentation and measure calculation. Note. Panel (a) shows the procedure of word segmentation for one sentence. Panel (b) shows an example of the procedure of calculating an eye-movement measure (i.e., FFD) on a word (e.g., “沙漠” meaning “desert” in English).

Methods

Data acquisition

Data were obtained from 1,718 participants across 57 experiments. All experiments were approved by and performed in accordance with guidelines and regulations of the Institutional Ethics Committee at the Institute of Psychology of the Chinese Academy of Sciences. All the participants were college students and native Chinese speakers with normal or corrected-to-normal vision. Each participant read and signed the informed consent form before the experiment. In all experiments, native Chinese readers silently read sentences naturally for comprehension, with no special experimental paradigm (e.g., the moving window paradigm or gaze-contingent boundary paradigm) adopted. The eye-tracker was calibrated for each participant during each experiment before the task. The materials were presented on a 21-inch CRT monitor (Sony G520; resolution: 1,024 × 768 pixels) connected to a Dell PC. Participants viewed the stimuli approximately 58 cm away from the monitor. They placed their chin on a chin rest to minimize head movements and read sentences binocularly while only their right eyes were monitored. Eye movements were recorded using an EyeLink 1000 eye-tracking system with a sampling rate of 1,000 Hz.

The materials used in all experiments included 8,015 different natural Chinese sentences. Sentences shorter than 15 characters were excluded. After this, 7,577 sentences remained, with each containing 15–35 characters (mean 22.48). The sentences were all of a high semantic plausibility (i.e., the rating scores were higher than 4.5 on a 7-point scale, where higher scores indicate higher plausibility). This was based on the assessment of the participants who did not participate in the eye-tracking experiments.

Word segmentation

The word segmentation procedure is shown in Fig. 1a. Because there are no explicit markers to demarcate words in Chinese text, we used a package called jiebaR21 in R22 to segment words. Segmentation was performed primarily based on the Lexicon of Common Words in Contemporary Chinese (Draft)23. Words not included in this dictionary were segmented based on the default dictionary in jiebaR21. Afterward, the words were manually checked to correct segmentation errors, particularly in the following three situations. First, overlapping ambiguous strings (OASs) may have been incorrectly segmented. An OAS is a string of characters (e.g., “学生活,” herein referred to as characters A, B, and C, respectively), wherein the middle character can form distinct words with the characters on both its left (e.g., word “学生,” meaning “student” in English) and right (e.g., word “生活,” meaning “life” in English)24,25,26,27,28. In some situations, the software incorrectly segments AB-C as A-BC or segments A-BC as AB-C. Second, the word may have been segmented incorrectly into several words. For example, “马上” (meaning “immediately”) was incorrectly segmented into two one-character words (i.e., “马,” meaning “horse,” and “上,” meaning “up”). In this case, they are adjusted to a single word. Third, phrases may have been treated incorrectly as whole words. For example, a noun–noun phrase, such as “英语文学” (meaning “English literature”) should be segmented into two words, “英语” (meaning “English”) and “文学” (meaning “literature”), which was instead identified as one word.

Pre-processing and calculation of eye-movement measures

The eye-movement data were pre-processed using the EyeDoctor 0.6.5 software developed by UMASS Eye-Tracking Lab. Sentences in which participants made more than three blinks while reading were excluded from the analyses, as were fixations and saccades that contained blinks. Furthermore, fixations longer than 1,000 ms or shorter than 80 ms were excluded.

Eye-movement measures for each word were calculated using the DPEEM package29 in R22. Considering that readers do not always start reading from the first character of a sentence and there are more blinks at the beginning, the first three characters were excluded from the analyses. Moreover, the last three characters in a sentence were excluded from the subsequent analyses to avoid the wrap-up effect30. Words containing any excluded character from the analyses were eliminated. Additionally, the words not listed in the Lexicon of Common Words in Contemporary Chinese (Draft)23 were excluded from the analyses. In total, 8,551 different words were included, including 1,354 one-character words, 6,128 two-character words, 547 three-character words, and 522 four-character words. We calculated nine eye-movement measures for each word. Table 1 presents the definitions and abbreviations of these measures. As shown in Fig. 1b, for each measure of the given word, we first calculated the mean values of each participant. The average mean values and corresponding standard deviations were then calculated across participants. Table 2 shows the descriptive information of the nine measures on words of different length.

Table 1 Definitions and Abbreviations of the Nine Eye-Movement Measures.
Table 2 Mean Value (Standard Deviation) of the Eye-Movement Measures on Words of Different Length.

Data Records

The database is freely available on OSF repository31 under the CC BY 4.0 License. The raw data are provided in the file “Raw Data.txt”, “Sentences.xlsx” and “ROIs.xlsx”.

The descriptive statistics of the eye-movement measures of each of the 8,551 different words are provided in the files named “MainMeasures.xlsx” and “Supplementary Measures.xlsx”). “Main Measures.xlsx” file contains information regarding first fixation duration (FFD), gaze duration (GD), and first-pass reading fixated proportion (FPF), while the “Supplementary Measures.xlsx” file contains information regarding the remaining six measures (for definitions, see Table 1). The following information is available in each file:

  1. 1.

    The column named “words” provides the words for which the eye-movement measures were calculated, e.g., “钱” (meaning “Money” in English).

  2. 2.

    The columns starting with “Mean_” provide the mean values of the eye-movement measures, e.g., the column named “Mean_FFD” provides the mean value of FFD for each word.

  3. 3.

    The columns starting with “SD_” provide the standard deviations (SDs) of the eye-movement measures, e.g., the column named “SD_FFD” provides the SD of FFD for each word.

  4. 4.

    The columns starting with “Numobs_” provide the number of observations of each word on each eye-movement measure, e.g., the column named “Numobs_FFD” provides the number of observations of each word on FFD.

  5. 5.

    The columns started with “Numsub_” provide the number of participants that the eye-movement measures were calculated based on, e.g., the column named “Numsub_FFD” provides the number of participants that the FFDs were calculated based on.

  6. 6.

    The column named “num_sentence” provides the number of sentences that contain each word.

  7. 7.

    The column named “frequency_subtle_based” provides subtitle-based word frequency of the corresponding word32.

Structure of the Raw Data

All raw data are available on the website https://doi.org/10.17605/OSF.IO/94WUE. All sentences and their specific sequence labels (indicated by column named “Sentence_ID”) are available in the file named “Sentence.xlsx”. The file named “Raw Data.txt” contains all raw data. In this file, each row provides information for one fixation observed by a subject during reading. The seven columns provide the following information.

  1. 1.

    The column named “Experiment” shows which experiment the fixation belongs to.

  2. 2.

    The column named “Subject” shows which participant the fixation was observed from.

  3. 3.

    The column named “Sentence_ID” shows which sentence the fixation was observed while reading, which can be used to find the corresponding sentence in “Sentences.xlsx” file.

  4. 4.

    The column named “X_Position” shows the horizontal coordinates of the fixation as measured by characters. The position of the first character of a line is encoded as zero. Fixations that fall outside the scope of sentences are invalid, and their horizontal coordinates are encoded as “−1”. These fixations were not used to calculate eye-movement measures.

  5. 5.

    The column named “Y_Position” shows the vertical coordinates of the fixation as measured by lines of text. Because all sentences were presented within a single line, vertical coordinates of all fixations within the scope of sentences are zero. For fixations out of the scope of sentences, vertical coordinates are encoded as “−1”.

  6. 6.

    The column named “Onset_Time” shows the onset of one fixation (unit: ms).

  7. 7.

    The column named “Offset_Time” shows the offset of one fixation (unit: ms). Fixation duration can be calculated from subtracting “onset” from “offset”.

“ROIs.xlsx” file contains information of words in sentences for each experiment. This information was used in calculating eye-movement measures. The six columns provide the following information.

  1. 1.

    The column named “Experiment” shows which experiment the word belongs to.

  2. 2.

    The column named “Sentence_ID” shows which sentence the word belongs to, which can be used to find the corresponding sentence in “Sentences.xlsx” file.

  3. 3.

    The column named “ROI_Beginning” shows the horizontal coordinates of the first character of the word in the current sentence.

  4. 4.

    The column named “Word_Length” shows the word length.

  5. 5.

    The column named “Word_Order” indicates order of the word in the current sentence.

  6. 6.

    The column named “Words” shows the current word.

Technical Validation

Qualitative validation

The following criteria assured the data quality of the present database. First, all data were collected in the same laboratory using the same protocols and tasks (i.e., silent sentence reading). Second, the participants recruited in the experiments were all college students and native Chinese speakers with normal or corrected-to-normal vision. Third, eye-movement measures were calculated using the previously validated analysis procedure. Together, these homogeneities minimize the variation of the experimental environment, tasks, procedures, and participants.

Quantitative validation

To quantitatively validate the database, we analyzed the impacts of word frequency and word length on three primary measures—FFD, GD, and FPF to examine whether the classic findings of small-scale experimental eye-tracking studies can be replicated using our database. These effects are well demonstrated3,5 and have often been used to validate computational models for reading33,34. We examined the effects in the current database by fitting a general linear model for each measure with log-transformed word frequency and word length as predictors. Word frequency was obtained from SUBTLEX-CH32, and was treated as a continuous variable, and word length was treated as a factor variable, with successive differences coding adopted. As shown in Table 3, the word frequency and word length effects were replicated in the current database. Words with higher frequency received shorter FFD, shorter GD, and lower FPF. The longer words received shorter FFD, longer GD, and higher FPF.

Table 3 Results for the Effects of Word Frequency and Word Length on the Main Eye-Movement Measures.

Considering that the number of observations of a word may substantially impact the data reliability of it, we re-conducted the analyses above by dividing the words into quarters based on the number of observations for each measure. Table 4 shows the lexical information for each quarter, and Supplementary Table 1 shows the results. There were expected word frequency and word length effects in each quarter, even in quarters where words had fewer observations (i.e., Quarter 1 and Quarter 2).

Table 4 Lexical Information of the Four Quarters of Words Divided Based on the Number of Observations.

In addition to the subtitle-based word frequency, we also used the word frequency from the Chinese Linguistic Data Consortium (2003) corpus to perform the same analyses above. The results are shown in Tables 5, 6 and Supplementary Table 2, which is similar to those using frequency from SUBTLEX-CH32 and thus also validated the current database.

Table 5 Results for the Effects of Word Frequency and Word Length on the Main Eye-Movement Measures.
Table 6 Lexical Information of the Four Quarters of Words Divided Based on the Number of Observations.

Usage Notes

The current database is available at OSF repository31. This database can contribute to understanding the cognitive mechanisms underlying Chinese reading in several ways. First, the current database can be analyzed to test new theoretical hypotheses regarding Chinese reading. Second, it can be used to find the optimal parameters for new computational models of Chinese reading and can provide benchmark data to evaluate them. Third, the current database, combined with the existing eye-tracking databases of alphabetic reading, can be used to investigate the mechanisms of reading cross-linguistically20. Finally, the large-scale eye-movement measures reported in the database can serve as indicators of word-processing difficulty in Chinese text reading. Thus, it can be used to control or manipulate the difficulty level of reading stimuli, which is valuable in scientific research and potentially helpful for selecting suitable reading materials for readers with different literacy skills.