Target sentence structures
We constructed 24 grammatically correct English target sentences with double-embedded object-relative clauses. Twelve of these (the semantically “neutral” sentences) had nouns and verbs that allowed for many meaningful combinations of agent, action, and patient. The nouns and verbs of the other 12 sentences (semantically “biased”) were such that most (if not all) unintended combinations were not meaningful. For each sentence, an ungrammatical version was created by removing the second verb phrase. Hence, there were 2 (Grammaticality) × 2 (Semantics) = 4 item conditions. Table 1 presents one example in each condition. The full list of target items is presented in the Appendix.¹
The semantically biased and neutral sentences were based on the stimuli from Gibson and Thomas (1999) and Frank et al. (2016), respectively. Some of Gibson and Thomas’s items contained words that we suspected would be unknown to many of our non-native participants. These words (e.g., ‘snubbed’) were replaced by better-known alternatives. Adverbs were added to the Frank et al. sentences to make them more similar to the semantically biased ones.
We had no reason to expect an interaction between Grammaticality and Semantics. The missing-VP effect has been demonstrated in both semantically neutral (Christiansen & MacDonald, 2009; Frank et al., 2016; Vasishth et al., 2010) and semantically biased sentences (Christiansen & MacDonald, 2009; Frank & Ernst, 2019; Gibson & Thomas, 1999), although reading-time studies have been restricted to neutral sentences (at least in English). The shallow structure hypothesis (Clahsen & Felser, 2006a, b) claims that non-native readers rely more on semantic cues when syntactic structure is highly complex. If so, we might expect our L1 Dutch participants to show a weaker missing-VP effect (or even no such grammaticality illusion) on the semantically biased sentences.
Regions of interest
We defined two regions of interest (RoI), as shown in the Appendix for each target stimulus. Region V3 comprises the final verb or auxiliary-verb pair (e.g., ‘was missing’ in Table 1) and region post-V comprises all words following V3 (e.g., ‘an important page.’ in Table 1). Vasishth et al. (2010) found a missing-VP effect on re-reading times in both these regions.
For consistency with Vasishth et al. (2010), we had originally also included a V1 region consisting of the first verb or auxiliary-verb pair, even though grammatical and ungrammatical sentences are identical up to and including this RoI, so first-pass effects of Grammaticality are not expected here. However, the length of the text line following V1 is confounded with Grammaticality (see Table 1 and the next paragraph), which may lead to an illusory effect of Grammaticality. Because we indeed found that the length of the remaining text line affects V1 reading time, we do not report results on the V1 region here. They are available as supplementary materials from https://osf.io/ye6dj.
Target sentences did not comfortably fit on a single screen line, so line breaks had to be inserted. To make sure that the screen position of the critical V3 and post-V regions was approximately the same for grammatical and ungrammatical items, the location of the line break differed between Grammaticality conditions: within the middle verb phrase for grammatical sentences, but shortly (or immediately) after the first main verb for ungrammatical items. There was at least one content word to the left of the V3 region, discouraging the return sweep from landing inside the V3 region. Long words or phrases in the original stimuli (Frank et al., 2016; Gibson & Thomas, 1999) were shortened if required to make the first line fit on screen (e.g., ‘ancient manuscript’ was replaced by ‘book’).
As can be seen in Table 1, the spatial location of the critical regions was not always identical between Grammaticality conditions. On average, they were positioned 1.8 characters (SD = 3.1) more to the right in the grammatical condition. In principle, this could have an influence on the main effect of Grammaticality. However, there is no reason to expect that it will affect the critical interactions with native language, L2 English proficiency, and L2 English exposure.
We constructed a first stimulus list with 24 target items (six from each condition) that were evenly distributed among 96 filler sentences. A second list was identical to the first but with the opposite Grammaticality conditions. Two further lists were created by reversing the order of the first two lists. Participants were assigned randomly to one of the four lists.
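The four-list counterbalancing scheme described above can be sketched as follows. This is our own illustration, not the authors’ stimulus-preparation code: item indices and condition labels are hypothetical placeholders, and the interleaving with filler sentences is omitted.

```python
def build_lists(n_items=24):
    """Sketch of the four counterbalanced lists of 24 target items."""
    # List 1: items 0-11 semantically neutral, 12-23 biased;
    # Grammaticality alternates, giving six items per condition.
    list1 = [{"item": i,
              "semantics": "neutral" if i < 12 else "biased",
              "grammatical": i % 2 == 0}
             for i in range(n_items)]
    # List 2: identical, but with the opposite Grammaticality condition.
    list2 = [dict(t, grammatical=not t["grammatical"]) for t in list1]
    # Lists 3 and 4: the first two lists in reversed presentation order.
    list3 = list(reversed(list1))
    list4 = list(reversed(list2))
    return list1, list2, list3, list4
```

Flipping Grammaticality between lists 1 and 2 ensures every item is seen in both versions across participants, while no participant sees the same item twice.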
Thirty sentences (six targets and 24 fillers) were paired with a yes/no comprehension question intended to ensure participants read attentively. No two consecutive items appeared with a comprehension question.
The experiment was completed by 197 participants. Following the pre-registered analysis, 14 of these were excluded because they scored below 70% correct on the comprehension questions; one additional participant was excluded because of persistent calibration failure. Of the remaining participants, 58 were native English speakers (L1 English group: 41 females, 11 males, six other/unknown; mean age 20.2, range 18–26). The other 124 were native Dutch speakers with English as a second language (L1 Dutch group: 84 females, 40 males; mean age 22.3, range 18–57). To obtain a wide range of L2 English exposures for this rather homogeneous group, we explicitly recruited participants studying in bachelor or master programs that are (nearly) exclusively taught in Dutch or fully taught in English.
All participants filled out a language background questionnaire and completed the Vernon-Warden reading test (VWRT; Hedderly, 1996), a timed test of increasingly difficult fill-in-the-blank multiple-choice questions, as a measure of English reading proficiency.
Figure 1 shows how the two groups performed on three dimensions of English proficiency: VWRT score, reading speed (operationalized as average total reading time on filler sentences), and comprehension accuracy (percentage of errors on comprehension questions). The L1 English group read faster and scored higher on the VWRT, although both measures show considerable overlap between groups. There was no difference whatsoever between groups in comprehension accuracy, and error rates were similar to those reported by Vasishth et al. (2010) for L1 English participants and by Frank et al. (2016) for L1 Dutch participants tested in L2 English.
L2 English proficiency and exposure
The L1 Dutch participants’ VWRT results and language background questionnaires were used to obtain scores for their L2 English proficiency and exposure/use. The overall proficiency score was computed by running a principal component analysis (PCA) over z-scores of five separate L2 English proficiency measures:² VWRT scores and self-rated (seven-point scales) English proficiency in speaking, listening, reading, and writing.
PCA is a well-known method for summarizing high-dimensional data in a smaller number of dimensions, called PCA components. These components are ordered by how much of the variance in the original data they explain: The first PCA component explains the largest amount. Hence, we took this first component to comprise the participants’ overall Proficiency scores; it explained 62% of the variance in the five proficiency measures. As an additional output of PCA, each of the original measures receives a so-called factor loading, indicating to what extent the measure contributes to each PCA component. For the first PCA component, which we use as the proficiency measure, the factor loadings on the four self-rated proficiencies were approximately equal (0.43–0.50) and slightly higher than the loading on VWRT scores (0.35).
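The procedure can be sketched in a few lines. This is a minimal illustration assuming a participants × measures matrix of raw scores, not the authors’ actual analysis code:

```python
import numpy as np

def first_component_scores(X):
    """First-principal-component scores, loadings, and explained variance
    for a participants-by-measures matrix X (illustrative sketch)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)           # z-score each measure
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    top = np.argmax(eigvals)                           # component explaining most variance
    loadings = eigvecs[:, top]
    if loadings.sum() < 0:                             # resolve the arbitrary sign
        loadings = -loadings
    explained = eigvals[top] / eigvals.sum()           # proportion of variance explained
    return Z @ loadings, loadings, explained
```

Each participant’s score on the first component is a loading-weighted sum of their z-scored measures, which is why roughly equal loadings (as reported above) make the score behave like an average of the five proficiency measures.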
Similarly, an overall L2 English exposure measure was computed by running a PCA on the questionnaire data related to amount of exposure to (and use of) English as a second language: self-rated (seven-point scales) amount of English used for speaking, listening, reading, and writing; number of years since first exposure to English and since first formal schooling in English; estimated hours per week of English use in classes, reading for study, reading for leisure, and listening/speaking; estimated percentage of reading in English for study and for leisure; and number of English-taught classes in the current study year.³ The first PCA component comprised the participants’ overall Exposure scores; it explained 46% of the variance in the 13 exposure measures. Factor loadings were between 0.22 and 0.34 for all measures except number of years since first English exposure and schooling, which both had slightly negative loadings.
Validating the proficiency and exposure measures
As expected, Proficiency and Exposure were positively correlated (r = .61 over all L1 Dutch participants; r = .62 over participants included in the analysis; see Fig. 2).
If the Proficiency score is indeed a valid measure of true L2 English proficiency, it should correlate positively with reading speed and/or comprehension accuracy. Whether this is the case for the Exposure score is less clear, because any positive effect of increased L2 English exposure may already be incorporated in the Proficiency score. To validate the two scores, we looked at the correlations between the Proficiency and Exposure scores on the one hand and the two behavioural measures (log-transformed average filler-sentence RT and percentage of errors on comprehension questions) on the other. As shown in Table 2, higher Proficiency is associated with fewer errors and higher Exposure with shorter reading times. However, the partial correlations, where Exposure is partialled out from Proficiency or vice versa, show that the relation between Exposure and reading time does not survive the correction for Proficiency. We will return to the issue of Proficiency/Exposure validity in the Discussion.
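Partialling one predictor out of another amounts to the standard first-order partial-correlation formula, which can be sketched as follows (the correlation values in the comment are purely illustrative, not those of Table 2):

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y with z partialled out (first-order formula)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# E.g., with two predictors correlated around .6 (as Proficiency and
# Exposure are here), a modest raw correlation between Exposure and
# reading time can shrink towards zero once Proficiency is partialled out.
```

This illustrates why a raw Exposure–reading-time correlation may vanish after correcting for Proficiency: much of the shared variance is carried by the other predictor.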
All L1 Dutch speakers and eight of the L1 English speakers were tested at the Centre for Language Studies lab of Radboud University, Nijmegen. They received €10 or course credit for their participation. All other L1 English-speaking participants were tested at the Multimodal Multilingual Language Processing lab at the University of Birmingham. They received £7 for their participation.
Participants were seated with their head in a chin rest, at a distance of 50 cm from the SR Research EyeLink 1000+ eye tracker. An instruction screen then informed the participants they would read 120 sentences, one at a time, with a break halfway. Participants were instructed to look at the fixation point until the sentence appeared and to read it in a natural fashion. After reading the sentence, they had to press the space bar and answer the yes/no question (if any) by means of a key press. After successful nine-point calibration, five practice sentences with two practice questions followed. After 60 trials, the participants were given the opportunity to have a break. After the break, another calibration was performed, and the participants proceeded with reading the remaining 60 sentences.
Each trial consisted of a fixation point on the left side of the screen, where the first word of the sentence would appear. This fixation point simultaneously served as a correction for small drifts in the gaze position. The sentence appeared when the gaze came close enough to the fixation point for the experimenter to accept the drift correction. Participants pressed the space bar when they had read and understood the sentence. Stimuli were presented in 18-point Calibri font. If a sentence was followed by a question, the word ‘question’ was presented with the question underneath, and below that the words ‘yes’ and ‘no’ with their corresponding response keys (‘z’ and ‘m’, respectively).
The eye tracking took approximately 30 minutes, including setup and calibration. Following this phase of the experiment, participants filled out the background questionnaire and completed the VWRT. A complete session could take up to 1 hour. The study was approved by the Ethics Assessment Committee Humanities of Radboud University.
The EyeLink tracker software automatically assigns fixations to words. However, because of drift or imperfect calibration, fixations can systematically land too far above or below the text to be assigned to a word. For this reason, all fixations were checked by a research assistant, who used the software Fixation (Cozijn, 2006) to move any unassigned fixations vertically so that they could be assigned to words. Such adjustments were rarely required: only 0.12% and 0.05% of fixations were reassigned for the L1 Dutch and L1 English participants, respectively. Trials were marked as unusable if it could not reliably be determined which fixations belonged to which words. These trials (2.55% for L1 Dutch, 1.15% for L1 English) were excluded from analysis.
Reading time measures
We analyzed effects on one early and one later reading time (RT) measure: first-pass RT and regression-path RT. The regression-path RT for a RoI is defined as the time between the first fixation in the RoI and the first fixation on a word to the right of the RoI. Both first-pass and regression-path RT are 0 if (and only if) the RoI is skipped in first pass; these data points were excluded from analysis. Non-zero RTs were only excluded if they were extremely long: over 4 s for first-pass RT and over 20 s for regression-path RT. Regression paths may include multiple re-readings over long stretches of the sentence, leading to very long, but not unrealistic, RTs. The percentages of data points excluded because of extreme RTs can be found in Table 3.
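For concreteness, the two measures can be derived from a trial’s chronological fixation record roughly as follows. This is a simplified sketch with hypothetical `(word_index, duration)` input, not the authors’ actual pipeline:

```python
def region_rts(fixations, roi_start, roi_end):
    """First-pass and regression-path RT (ms) for one region of interest.

    fixations: chronological (word_index, duration) pairs for one trial.
    Both measures are 0 when the RoI is skipped in first pass.
    """
    first_pass = regression_path = 0
    entered = False      # RoI reached in first pass?
    left = False         # RoI exited leftward (first pass over)?
    for word, dur in fixations:
        in_roi = roi_start <= word <= roi_end
        if not entered:
            if word > roi_end:
                return 0, 0          # RoI skipped in first pass
            if not in_roi:
                continue             # still reading to the left of the RoI
            entered = True
        if word > roi_end:
            break                    # first fixation right of RoI ends the path
        regression_path += dur       # regressions left of the RoI still count
        if not left and in_roi:
            first_pass += dur        # fixations inside the RoI before exiting it
        elif not in_roi:
            left = True              # left the RoI leftward: first pass is over
    return first_pass, regression_path
```

The sketch makes the relation between the measures explicit: regression-path RT always contains first-pass RT, and additionally accumulates any leftward re-reading before the eyes move past the region, which is why it can grow very long without being unrealistic.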
For each RT measure and each RoI, two Bayesian mixed-effects regression models were fitted using the R package brms (v.2.8.0, Bürkner, 2017). The first model compared the L1 Dutch and L1 English participant groups, the second analyzed effects of Proficiency and Exposure within the L1 Dutch group.
Between-group analyses included factors for Grammaticality, Semantics, and L1, plus all two-way interactions and the three-way interaction between them. Subject and item were included as random effects, with by-subject random slopes of Grammaticality and Semantics and by-item random slopes of Grammaticality and L1. For the L1 Dutch group analysis, we included Grammaticality, Semantics, Proficiency, and Exposure as predictors. Two- and three-way interactions were included if they did not contain both Proficiency and Exposure. Further, there were by-subject random slopes of Grammaticality and Semantics and by-item random slopes of Grammaticality, Proficiency, and Exposure.
Proficiency and Exposure scores were standardized (z-scores). The factor levels for Grammaticality, Semantics, and L1 were coded as −0.5 (Ungrammatical, Neutral, Dutch) and +0.5 (Grammatical, Biased, English). This means that a non-negative coefficient of Grammaticality is indicative of the missing-VP effect (RTs in grammatical sentences no shorter than in ungrammatical sentences), and a positive interaction between Grammaticality and L1 means that the effect is stronger for the L1 English group, as predicted by the high-exposure account. The high-exposure account further predicts a positive interaction between Grammaticality and Exposure, while the low-proficiency account predicts a negative interaction between Grammaticality and Proficiency.
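The consequence of this ±0.5 coding can be verified with a toy example (the cell means below are invented for illustration): the Grammaticality coefficient equals the difference between the grammatical and ungrammatical condition means, and the interaction equals the difference of those differences across groups.

```python
import numpy as np

# Hypothetical cell means (ms); invented for illustration, not the study's data.
means = {("ungram", "dutch"): 900, ("gram", "dutch"): 880,
         ("ungram", "english"): 700, ("gram", "english"): 760}

def code(level, positive):
    """Map a factor level onto the +/-0.5 coding used in the models."""
    return 0.5 if level == positive else -0.5

# Design matrix: intercept, Grammaticality, L1, and their interaction.
X = np.array([[1.0, code(g, "gram"), code(l, "english"),
               code(g, "gram") * code(l, "english")]
              for (g, l) in means])
y = np.array(list(means.values()), dtype=float)
beta = np.linalg.solve(X, y)   # four cells, four coefficients: exact fit
# beta[1] is the Grammaticality main effect (difference of condition means);
# beta[3] is the Grammaticality-by-L1 interaction (difference of differences).
```

With these made-up means, grammatical sentences are read 20 ms slower overall (beta[1] = 20), and this slowdown is 80 ms larger in the English group than in the Dutch group (beta[3] = 80), the qualitative pattern the high-exposure account predicts.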
Because RTs are always positive and their distribution right-skewed, we should not assume normally distributed data. Posterior predictive checks (see supplementary materials) revealed that a Gamma distribution was most appropriate for the post-V region while a shifted log-normal distribution worked best for the V3 region, so these two distributions were assumed for the respective regions.⁴
The number of iterations for model fitting was set to 3000 (of which 1000 were warmup) and control parameters were set to the brms defaults. Priors, too, were the brms defaults, except for the intercept priors, which were normally distributed with means of 6 (on a log-ms scale) for first-pass RT on the V3 region, 7 for regression-path RT on V3 and first-pass RT on post-V, and 8 for regression-path RT on post-V. All intercept priors had a standard deviation of 1.
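Because the intercept priors are specified on a log-ms scale, their means correspond to plausible raw reading times. A quick back-of-the-envelope check (our own arithmetic, not from the paper):

```python
import math

# Prior means of 6, 7, and 8 on the log-ms scale correspond to
# roughly 403 ms, 1097 ms, and 2981 ms of raw reading time.
prior_means_ms = [round(math.exp(m)) for m in (6, 7, 8)]
```

These values sit in the range one would expect for first-pass and regression-path times on short and long regions, which is what makes the priors weakly informative rather than arbitrary.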
Deviations from preregistration
As listed below, our analysis deviates in several respects from what was preregistered (see https://osf.io/ye6dj). Results from the preregistered analysis are presented in the supplementary materials. These yield the same conclusions as the current analysis.
After preregistration, we had the opportunity to test 20 additional L1 Dutch participants. Because we aimed to detect (possibly very small) effects of individual differences among L1 Dutch participants, we chose to include them. Bayesian data analysis (unlike traditional null-hypothesis testing) allows for incrementally increasing the amount of data.
According to the preregistration, only re-reading time (equal to total RT minus first-pass RT) would be used as a later RT measure. Re-reading times of 0 occur when a RoI is not fixated after first pass. However, it turned out that over 50% of re-reading times would be excluded for this reason. This motivated us to use regression-path RT instead, which yields much less data loss. Liversedge et al. (1998) recommend both re-reading time and regression-path RT as measures of recovery from reading difficulty.
The preregistration states that all priors would be the brms defaults, but this was not the case for the intercepts.
Gamma distributions (as recommended by Lo & Andrews, 2015) were preregistered, but an anonymous reviewer noted that it is more common to assume a log-normal distribution for reading times. As described above, we settled on the shifted log-normal distribution for the V3 region.