Language in science performance: do good readers perform better?

Science performance is highly affected by students' reading comprehension. Recently, there has been growing attention to the role of linguistic features in science performance, but findings are ambivalent, especially with regard to item word count. The aim of this study was to investigate the interaction of students' reading comprehension and item word count of given science measures on performance, controlling for students' cognitive abilities, gender, family language, and school track. The sample consisted of N = 2051 German students in grades 10 and 11. Students completed (scientific) literacy measures. We then applied a multilevel logistic regression to investigate the hypothesized interaction effect of reading comprehension and word count on students' science performance. The results showed a significant interaction of students' reading comprehension and word count on science performance, controlling for several covariates. In particular, students with high reading comprehension benefit from science items with increasing word count. Our findings empirically support previous research, showing that reading comprehension is crucial for science performance and enhances the interaction between reading comprehension and linguistic features of written text in science subjects. Finally, theoretical and practical implications and limitations of this study are discussed.

changed, and communicated with the help of language (Yore et al. 2003). Therefore, it is crucial for scientific achievement to be literate (Dempster and Reddy 2007; Härtig et al. 2015; Maerten-Rivera et al. 2010; Otero et al. 2002; Ozuru et al. 2009; Voss and Silfies 1996). The importance of reading comprehension for science measures is demonstrated by remarkable correlations between students' performance on science tasks and their reading competence. The correlations vary between 0.58 and 0.90 across studies (Cromley 2009; O'Reilly and McNamara 2007), which can be considered moderate to high correlations between two constructs (Cohen 1992). Therefore, students with low reading comprehension are at a disadvantage compared with students with average or high reading comprehension regarding their science performance. Students with good science competence may be hindered by (deficits in) their reading comprehension from showing their full potential in science measures.
Moreover, science performance does not only depend on students' reading comprehension but also on characteristics of science texts and test items (e.g., Bird and Welford 1995;Prophet and Badede 2009). Although the importance of linguistic features and their complexity for solving written tasks is undeniable, only a few studies have yet focused on the influence of specific linguistic features on science performance. There have been two major approaches pursued by researchers to investigate the effect of linguistic features on item difficulty and students' performance: the modification of specific linguistic features in science measures and secondary analyses.
Regarding the modification of linguistic features, several studies have simplified science items based on guidelines and recent findings (e.g., Cassels and Johnstone 1984). For example, it has been shown repeatedly that a high density of technical terms in science textbooks and test items tends to generate difficulties for students, which is linked to lower performance (e.g., Butler et al. 2004; Prenzel et al. 2002; Schmiemann 2011; Snow 2010). Other linguistic features that have frequently been shown to affect performance in various domains, such as mathematics and reading, are the passive voice (Berndt et al. 2004; Kharkwal and Stromswold 2014), the negation of sentences (e.g., Tamir 1993), pronouns (e.g., Oakhill and Yuill 1986), and many more. Several studies find significant effects when simplifying the language used in science items (e.g., Bird and Welford 1995; Kettler et al. 2012; Prophet and Badede 2009; Siegel 2007), and two meta-analyses confirm the effectiveness of linguistic simplification for English language learners (Kieffer et al. 2012; Pennock-Roman and Rivera 2011). Linguistic simplification is even able to reduce performance gaps between English language learners and native speakers (Rivera and Stansfield 2004). However, some studies also report non-significant results and only observe improvements of performance in descriptive analyses (e.g., Höttecke et al. 2018), while other studies even report contradictory findings, with simplification not always working out as initially hoped (e.g., Leiss et al. 2017; Rivera and Stansfield 2004). Often, it remains unclear why simplifying the language in science test items sometimes fails to help students in solving the items.
Regarding secondary analyses, research found evidence for even more linguistic features that may generate difficulties for students, resulting in lower science performance. Heppt et al. (2015) identified linguistic features such as general academic words, words with three or more syllables, and compounds as difficult for students. Interestingly, word count, complex sentence structures, and noun phrases were reported as non-significant. Dempster and Reddy (2007) found that complex sentences and unfamiliar words predicted response patterns the best: Both complex sentences and unfamiliar words seem to generate difficulties for students, resulting in lower performance. A further analysis of the items on which students performed well showed that these items tended to be shorter, while more difficult items had more prepositional phrases, noun phrases, passive constructions, and words with multiple meanings. It remains unclear where exactly the difficulty for students arises. Stiller et al. (2016) further identified word count in the item stem and the responses as significant for item difficulty, while Prenzel et al. (2002) did not find a significant effect for word count.
While the majority of studies show consistent results for linguistic features such as technical terms (e.g., Butler et al. 2004; Prenzel et al. 2002; Snow 2010; Stiller et al. 2016), findings become considerably more ambivalent for item word count. As stated above, secondary analyses are not consistent: Some studies report significant effects for word count (Stiller et al. 2016), while others do not (Heppt et al. 2015; Prenzel et al. 2002). With regard to experimental studies, Cassels and Johnstone (1984) have shown that an increased word count with embedded clauses resulted in an overall increase of item difficulty and consequently in a decrease of students' performance on a chemistry test. In this case, it remains unclear if the effect arises from the word count or the embedded clauses, which increase sentence complexity. Prophet and Badede (2009), on the other hand, have shown that an extended reduction of overall word count does not necessarily result in higher science performance, demonstrating that an excessive reduction of word count leads to a loss of information, which in turn influences students' performance negatively. The researchers agree that for an effective simplification of science items by reducing word count, it is necessary to remove irrelevant information only. Bird and Welford (1995) also conducted a study in which they modified, or rather simplified, questions of a science exam by, inter alia, reducing the word count and substituting words, finding no statistically significant effects on students' overall performance. The effects of simplification leading to better performance only became significant when investigating non-native students. Still, it again remains unclear if the better performance can be explained by the word count or the substitution of words.
Several experimental studies in which linguistic features in science items, such as word count and sentence structure, were modified report ambivalent effects on students' performance (Höttecke et al. 2018; Rivera and Stansfield 2004; Abedi et al. 2003). However, it is not stated which factor specifically contributed the most to linguistic simplification and better student performance.
In sum, prior investigations were able to identify multiple linguistic features that may have a significant effect on item difficulty for students. While the majority of these findings are consistent, the scientific evidence for the effect of word count remains vague. It seems apparent that word count may make a difference for students' science performance and item difficulty based on prior research and reported findings (e.g., Bird and Welford 1995;Cassels and Johnstone 1984;Stiller et al. 2016). However, it remains unclear if word count is the driving force in those effects or rather other linguistic features such as sentence structure, which is often modified simultaneously with word count.

The present investigation
In this study, we focus on the relation between science items' word count and students' reading comprehension, because findings on the effect of word count on item difficulty have been ambivalent (e.g., Bird and Welford 1995; Cassels and Johnstone 1984; Prophet and Badede 2009; Stiller et al. 2016). It may be possible to shed some new light on the ambivalence of findings by taking interaction effects of word count and students' reading comprehension into account. Students with good reading comprehension are proficient in extracting the main substance of a written text (Gibson and Levin 1975). Hence, it seems possible that these proficient students are better able to extract the relevant information from a science item with a higher word count than students with less proficient reading comprehension are.
To investigate our research question, we examined three hypotheses. First, we aimed to support the finding that students' reading comprehension has a positive effect on science performance, meaning that science performance increases with higher reading comprehension (e.g., Bayat et al. 2014;Cromley 2009;O'Reilly and McNamara 2007).
Second, to examine the effect of word count, we analyzed science items with respect to their word count, hypothesizing that the word count of science items affects the probability of a correct response for the respective items. To avoid confounding effects, we only included items with closed-response formats. Open-response formats were excluded due to the additionally required productive language skills (Brown and Hudson 1998) and due to the effects of response formats on performance (DeMars 2000; Härtig et al. 2015; Reardon et al. 2018). Furthermore, items with pictures were excluded, because it is known that processing texts with pictures may have favorable effects on reading comprehension (Carney and Levin 2002; Mayer 1989, 2001; Mayer and Gallini 1990).
Third, we hypothesized an interaction effect between item word count and students' reading comprehension on science performance that may explain former ambivalent findings (e.g., Cassels and Johnstone 1984;Stiller et al. 2016). In detail, we assumed that especially students with high reading comprehension benefit from increased item word count, leading to better science performance due to their better competence of extracting the information relevant for solving the science items correctly.
Finally, we controlled for (1) cognitive abilities, (2) gender, (3) family language, and (4) school track when testing the relation between reading comprehension and item word count. First, while cognitive abilities are related to a moderate to high degree with reading comprehension (Kendeou et al. 2015; Naglieri 2001), cognitive abilities also correlate with science performance (0.46 to 0.77; Deary et al. 2007). Therefore, we included students' cognitive abilities in our model, expecting that the interaction effect of reading comprehension and word count remains significant after controlling for students' cognitive abilities. Second, we controlled for students' gender because (a) male students tend to slightly outperform females in science measures (e.g., Ivanov and Nikolova 2010) and (b) females usually outperform male students on reading comprehension measures (e.g., Wendt et al. 2010). Third, we controlled for students' family language, because research suggests that students with a migratory background tend to show lower performance than native speakers (e.g., Heppt et al. 2015; Turkan and Liu 2012). Finally, we controlled for students' school track, because it has been suggested that students attending the academic track tend to outperform students attending the non-academic track in reading comprehension and science measures (e.g., Ivanov and Nikolova 2010; Wendt et al. 2010).

Sample
The data used in this study were retrieved from the larger study "Competencies and attitudes of students" (KESS, in German: Kompetenzen und Einstellungen von Schülerinnen und Schülern) conducted in Hamburg, Germany (for a description of KESS see Vieluf et al. 2011). KESS is a longitudinal study that started in 2003 and comprises five occasions of data collection. Even though KESS was completed in 2012, re-analyzing the KESS data is still valid: Neither the science curriculum in Hamburg, Germany, nor the tests used in educational assessment have changed considerably since the KESS study.
In the current study, we used data from the fourth data collection in 2009, where N = 13,328 teenagers at the end of grade 10 or the beginning of grade 11 were tested. Students who attended the comprehensive school (non-academic track) were tested in grade 10 before the summer break, while students attending the German Gymnasium (academic track) were tested shortly after the summer break in grade 11. If students were tested twice due to a change of school over the summer break, only their first assessment was incorporated in the data and analyzed.
The analytic sample for the current study was limited to those students who worked on the science tests; students not working on these tests due to the test design were excluded. Students that did not take part in the reading comprehension and cognitive ability tests were also excluded. Consequently, there were no missing data in the sample in the assessed competence measures due to the specific selection of our analytic sample. Missing answers on single items were coded as false. The final sample consisted of 2051 students (age: M = 16.5, SD = 0.53; 51.9% female) from 52 schools in Hamburg, Germany, which were tested in the non-academic track in grade 10 (43.3%) or in the academic track in grade 11 (56.7%). Most students only spoke German at home (84.7%), while a minority used German and another language or just another language at home (15.3%). Students' demographics were obtained by questionnaires. All measures were assessed in German.

Measures
Science The science test was based on the scientific-literacy concept, which measures the competence to apply scientific knowledge appropriately and accordingly to a specific situation (cf. Vieluf et al. 2011). The tests for grades 10 and 11 consisted of 46 items in total. The items stem from the Third International Mathematics and Science Study conducted in 1995 (TIMSS/III; Baumert et al. 2000).
For our study and its specific research question, items with pictures connected to the individual task (n = 11) and with an open-answer format (n = 7) were excluded from the analyses due to possible confounding of additional factors, in particular required productive skills and additional effects on reading comprehension (e.g., Brown and Hudson 1998; Mayer 1989, 2001). Pictures in science items have been found to affect science performance, depending on the type of picture. Decorative pictures, for example, do not affect science performance, but other types of pictures, such as transformational ones, can enhance systematic thinking, which in turn may affect performance positively (e.g., Levin et al. 1987). The remaining 28 items were multiple-choice items with three to five alternative answers, including items from different subjects: biology (n = 9), chemistry (n = 5), earth sciences (n = 6), environmental issues and the nature of science (n = 3), and physics (n = 5).
The science items were coded by two independent raters with regard to their word count. For the coding, the number of words in the item, including the question and the multiple-choice answers, was counted and summed (M = 36.9, SD = 20.8). Cohen's kappa was 0.95, indicating an excellent degree of agreement of the coded word count (Cohen 1960). The slight deviation in coding was a result of initial uncertainties regarding symbols in the said items. Moreover, the science items in KESS were relatively simple in sentence structure: There were almost no embedded or subordinate clauses, with the sentence structure mainly consisting of main clauses. Negations, pronouns, noun phrases, and nominalizations were kept to a minimum.
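The agreement statistic reported above can be illustrated with a minimal sketch. The word counts below are hypothetical and are not KESS data, and treating each distinct word count as a nominal code is an assumption about how kappa was applied to the two raters' counts.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who assign nominal codes to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items that received identical codes.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per code.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # degenerate case: both raters used one identical code throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical word counts from two raters for eight items (assumption, not KESS data).
rater_1 = [12, 25, 31, 40, 58, 19, 77, 33]
rater_2 = [12, 25, 31, 41, 58, 19, 77, 33]
kappa = cohens_kappa(rater_1, rater_2)
```

In this toy example the raters disagree on one of eight items, so kappa falls somewhat below the raw agreement rate of 7/8.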
All students in our analytic sample completed the 28 items. Students' answers on each item were coded as correct or false: Missing answers on single items were coded as false. In a Rasch model for dichotomous data, weighted likelihood estimates (WLEs; Warm 1989) were estimated as students' science performance scores and standardized (M = 100, SD = 30), using ConQuest (Wu et al. 1998). In ConQuest, item and person parameters are estimated with the marginal maximum likelihood (MML) method, which provides reliable estimates (Rost 2004). The WLE reliability of the science performance was sufficient (0.72).
The WLEs were primarily used to calculate the correlations of students' science performance and reading comprehension and cognitive abilities, respectively. To investigate the research question, students' answers on each item (0 = false answer, 1 = correct answer) were used in the statistical analyses.
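As a rough illustration of the scoring logic described above, the following sketch shows the dichotomous Rasch response function and a linear rescaling of ability estimates to M = 100 and SD = 30. It does not reproduce ConQuest's WLE estimation; the function names, the example values, and the use of the population standard deviation are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def rasch_probability(theta, difficulty):
    """Probability of a correct response in the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def rescale(scores, target_mean=100.0, target_sd=30.0):
    """Linearly transform scores to a chosen mean and (population) SD."""
    m, s = mean(scores), pstdev(scores)
    if s == 0:
        raise ValueError("scores must vary to be rescaled")
    return [target_mean + target_sd * (x - m) / s for x in scores]


# Hypothetical ability estimates on the logit scale (assumption, not KESS data).
example_wles = [-1.0, 0.0, 1.0]
scaled = rescale(example_wles)
```

A student whose ability equals the item difficulty has a 50% chance of answering correctly, which is the defining property of the Rasch response function.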
Reading comprehension The assessment of students' reading comprehension consisted of eight texts (two factual texts, two stories, three newspaper articles, and one text composed of graphics) with 58 tasks, using mostly multiple-choice answer formats with four alternative answers and a few open-answer formats. The tasks were rotated systematically between the booklets with a multi-matrix design. Therefore, the booklets varied in their number of items, ranging from 36 to 39 items, all of them including 24 anchor-items.
WLEs (Warm 1989) were estimated as students' reading comprehension, using ConQuest (Wu et al. 1998). The WLE reliability of the reading comprehension tasks was good (0.82).
Cognitive abilities Students completed a non-verbal subtest of a German test to assess children's and adolescents' cognitive abilities (in German: Kognitiver Fähigkeitstest; Heller and Perleth 2000), which is based on Thorndike and Hagen's Cognitive Abilities Test (Thorndike and Hagen 1971). According to Heller and Perleth (2000), the construct validity is given with the non-verbal subtest and as per test manual, the internal consistency is 0.93, indicating an excellent estimate of reliability of the test scores. Students had 8 min to process the subtest, which consists of 25 items. This subtest is considered to be a fair indicator of general cognitive abilities (Neisser et al. 1996).
Since cognitive abilities were tested at neither the end of grade 10 nor the beginning of grade 11, we used students' performance from the previous data collection in grade 8. Using ConQuest (Wu et al. 1998), WLEs (Warm 1989) were estimated for students' cognitive abilities. The WLE reliability of the cognitive ability test was good (0.82).

Statistical analyses
We analyzed how the interaction of word count and students' reading comprehension affects the probability of a correct response on science items by applying multilevel logistic regression modeling in Mplus 8.3 (Muthén and Muthén 1998-2015). We chose a multilevel model because of the hierarchical structure of the data: We considered the science items with varying word count as being nested in students. Furthermore, we applied logistic regression in our multilevel model due to the binary nature of our dependent variable, science performance on item level (0 = false answer, 1 = correct answer). In our multilevel model, the item characteristic word count was treated as a within-level variable (on level 1). Person characteristics, i.e., students' reading comprehension and other covariates (cognitive abilities, gender, family language, and school track), were treated as between-level variables (on level 2).
Finally, to test the effect of the interaction between students' reading comprehension and items' word count on science performance, a cross-level interaction between these variables was included in our model. To test for the role of cognitive abilities, we further included a cross-level interaction between students' cognitive abilities and item word count on science performance. Furthermore, we controlled for three dummy-coded variables: gender (0 = male, 1 = female), family language (0 = German, 1 = German and another language, or only a language other than German), and school track (which equals the time of assessment; 0 = non-academic track, 1 = academic track). All predictors on the within- and between-level except the dummy-coded variables were standardized (M = 0, SD = 1) for an easier interpretation of the effects, and odds ratios were calculated for the effects.
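The model described above can be written out as a sketch in standard two-level notation. The symbols are ours and do not appear in the original analysis: i indexes items, j indexes students, WC is item word count, RC is reading comprehension, and CA is cognitive abilities. The random slope term u_{1j} is an assumption; the text does not state whether the word count slope was allowed to vary randomly across students.

```latex
\begin{aligned}
\text{Level 1 (items):}\quad
  & \operatorname{logit} P(Y_{ij} = 1) = \beta_{0j} + \beta_{1j}\,\mathrm{WC}_{i} \\
\text{Level 2 (students):}\quad
  & \beta_{0j} = \gamma_{00} + \gamma_{01}\,\mathrm{RC}_{j} + \gamma_{02}\,\mathrm{CA}_{j}
      + \gamma_{03}\,\mathrm{Gender}_{j} + \gamma_{04}\,\mathrm{Language}_{j}
      + \gamma_{05}\,\mathrm{Track}_{j} + u_{0j} \\
  & \beta_{1j} = \gamma_{10} + \gamma_{11}\,\mathrm{RC}_{j} + \gamma_{12}\,\mathrm{CA}_{j} + u_{1j} \\
\text{Odds ratio:}\quad
  & \mathrm{OR}_{k} = \exp(\gamma_{k})
\end{aligned}
```

In this notation, the hypothesized cross-level interaction corresponds to \gamma_{11}, and the conditional effect of word count for a student with reading comprehension RC_j (at average cognitive abilities) is \gamma_{10} + \gamma_{11} RC_j.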

Descriptive statistics and correlations
Since measuring science performance involves knowledge from five domains (biology, chemistry, earth sciences, environmental issues and the nature of science, and physics), we tested if the corresponding items differed in their word count. We did not find significant differences, F(4,23) = 0.44, p = 0.777. The probability of a correct response for the items measuring science performance ranged from 0.21 to 0.93 (M = 0.56, SD = 0.16).
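The degrees of freedom of the reported F test follow from the 28 items in five domains (5 − 1 = 4 and 28 − 5 = 23). A minimal sketch of a one-way ANOVA makes this explicit; the word counts per domain are hypothetical placeholders with the group sizes reported above and are not the KESS item data.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical word counts per domain (assumption); only the group sizes match the text.
word_counts_by_domain = [
    [22, 35, 41, 18, 60, 27, 33, 48, 39],  # biology (n = 9)
    [30, 25, 52, 44, 19],                  # chemistry (n = 5)
    [28, 36, 71, 15, 40, 33],              # earth sciences (n = 6)
    [45, 29, 38],                          # environmental issues / nature of science (n = 3)
    [24, 58, 31, 47, 20],                  # physics (n = 5)
]
f_value, df_between, df_within = one_way_anova(word_counts_by_domain)
```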
The correlations between students' science performance, reading comprehension, and cognitive abilities are presented in Table 1. All three variables correlated positively with moderate to large effect sizes. Table 1 further contains the correlations of these variables with the remaining covariates gender, family language, and school track.
Moreover, we tested differences between students from the academic vs. non-academic track. Academic track students outperformed non-academic track students in science performance (d = 1.08), reading comprehension (d = 1.33), and cognitive abilities (d = 0.88). Finally, the ICC for the science items indicated that 10.3% of the variance in science performance is due to differences between students.
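For readers who want to relate the reported ICC to a variance component, the following sketch uses the latent-variable formulation common for two-level logistic models, in which the level-1 residual variance is fixed at π²/3. Whether the ICC above was computed this way is an assumption; the function names are ours.

```python
import math

# Level-1 residual variance of the standard logistic distribution.
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3


def icc_from_intercept_variance(tau2):
    """ICC of a two-level logistic model given the between-student intercept variance."""
    if tau2 < 0:
        raise ValueError("variance must be non-negative")
    return tau2 / (tau2 + LOGISTIC_RESIDUAL_VARIANCE)


def intercept_variance_from_icc(icc):
    """Invert the ICC formula to recover the implied intercept variance."""
    if not 0 <= icc < 1:
        raise ValueError("ICC must lie in [0, 1)")
    return icc * LOGISTIC_RESIDUAL_VARIANCE / (1 - icc)


# Intercept variance implied by the reported ICC of 10.3% under this formulation.
implied_tau2 = intercept_variance_from_icc(0.103)
```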

Multilevel analyses
The results of our multilevel model, including the unstandardized coefficient (B), the standard error (SE), the p value, the odds ratio (OR), and the confidence intervals (CI), are presented in Table 2. Contrary to our prediction, the item word count was not a significant predictor on the within-level (level 1). However, reading comprehension and cognitive abilities proved to be significant predictors on the between-level (level 2), indicating a higher probability of a correct response with increased reading comprehension and cognitive abilities. Moreover, the remaining covariates (gender, family language, and school track) also were significant predictors on the between-level.
These main effects, however, need to be interpreted in light of the interaction between reading comprehension and word count, which was significant. Figure 1 portrays the interaction effect (black line) and the 95% confidence intervals (gray line). The x-axis of the plot depicts students' reading comprehension from −2 SD to +2 SD, while the y-axis depicts the conditional slope of science performance on items' word count (B, science performance on word count). The x-axis depicts standardized values (M = 0, SD = 1), while the y-axis depicts the unstandardized regression coefficient B. The increasing black line shows that students with higher reading comprehension benefitted from increased item word count, which led to increased science performance. However, we also need to take the confidence intervals into account, because they indicate at which levels of reading comprehension the effect of word count on science performance is significant (also known as the region of significance, cf. Johnson and Neyman 1936). In Fig. 1, the confidence intervals no longer cross the x-axis from just above −1 SD, which indicates that the effect becomes significant at that point. This means that for students with very low reading comprehension (below −1 SD), there are no effects of word count on science performance. For students with higher reading comprehension, the effect was significant, which means that they scored significantly higher on science items with higher word count.
Students' ability scores were weighted likelihood estimates (WLEs). The WLE for science was not used in our regression models, but only to present a more general impression of the correlation of reading comprehension, cognitive abilities, and science.
The following variables were dummy-coded: gender (0 = male, 1 = female), family language (0 = German, 1 = family language German, and another language or only other than German), and school track (0 = non-academic track, 1 = academic track). *p < 0.05; **p < 0.01
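The region-of-significance logic described for Fig. 1 can be sketched as follows. The conditional slope of word count at a given level of reading comprehension is the word count coefficient plus the interaction coefficient times the moderator value, and its standard error follows from the coefficients' sampling variances and covariance. All numeric inputs below are hypothetical and are not the estimates from Table 2; the function names are ours.

```python
import math


def conditional_slope(b_wc, b_int, moderator):
    """Slope of word count at a given (standardized) reading comprehension value."""
    return b_wc + b_int * moderator


def conditional_se(var_wc, var_int, cov, moderator):
    """Standard error of the conditional slope at the given moderator value."""
    return math.sqrt(var_wc + 2 * moderator * cov + moderator ** 2 * var_int)


def significance_boundaries(b_wc, b_int, var_wc, var_int, cov, z_crit=1.96):
    """Moderator values at which |slope / SE| equals z_crit (Johnson-Neyman points).

    Solves (b_wc + b_int * m)^2 = z_crit^2 * (var_wc + 2 * m * cov + m^2 * var_int).
    Which side of each boundary is significant must be checked by evaluating
    the ratio at a value on that side.
    """
    a = b_int ** 2 - z_crit ** 2 * var_int
    b = 2 * (b_wc * b_int - z_crit ** 2 * cov)
    c = b_wc ** 2 - z_crit ** 2 * var_wc
    discriminant = b * b - 4 * a * c
    if a == 0 or discriminant < 0:
        return []  # no pair of boundaries can be computed from the quadratic
    root = math.sqrt(discriminant)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])


# Hypothetical estimates (assumption): coefficients, variances, and covariance.
estimates = dict(b_wc=0.10, b_int=0.06, var_wc=0.0009, var_int=0.0004, cov=0.0)
boundaries = significance_boundaries(**estimates)
```

With these illustrative inputs, one boundary falls inside the plotted range of −2 SD to +2 SD, and the conditional slope is significantly positive above it, which mirrors the qualitative pattern described for Fig. 1.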

Discussion
The aim of this research was to investigate the effect of the relation between the word count in science items and students' reading comprehension on science performance. We expected the interaction of reading comprehension and word count of science items to affect science performance and examined our hypothesis by applying a multilevel logistic regression model drawing on a relatively large student sample. Our hypothesis was corroborated: The interaction between word count and reading comprehension affected science performance significantly even when controlling for students' cognitive abilities, gender, family language, and school track. In detail, the interaction indicated that particularly students with good reading comprehension benefitted from increased item word count in the investigated science measures, while there are no interaction effects on students with low reading comprehension. One could argue that the interaction is relatively small. However, it needs to be taken into consideration that interaction effects are difficult to detect in non-experimental studies, because they are usually quite small in the psychological research field (Champoux and Peters 1987;Chaplin 1991) and due to other reasons concerning assessment, such as unreliability of measures and the lack of pronounced profiles (e.g., Busemeyer and Jones 1983;McClelland and Judd 1993). Large sample sizes, such as in our analytic sample, help to detect these interaction effects if they truly exist (Trautwein et al. 2012) and, therefore, even small interaction effects can be important and should be interpreted in the research context (Chaplin 1991).

Theoretical and practical implications
Our findings contribute to the growing body of literature showing the importance of reading comprehension for science performance (e.g., Hall et al. 2014;O'Reilly and McNamara 2007). Furthermore, the significant interaction between reading comprehension and word count underlines that linguistic item characteristics may not have the same consequences for different test takers, but vary as a function of the test takers' prerequisites. Concerning the assessment of science performance, this result may be interpreted from at least two perspectives. First, one may argue that tests measuring science performance should not depend too strongly on the test takers' language skills, since this might systematically disadvantage particular social groups, such as non-native speakers (Bird and Welford 1995;Lee 2005;Prophet and Badede 2009) and thus may be seen as a serious threat to test validity. Second, language skills are often discussed as something inherent to scientific literacy (e.g., Härtig et al. 2015;Yore et al. 2004) and, thus, language skills may also be seen as an integral part of science performance measures. Regardless of the implications for science performance assessment, our findings demonstrate once again the importance of reading comprehension and linguistic features in students' academic life. While good readers benefit from their reading skills in other subjects, weak readers face obstacles not only in their linguistic competencies but also in other domains like science. Hence, to improve the science performance especially of weak readers, reading comprehension and reading strategies need to be trained, so weak readers do not continue to suffer from their below-average reading comprehension that may lay the foundations for their general academic achievement (e.g., Cooper et al. 2014;McGee et al. 2002;Savolainen et al. 2008). 
Consequently, reading comprehension and strategies to deal with difficulty-generating linguistic features should not only be taught in regular language classes, but should also be promoted in science education (or indeed throughout all classes). Providing instructional scaffolding techniques, for example, has been shown to be effective in improving students' reading comprehension (Huggins and Edwards 2011;Salem 2016). These instructional scaffolds work as a temporary guidance and support for students and are removed little by little over time, allowing students to acquire new skills and independence from their teachers, gaining freedom in their way of learning (Salem 2016).
Furthermore, the finding that increased item word count shows an increase in students' science performance for good readers sheds some light on the ambivalent reports of previous research. First, unlike Bird and Welford (1995), Cassels and Johnstone (1984), and Prophet and Badede (2009), we found that particularly good readers benefitted from increased word count that led to decreased item difficulty, which is in line with findings from Stiller et al. (2016). This might be due to the fact that longer tasks provide more information for students to solve the science items, which particularly good readers are able to benefit from by recognizing and utilizing the additionally allocated information (McNamara and O'Reilly 2010; Lau and Chan 2003).
Second, another explanation could be the key to the ambivalence of item word counts' importance: Irrelevant information (Cassels and Johnstone 1984;Prophet and Badede 2009) is sometimes included in academic achievement measures to distract students and to test their competence in capturing the essence of a task, being able to recognize and neglect irrelevant information. It may be the case that only students with high reading comprehension are able to do so, whereas students with lower reading comprehension fail to recognize relevant information that is embedded in irrelevant information.
Third, contextualized items may further explain our results. Contextual items are items that contain "supplemental information that precedes or follows the item question, such as a description of a lab setup, a natural phenomenon, or a practical problem" (Ruiz-Primo and Li 2016, p. 2). It has been argued that contextual items provide students with a context that is helpful to solve the task, because it makes test items more concrete and, therefore, realistic and relevant (Haladyna 1997). Some guidelines even suggest providing contextual information within an item, because it may be helpful for students (e.g., Kopriva 2000). Most science items in KESS are embedded in a context to test if students are able to apply their scientific knowledge into real-life contexts. Naturally, word count increases if a context is provided for an item. Considering our results, it seems possible that students with high reading comprehension benefit from items with an increased word count, because they benefit from the context provided in the item. Simultaneously, students with low reading comprehension may not be able to benefit from these contextualized items. While this reasoning seems plausible, Ruiz-Primo and Li (2016) state that much more research is needed in order to understand the underlying processes of dealing with contextual items since to date little is known about the link of contextual items and students' performance.
To the best of our knowledge, previous studies did not take into consideration that the effect of item word count on science performance might depend significantly on students' reading comprehension. While these studies usually focused on either reading comprehension or item word count (and further linguistic features), our study combined these two variables.

Limitations and future directions
Despite some notable strengths, such as the large sample of students, this study also has at least three noteworthy limitations that need to be taken into consideration when interpreting our results. First, the study was not conducted with an experimental design, and hence, causal conclusions cannot be drawn. Future research should focus on systematically modifying linguistic features in science measures. That said, the tests and the setting in this study were part of an existing large-scale assessment study, so one benefit of our research might be its high ecological validity for real test situations. Thus, our results complement findings from experimental research (e.g., Bird and Welford 1995; Cassels and Johnstone 1984; Höttecke et al. 2018; Prophet and Badede 2009), drawing a more complete picture of language in science education.
Second, we focused on word count as one particular linguistic feature that may contribute to variation in item difficulty, in order to shed some light on the ambivalent findings (e.g., Bird and Welford 1995; Stiller et al. 2016). However, previous research indicated that further linguistic features such as academic language, tenses, or negation may also affect science performance (Bird and Welford 1995; Cassels and Johnstone 1984; Höttecke et al. 2018; Prophet and Badede 2009). In this study, we were somewhat restricted in investigating various linguistic features because we conducted secondary data analyses, so the data and science tasks were predetermined. Coding the science tasks with regard to several linguistic features proved difficult since there was not enough linguistic variation in the available science measures.
Third, the question arises whether the results can be generalized to other student populations. KESS was conducted in Hamburg, Germany, assessing all students from a particular cohort. The test versions used to assess students' competencies were rotated by chance, and hence, it seems plausible that our analytic sample is representative of other students in Hamburg, Germany. We cannot guarantee that findings from our analytic sample generalize to students in other federal states, but because curricula are quite similar across all federal states in Germany, it seems plausible that our results can be generalized to students in Germany.

Conclusion
Our study complements previous research, demonstrating that students' reading comprehension and item word count, as one linguistic feature, affect science performance. Our results indicate that with increased reading comprehension and word count, students' science performance increases as well. Good readers in particular seem to be able to extract the substance of written text from more extensive tasks and use it to solve items correctly. To ensure that weak readers' lack of competence in one area does not disadvantage them in other domains, enhancing reading comprehension in schools should be seen as a task not only for language education but also for science education, or even for all subjects. Further research is needed to test these findings in an experimental design and to consider further linguistic features.