Introduction

There is wide agreement among scholars that academic and scientific language skills are highly important for communicating, processing, comprehending, and applying scientific knowledge (e.g. Schleppegrell, 2004; Snow, 2010; Stoffelsma & Spooren, 2019). On the one hand, academic and scientific language supports teaching and learning because of its precise and condensed nature. On the other hand, the special features of academic and scientific language differ from students’ everyday linguistic repertoire, which creates challenges for learning the necessary language alongside the more content-focused competencies in science education. Thus, academic language can be assumed to play a crucial role by either supporting or hindering science learning and understanding, which might lead to unequal learning opportunities. In fact, current research indicates influences of reading proficiency and academic language skills on performance in mathematics and science assessments (Cooper et al., 2014; Härtig et al., 2015; Prediger et al., 2015), suggesting that high linguistic demands of items in science assessments hinder students from demonstrating their full potential. Moderate-to-strong correlations between text comprehension and science performance in large-scale assessments corroborate this finding (Cromley, 2009; Korpershoek et al., 2015; O’Reilly & McNamara, 2007; Yore et al., 2004). A reason for this might be that items used in science assessments evaluate students’ competencies and knowledge (e.g. Organization for Economic Cooperation and Development [OECD], 2017), which requires students to read and comprehend text. However, studies that systematically vary the linguistic demands of science and mathematics tasks—predominantly on the level of lexis and syntax—to measure the influence of linguistic features on item difficulty have found small, inconsistent, or even expertise-reversal effects of increased text cohesion for students with high prior subject-specific knowledge (e.g. Haag et al., 2015; Höttecke et al., 2018; Kieffer et al., 2009). These results indicate that the linguistic presentation of test items only marginally affects an average student’s ability to comprehend and answer them. Furthermore, these results cannot be attributed to text comprehension alone, since the items used in these studies targeted subject-specific procedural skills and text comprehension at the same time without systematically separating these factors; variance in test performance is thus mostly due to subject-specific procedural knowledge and skills. The high correlation of reading proficiency and science performance in large-scale assessments might therefore indicate an influence of students’ language skills on their learning and understanding in science.

We assume that the effect of linguistic demand is task-specific and depends on the subject-specific context. We argue that expository text for knowledge-building—typical of physics textbooks—places greater demands on students’ text comprehension, which might result in stronger effects of a text’s linguistic presentation on students’ comprehension compared to items in an assessment. Research on the effect of linguistic features on the comprehension of expository text is far from coherent (e.g. Härtig et al., 2019; McNamara & Kintsch, 1996; Schmitz, 2015). Additionally, analyses of German science schoolbooks show that their linguistic demands are generally high and do not increase systematically across grades (e.g. Berendes et al., 2017; Graesser et al., 2011; Kohnen et al., 2017a). Thus, the linguistic demands of schoolbooks are not sufficiently adapted to the language proficiencies of students. As a result, students perceive the complexity of textbook texts as high (Härtig et al., 2012a, 2012b; Starauschek, 2003). Therefore, our research aims at a better understanding of the effect of linguistic demands on science learning and understanding by drawing on students’ comprehension of expository text in physics. In view of inconsistent research results, we ask the following general research question: Does a systematic variation of linguistic demands in expository physics text affect students’ text comprehension? To answer this research question, we first outline the model of text comprehensibility (Göpferich, 2009) used to draft the texts and isolate the effect of linguistic demand on text comprehension. Furthermore, a theoretically and empirically derived heuristic model of linguistic demands in German, used to systematically vary linguistic features, is presented. After the theoretical framework, the development of our science text comprehension instrument is described, followed by results and discussion.

Theoretical Framework

Text Comprehension

Research has shown that text comprehension is a complex interplay between properties of a text and of its reader (Schnotz, 1994). Text comprehension is the result of successful reading; it requires bottom-up processes in which information is decoded from the text’s linguistic structure. In addition, readers relate, compare, and integrate textual information top-down into their prior knowledge to construct a mental representation of the textual information. Consequently, text comprehension means gaining and applying knowledge at the same time. While the general aim of school should be to support students in developing the skills to understand domain-specific academic text, scholars have suggested that expository text be adapted or even simplified to the level of students’ reading proficiency to facilitate learning (Göpferich, 2009; Hsu & Yang, 2007; Schleppegrell, 2004).

Model of Text Comprehensibility

Several measures have been suggested to operationalize text difficulty based solely on textual features, without acknowledging the role the reader plays. So-called readability scores based on the length of words and the number of words per sentence are limited since they take only a text’s surface features into account. Comprehensibility models go beyond readability scores by acknowledging the role of lexis, syntax, and semantics in text comprehension (Göpferich, 2009). Such models have been criticized as being imprecise, incomplete, or of limited practicability for researchers outside linguistics (Lutz, 2015). The comprehensibility model by Göpferich (2009) overcomes these problems and was therefore chosen to draft the texts used in our study. It covers six dimensions:

  • Structure concerns the internal semiotic order, consistency, and cohesion of information presented across a text

  • Brevity/conciseness refers to the ratio of information represented to the length of a text; hence, a rather concise text refrains from any additional information

  • Stimulating additives might stimulate a reader; yet, too many additions might be overwhelming and divert the reader’s attention

  • Simplicity concerns the linguistic demands (e.g. choice of words and sentence structure)

  • Accuracy specifies the degree to which a text is free of contradictions, technically correct, and follows the conventions of a genre-appropriate textual function

  • Perceptibility generally means the ease with which a text is perceived depending on its visual appearance, including the font, enumerations, and any further typographic elements

The highest level of comprehensibility is not simply located at the maximum of each dimension; instead, the optimum of each dimension depends on characteristics such as the type of text, its purpose, and its target group. The linguistic demands of any text, and thus the peculiarities of scientific and academic language, belong to the simplicity dimension. Since all dimensions are interconnected, a single dimension cannot be changed without affecting the others.

Perception of Text Complexity

An additional feature that may influence the impact of textual demands on text comprehension is how complex a text is perceived to be by its readers; a reader’s perception of text complexity and comprehensibility affects their reading strategies. A perceived high complexity of a text might challenge a reader to read with increased care and attention, which in turn results in higher text comprehension. In fact, this might cause and explain the expertise-reversal effect mentioned above, since readers with high prior knowledge might feel underchallenged by a text’s high cohesion, leading to superficial text scanning and a lack of comprehension. Yet studies focusing on students’ perceived text complexity usually lack data about their actual text comprehension. Härtig et al. (2012a, 2012b) have shown that students perceive expository text in science textbooks as rather complex and incomprehensible. A comparative survey of two science textbooks showed that higher text cohesion decreases the level of perceived text complexity (Starauschek, 2003). Yet in an experimental study in which science texts were varied on two levels of local substantival text cohesion (i.e. neighboring sentences connected by the same noun), no effect on perceived complexity was found (Starauschek, 2006). Tolochko et al. (2019) have shown that perceived complexity is affected only by a text’s semantics, not by its syntax. A reason might be that students struggle to rate a text’s syntax without prior training (Funke et al., 2013). In brief, research has not clearly indicated any effect of perceived complexity on text comprehension. Therefore, the present study investigates the effects of linguistic demand on both text comprehension and perceived complexity.

Model of Linguistic Demand

In this section, we briefly outline that research has hitherto not led to clear results as to which linguistic features affect students’ science test performance or their science text comprehension. Based on findings from these two fields of research, a model is outlined that allows for a systematic variation of linguistic demand along three major aspects.

Influence of Linguistic Features on Item Difficulty

Several studies from educational as well as cognitive psychology have varied linguistic features of items to isolate the effect of linguistic demand on test performance. A high density of technical terms has been found to decrease test performance (e.g. Cassels, 1980; Schiemann, 2011; Snow, 2010; Stiller et al., 2016). The use of the passive voice has been shown to generate difficulty (e.g. Berndt et al., 2004; Bransford & Johnson, 2014; Cornelis, 1996), but it has also been found to have no effect (Cassels, 1980). The simultaneous variation of numerous linguistic item features has shown either small (Kettler et al., 2012; Plath & Leiss, 2017; Prophet & Badede, 2009), inconsistent (Höttecke et al., 2018), or even contradictory effects (Rivera & Stansfield, 2004). Meta-studies have corroborated the finding that the effect of linguistic demands of items is at most small (Kieffer et al., 2012; Pennock-Roman & Rivera, 2011). Pöhler et al. (2017) even revealed that a variation of how an item was presented (i.e. lacking any context, visually contextualized, or textually contextualized) had no effect on its difficulty, indicating that cognitive complexity has a higher impact on science performance and text comprehension than linguistic demand.

Another approach to identifying complex language features is the linguistic post hoc analysis of data from large-scale assessments. Heppt et al. (2015) found academic terms, words with three or more syllables, and word compounds to be difficult, while word count, complex sentence structures, and noun phrases were nonsignificant. Cruz Neri et al. (2021), on the other hand, identified word count as having a positive effect on the performance of students with high reading proficiency. Dempster and Reddy (2007) found that complex sentence structures and unfamiliar words lead to increased item difficulty. Prediger et al. (2015) have shown that linguistic demands have an effect and that, while complex or unfamiliar words were hardly regarded as challenging, complex sentence structures were.

Effects of Linguistic Features on Comprehension of Expository Text

Scholars have identified specific (i.e. mostly syntactical) linguistic features of science text as being difficult (Kareva & Echevarria, 2013), yet studies in this area are rare. Deppner (1989) simplified science text by reducing redundancy and the complexity of sentence structures and by using more frequently occurring words. She could not find any general effects on the text comprehension of grade 8 students except for those with Turkish as their heritage language. Studies on the semantic structure of science text have shown that text cohesion has either a positive (McNamara & Kintsch, 1996; Schmitz, 2015) or no effect on text comprehension (Kohnen et al., 2017b; Rabe & Mikelskis, 2007). In addition, Hsu and Yang (2007) have shown that a higher level of numerous types of text cohesion (i.e. lexical, semantical, syntactical, text-picture relation) results in higher text comprehension. Studies taking a deeper look into the effect of text cohesion have used readers’ language proficiency and prior knowledge as controls. They show that readers with high prior knowledge comprehend a text better if cohesion is low (Kamalski et al., 2008; Linderholm et al., 2008; McNamara & Kintsch, 1996; McNamara et al., 1996). This expertise-reversal effect has been challenged by evidence presented by others (Schmitz, 2015). If both general reading proficiency and prior knowledge are considered, different interactive effects of text cohesion on text comprehension have been found (O’Reilly & McNamara, 2007; Ozuru et al., 2009) yet also refuted (Schmitz, 2015).

Previous research has not led to coherent results and generally points toward small effects of linguistic features on item difficulty as well as text comprehension. Thus, further research is needed to show whether, and which, specific linguistic features of academic and scientific language influence text comprehension. Only the density of technical terms has consistently been found to decrease students’ performance in assessments. Linguistic text features such as cohesion, passive voice, unfamiliar (low-frequency) words, words with numerous syllables, word count, noun and participial phrases, and complex sentence structure usually create small effects on students’ comprehension; these features are considered in the model used for the systematic variation of linguistic demands in this study. It is based upon a model developed by an interdisciplinary group of scholars in linguistics, science education, and empirical educational research, presented in detail elsewhere (e.g. Heine et al., 2018; Plath & Leiss, 2017). It considers the specifics of academic language used for teaching and learning that are frequently assumed to generate textual difficulty (Schleppegrell, 2004) by applying three major aspects of linguistic demands.

A Model of Linguistic Demands Based on Three Major Aspects

Our model systematically varies numerous linguistic features that have been shown to create higher cognitive load in eye-tracking, reading-time, and EEG studies or to require a broader linguistic knowledge base of peripheral and less frequent linguistic elements along three major aspects (Heine et al., 2018):

  1. Structural complexity represents the cognitive load determined by the quality and quantity of cognitive processes during reading. That is, the longer a particular piece of information needs to be actively held in working memory until a unit of information (e.g. a sentence) can finally be processed, the higher the cognitive load. This is the case when a noun and an associated pronoun or relative clause are rather distant from each other (Gabler, 2013; Konieczny, 2000) or when particle verbs are used in long constructions (Levy & Keller, 2013; Levy et al., 2012). In addition, complexity is increased by nominalizations and nominal phrases (Fang et al., 2006; Kliegl et al., 2004; Solomyak & Marantz, 2010).

  2. Transparency describes how unambiguously information is presented. Processing semantically complex information (Fusté-Herrmann, 2008; Štekauer, 2005)—such as words with double meanings (Köhne et al., 2015), idiomatic phrases (Irujo, 1993), the passive voice (Berndt et al., 2004), and specific academic vocabulary (Cassels, 1980; Schiemann, 2011; Snow, 2010; Stiller et al., 2016)—requires knowledge of less obvious meanings. As a result, the cognitive processing of such information is less automatic.

  3. Frequency, which often correlates with structural complexity and transparency, indicates how frequently a word (e.g. home = frequent, electron = less frequent), clause type (e.g. canonical word order = more frequent than noncanonical word order, active voice = more frequent than passive voice, etc.), or grammatical structure is used in a particular language. Highly frequent structures create less cognitive load because they allow for automatic processing (Hulstijn, 2015).

Considering these three aspects, Heine et al. (2018) presented specific principles to establish three levels of linguistic demand, the cutoff points of which are meant to be interpretative and dependent on the purpose and target group of a text. The use of technical terms essential to an understanding of the content is held constant across all three levels of linguistic demand, due to the expected strong effects of these terms on text comprehensibility and because technical terms cannot be translated into linguistic structures of lower demand without loss of specific meaning. The three-level operationalization of the model (i.e. low, medium, and high demand of text with the same content) has previously been shown to produce only small effects on student performance in physics and mathematics test items, and these effects were not coherent across all items (Höttecke et al., 2018; Plath & Leiss, 2017). However, the model can be regarded as theoretically solid; it uses linguistic features that have been shown to create processing difficulty and assumes that the simultaneous variation of several of these features creates a cumulatively higher cognitive load, which in turn results in measurably lower comprehension (e.g. Paas et al., 2003).

Hypotheses of Present Study

As demonstrated in the previous sections, research has not shown coherent results regarding whether, and if so how, linguistic features of a text affect students’ text comprehension. The present study tested two hypotheses: (H1) A higher linguistic demand of an expository science text results in lower text comprehension. We expected small effects of linguistic demand (independent variable) on students’ text comprehension in science (dependent variable). Given the above considerations about the possible mediating role of perceived linguistic complexity and comprehensibility, we furthermore examined whether (H2) perceived comprehensibility confounds with students’ text comprehension in science.

Method

In two qualitative pre-studies and one quantitative pre-study, an instrument to measure German secondary students’ science text comprehension was developed and validated. As part of the validation, students’ general reading proficiency (LGVT instrument; Schneider et al., 2007) and prior knowledge of the physics content (Einhaus, 2007) served as control variables and were correlated with science text comprehension. The basic idea of the LGVT is to present participants with a narrative text in which numerous words are blanked out; the correct word must then be chosen from three options. Because prior domain-specific knowledge and general reading proficiency are among the strongest predictors of learning in general and text comprehension in particular (e.g. DIME model: Cromley & Azevedo, 2007; Schmitz, 2015; Shapiro, 2004), we expected moderate-to-high correlations of students’ science text comprehension with their general reading proficiency and prior knowledge.

We presented the validated instrument, consisting of three introductory texts on three subtopics of thermodynamics (i.e. particle model, thermal expansion, perception of heat) and a total of 27 single-select, multiple-choice items, to 812 secondary school students (age M = 13.1, SD = 0.87; 51.7% female; 86.9% academic school track; 39.6% grade 7; 43.0% grade 8; 17.3% grade 9). All items targeted students’ text comprehension. The sample size exceeded the 690 participants required by an a priori power analysis for a one-way analysis of variance (ANOVA) with three groups conducted with G*Power (α = 0.05, 1 − β = 0.95, expected small effect size f = 0.15). Most students in our sample spoke German at home (75.1%); a minority (24.8%) used at least one additional language at home.
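For transparency, the sample-size requirement can be re-derived outside G*Power. The following minimal sketch uses statsmodels instead of G*Power (our own choice of tool, not the one used in the study) with the parameters reported above; both programs use Cohen’s f as the effect-size metric.

```python
# Re-check of the a priori power analysis reported above:
# one-way ANOVA, three groups, alpha = 0.05, power = 0.95, f = 0.15.
from statsmodels.stats.power import FTestAnovaPower

n_total = FTestAnovaPower().solve_power(
    effect_size=0.15,  # expected small effect size (Cohen's f)
    k_groups=3,        # three levels of linguistic demand (A, B, C)
    alpha=0.05,
    power=0.95,
)
print(round(n_total))  # approx. 690 participants in total, as reported
```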

As shown in the schematic schedule of the main study (Fig. 1), the order of the three subtopics (i.e. particle model, thermal expansion, perception of heat) and the corresponding text comprehension items was the same for each student. The linguistic demand (i.e. A = low, B = medium, C = high) was systematically rotated to reduce order effects of linguistic demand and to control for learning effects concerning the physics content.

Fig. 1

Schematic schedule of the main study including the booklet design of the science text comprehension test

Materials

Development of Text Variants

First, we applied stable comprehensibility measures according to Göpferich (2009) to maintain a similarly high degree of well-formedness/comprehensibility for all texts and to control for any additional variance that might otherwise overshadow an effect of linguistic demand on text comprehension (Fig. 2, left box). To this aim, the application of Göpferich’s dimensions to the four text versions on issues in thermodynamics was checked, rechecked, and verified by a group of trained linguistic experts (Hackemann, 2022; Fig. 2, left cycle). Figure 2 presents all steps followed in the development of the text variants in our study.

Fig. 2

Development process of text variants

All texts begin with a description and explanation of an everyday physical phenomenon, followed by three sections that address frequent pre-instructional ideas and elaborate further on the phenomenon (cf. structure, stimulating additives). All information presented concerns only the phenomenon at stake, while narrative elements are avoided (cf. brevity). Each section begins with a subheading to clarify the content structure (cf. perceptibility). Through this structure, the texts partly fit the definition of refutation text, which has been empirically shown to foster learning (e.g. Ariasi & Mason, 2014; Mason et al., 2019). Classic refutation texts make students’ pre-instructional ideas about a particular topic explicit, argue directly against them, and finally introduce a new scientific concept as a viable way of thinking (Ariasi & Mason, 2014). The text structures in our study differ from refutation text in that students’ pre-instructional ideas are merely addressed as a guiding principle for a physics explanation instead of being central.

Although textbooks are usually multicoded and encompass text as well as images, tables, charts, and the like, this study is based upon plain text only in order to isolate the single effect of linguistic demand (e.g. Brown & Hudson, 1998; Hung, 2014; Jian, 2019). We decided against extracting expository text from German physics textbooks because such texts hardly meet the high baseline of comprehensibility according to Göpferich (2009) and because they have been found difficult to comprehend (e.g. Härtig et al., 2012a, 2012b).

Qualitative Validation of Texts’ Comprehensibility

To ensure the overall comprehensibility and perceptibility of all texts in our study, a pre-study was conducted based upon focused interviews (Merton & Kendall, 1979). In total, 16 grade 8 students first read one of the draft texts and then answered a series of questions concerning text comprehensibility and text perception. The results of the qualitative content analysis following the guidelines of Boyatzis (1998) and Mayring (2015) (Cohen’s κ = 0.82) showed that one text version was less comprehensible than the others; it was therefore deleted from the text pool. The remaining three texts were rated by the students as more comprehensible than typical textbooks (Hackemann, 2022); students’ estimations of text comprehensibility fit well with the comprehensibility predicted by the model of Göpferich (2009). Additionally, minor inconsistencies and unanticipated comprehension problems (e.g. with the technical term matter) led to further improvements to the remaining three texts.
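Inter-rater agreement of the kind reported above (Cohen’s κ) can be computed in a few lines. The following sketch uses hypothetical category labels, not the study’s actual coding scheme:

```python
# Cohen's kappa for two raters coding the same interview segments;
# the labels below are illustrative placeholders.
from sklearn.metrics import cohen_kappa_score

rater_1 = ["comprehensible", "incomprehensible", "comprehensible", "comprehensible"]
rater_2 = ["comprehensible", "incomprehensible", "comprehensible", "incomprehensible"]

kappa = cohen_kappa_score(rater_1, rater_2)  # agreement corrected for chance
print(round(kappa, 2))
```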

Validation of Implementation of the Model of Linguistic Demands

The three texts developed up to this point had to be varied across levels of linguistic demand. This variation was collaboratively and iteratively discussed by experts in physics education and linguistics to ensure that the further comprehensibility dimensions (e.g. structure, brevity) as well as the quality of the physics content (e.g. accuracy) were held constant (Fig. 2, right cycle). Table 1 contains a short, translated section of text that shows how the linguistic demands were varied.

Table 1 Carefully translated section from the particle model text varied on three levels of linguistic demand

Table 2 shows detailed information about the variants of linguistic demand of the texts used in our study and reports mean frequencies of the linguistic features varied across the three levels. Word frequency was estimated through one of the standard language corpora for German (https://wortschatz.uni-leipzig.de/de), which rates word frequency on a scale from 1 (very frequent) to 25 (very rare); all domain-specific technical terms were excluded from the analysis. Three frequency bands—low (1–9), medium (10–13), and high (> 13)—were used to match the frequency levels to the overall categories of linguistic demand: A (low), B (medium), and C (high). All presented values matched our model.
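As an illustration of the frequency banding just described, the following sketch maps a Leipzig corpus frequency class onto the three categories of linguistic demand; the function name and return strings are hypothetical, chosen only for readability:

```python
# Map a word's corpus frequency class (1 = very frequent ... 25 = very rare)
# onto the three categories of linguistic demand used in the study.
def demand_level(frequency_class: int) -> str:
    """Return the linguistic-demand category for a frequency class."""
    if 1 <= frequency_class <= 9:
        return "A (low)"
    if 10 <= frequency_class <= 13:
        return "B (medium)"
    if 14 <= frequency_class <= 25:
        return "C (high)"
    raise ValueError(f"invalid frequency class: {frequency_class}")

print(demand_level(4))   # frequent word  -> "A (low)"
print(demand_level(17))  # rare word      -> "C (high)"
```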

Table 2 Variation of linguistic features on three levels of linguistic demand applied in the study

As part of the validation, scores for readability and cohesion (Kulgemeyer & Starauschek, 2014) were calculated; they are shown in Table 3. With few exceptions, the values indicate that the implementation of the model applied in this study (Heine et al., 2018) does indeed result in the expected variation of linguistic demand. Values for word count, words with three or more syllables, and the Vienna expository text formula (Bamberger & Vanacek, 1984) describe the structural complexity of a text and, therefore, increase as expected with the level of linguistic demand from A across B to C. However, local text cohesion of nouns per sentence ran contrary to the model. This can be explained by the drastically reduced number of sentences in the B-variants and particularly in the C-variants (see word count per sentence in Table 2). To take this into consideration, we established the local text cohesion of nouns per clause in addition to the score provided by Kulgemeyer and Starauschek (2014). We assumed that the ratio of neighboring clauses connected by the same noun is a stronger indicator of a text’s cohesion than the ratio of neighboring sentences connected by the same noun, especially when the complexity of sentence structures increases. Table 3 reports values for the local text cohesion of nouns per clause, corroborating the intention of the model. Additionally, the values of the mean ratio of factual terms corroborated the model by being almost constant across levels of linguistic demand.
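The nouns-per-clause cohesion measure described above can be sketched as the share of adjacent clauses sharing at least one noun. Clause segmentation and noun identification are assumed to have been done beforehand; the input format below is our illustrative simplification, not the procedure of Kulgemeyer and Starauschek (2014):

```python
# Local text cohesion: ratio of neighboring clauses connected by the same noun.
def noun_cohesion(nouns_per_clause: list[set[str]]) -> float:
    """Ratio of adjacent clause pairs that share at least one noun."""
    pairs = list(zip(nouns_per_clause, nouns_per_clause[1:]))
    if not pairs:
        return 0.0
    connected = sum(1 for a, b in pairs if a & b)  # non-empty intersection
    return connected / len(pairs)

# Toy example: clauses 1-2 share "particle"; clauses 2-3 share nothing.
print(noun_cohesion([{"particle", "model"}, {"particle"}, {"heat"}]))  # 0.5
```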

Table 3 Indicators for the comprehensibility of expository science text by Kulgemeyer and Starauschek (2014) applied to expository physics text to validate level of linguistic demand

To summarize, all texts developed for our study meet a high baseline of comprehensibility (Fig. 2: left cycle). They address three different topics in thermodynamics. All texts were systematically varied on three levels of linguistic demand (Fig. 2: right cycle). This was done by systematically varying linguistic features frequently assumed to create higher structural complexity, higher cognitive load, and lower transparency (Fig. 2: model of linguistic demand). Our major expectation was that this variation would result in different levels of student comprehension (H1).

Measures

Development of Science Text Comprehension Items

Mullis et al. (2015) presented four comprehension processes used in the PIRLS framework, which differ only marginally from the processes applied in other studies: “Focus on and Retrieve Explicitly Stated Information; Make Straightforward Inferences; Interpret and Integrate Ideas and Information; Evaluate and Critique Content and Textual Elements” (p. 13). To assess the construct of science text comprehension (dependent variable), we applied the first three cognitive processes but not the fourth, since it would require readers to make a justified judgment based upon their wider understanding of the world, which is not a focus of our study. Each item was presented with one attractor and four distractors. For illustration, a translated item is presented below; it addresses the lowest cognitive process (i.e. focusing on and retrieving explicitly stated information) and refers to the section presented in Table 1:

How long has the particle model existed?

  • More than 20 years.

  • More than 100 years.

  • More than 200 years.

  • More than 500 years.

  • More than 2000 years.

To validate our instrument, we conducted two pre-studies that are described below.

Qualitative Validation of Science Text Comprehension Items. First, the quality, general comprehension, and perception of 51 items were evaluated by applying the method of stimulated recall (Gass & Mackey, 2017). For this purpose, 32 secondary school students read one text variant and then answered 10 items each. After completing an item, students were asked to justify their choice and to state which knowledge they thought they had applied to solve the item (e.g. which sections of the text appeared to matter to them). If an item was answered incorrectly, students were immediately provided with the correct answer and were asked if—and if so, why—they changed their mind. By applying a qualitative content analysis (Cohen’s κ = 0.85; Mayring, 2015), we examined students’ verbal explanations and their answers to the multiple-choice items to determine whether the chosen answer was consistent with the explanation given. Overall, 37 items functioned as intended and appeared to be comprehensible, while 7 items were revised by improving either the prompt or one of the multiple-choice options. Another 7 items were eliminated from the pool, mostly because students were able to choose the right answer without any reference to the text.

Quantitative Confirmatory Analyses of Science Text Comprehension Test. Second, we evaluated the item quality of our science text comprehension test in a quantitative pilot study with 147 secondary school students, who each answered 11 prior knowledge items, read three texts, and then answered 45 text comprehension items. The results were used to either revise or delete items by checking item and test quality values from the IRT analysis (e.g. MNSQ fit, one-dimensionality, differential item functioning) for both tests separately.

As expected, most items of the science text comprehension test met the requirements of IRT modeling (Robitzsch et al., 2020; Rost, 2004). Nine items were eliminated because either (a) their MNSQ infit was outside the range of 0.8 to 1.2, (b) their MNSQ outfit was outside the range of 0.7 to 1.3, or (c) they showed differential item functioning with the split criteria gender, school track, or chance in the Mantel–Haenszel test with p ≥ 0.05. The final item pool reached the required values in the DETECT test for one-dimensionality (weighted DETECT = 0.32, ASSI = 0.15, RATIO = 0.22, where values of DETECT > 0.40, ASSI > 0.25, and RATIO > 0.26 are interpreted as indicating multidimensionality; Zhang & Stout, 1999). These results supported construct validity since only a small number of misfitting items occurred and the test proved to be one-dimensional.
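The fit-based screening in criteria (a) and (b) above amounts to a simple range filter on the MNSQ statistics. A minimal sketch with made-up fit values follows; the study’s IRT software is the one cited above, and pandas here is only an illustrative substitute for the filtering step (the DIF check is omitted):

```python
# Keep items with 0.8 <= MNSQ infit <= 1.2 and 0.7 <= MNSQ outfit <= 1.3.
import pandas as pd

items = pd.DataFrame({
    "item":   ["A1", "A2", "T1"],
    "infit":  [0.95, 1.31, 1.05],   # hypothetical MNSQ infit values
    "outfit": [1.02, 1.40, 0.72],   # hypothetical MNSQ outfit values
})

keep = items[items["infit"].between(0.8, 1.2) & items["outfit"].between(0.7, 1.3)]
print(keep["item"].tolist())  # ['A1', 'T1']; A2 would be eliminated
```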

Multiple-choice options were either rephrased or reordered if students with a high weighted likelihood estimation (WLE) of person ability chose a distractor rather than the correct attractor. This was the case when the point-biserial correlation between person WLE and an item’s attractor was r_pb.WLE ≤ 0.3, or for distractors when r_pb.WLE > 0. The same procedure was applied if the relative frequency of a distractor was below 5%. After deletion and reformulation, 35 items remained in the item pool for the main study. In the next step, items addressing very similar physics content were excluded because of test time constraints in the main study, leading to a total of 27 items (9 per text). The aims of this procedure were to address as broad a range of cognitive abilities as possible and to distribute item difficulties equally across the texts.
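The distractor criterion above rests on the point-biserial correlation between choosing an option (coded 0/1) and the person’s WLE ability estimate. A minimal sketch with made-up data (scipy is our choice of tool; the codings are hypothetical):

```python
# Point-biserial correlation between option choice and WLE person ability.
from scipy.stats import pointbiserialr

chose_option = [1, 0, 1, 1, 0, 0, 1, 0]                    # 1 = option picked
wle_ability = [0.9, -0.4, 1.2, 0.3, -1.1, -0.2, 0.7, -0.8]  # WLE estimates

r_pb, p_value = pointbiserialr(chose_option, wle_ability)
# Rule above: an attractor should reach r_pb > 0.3; a distractor r_pb <= 0.
print(round(r_pb, 2))
```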

Results

The prior knowledge test met all requirements of IRT modeling stated above; therefore, all items were used to estimate item and person parameters. The WLE reliability of the prior knowledge test turned out to be low (R = 0.37) due to the small number of items, with a mean person parameter of M = 0.001 (SD = 0.95) and a mean item difficulty of M = 0.49 (SD = 1.27). Due to the restricted test time, administering more items in the prior knowledge test would have resulted in fewer items in the science text comprehension test. This was not an option for us since text comprehension was our dependent variable.

Two items of the science text comprehension test showed values outside the range of 0.8 ≤ MNSQ infit ≤ 1.2, thereby violating the IRT modeling requirements; they were excluded from further analysis. The remaining 25 items met all requirements of the IRT model and were used to estimate item and person parameters. For each item, three difficulty values were estimated in accordance with the corresponding linguistic demand A, B, or C of the text read before answering the item. The WLE reliability of the science text comprehension test was sufficient (R = 0.78) with a mean person parameter of M = 0.01 (SD = 1.08) and a mean item difficulty of M = −0.41 (SD = 0.80).

Correlations of Science Text Comprehension with Further Constructs

We first checked whether the results of our science text comprehension test correlated with person traits that served as control variables, using Spearman rank correlations. There were positive correlations between science text comprehension and reading proficiency (r = 0.39), prior knowledge (r = 0.50), cultural capital (r = 0.22), class grade (r = 0.24), and school track (non-academic/academic; r = 0.33), all at p < 0.001. Cultural capital was operationalized by a books-at-home index; the number of books at home has been shown to be a strong predictor of students’ reading proficiency (e.g. Mullis et al., 2015). As expected, reading proficiency and prior knowledge were the strongest predictors of science text comprehension. In accordance with Härtig et al. (2012a, 2012b) and Höttecke et al. (2017), the moderate correlation between general reading proficiency and science text comprehension indicates that the two are separate but related constructs. Unsurprisingly, prior knowledge had the highest correlation since we only selected items from Einhaus’ (2007) thermodynamics competence test that corresponded to the subtopics addressed by our instrument. All correlations were moderate and corroborated the validity of our science text comprehension test in addition to the results of the pre-studies. Additionally, multilingual students who speak more than one language at home and students who only speak German at home achieved similar results, whereas multilingual students who speak a single language other than German at home achieved significantly lower results (F(2, 739) = 6.86, p < 0.001, with post hoc Tukey test at p < 0.05). Nevertheless, the latter result should not be overstated since research has shown that speaking a single language other than German at home confounds with lower reading proficiency, prior knowledge, cultural capital, and chosen school track. While multilingualism appears at first sight to correlate with science text comprehension, regression modeling indicates a minor role of multilingualism once reading proficiency and prior knowledge are controlled for, in line with prior research (e.g. OECD, 2017; Prediger et al., 2015) and our own results (Hackemann, 2022).
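The control-variable checks above are Spearman rank correlations; a minimal sketch with made-up data (the variable values are hypothetical, shown here only for reading proficiency):

```python
# Spearman rank correlation between text comprehension and a control variable.
from scipy.stats import spearmanr

text_comprehension = [0.4, -0.2, 1.1, 0.6, -0.9, 0.1]  # WLE person estimates
reading_proficiency = [12, 7, 18, 14, 5, 9]            # hypothetical LGVT scores

rho, p_value = spearmanr(text_comprehension, reading_proficiency)
print(round(rho, 2), round(p_value, 3))
```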

Effect of Linguistic Demands on Science Text Comprehension (H1)

Our expectation that a text drafted with a high linguistic demand would lead to lower text comprehension among students (H1) must be rejected. Table 4 shows a slight increase in mean item difficulty from linguistic demand A across B to C, as expected, but this effect is not significant.

Table 4 Results of one-way analyses of variance of item difficulty and solution frequency across levels of linguistic demand

Additionally, no significant differences in item solution frequency were found in subgroups of poor to strong readers and of students with low to high prior knowledge (split at the median of prior knowledge person ability and of reading proficiency). Even when computing one-way ANOVAs comparing the top and bottom quantiles of students’ reading proficiency, no differences in item solution frequency across linguistic demands were found (top quantile: F(2, 72) = 0.15, p = 0.86; bottom quantile: F(2, 72) = 0.30, p = 0.74), nor for the top (F(2, 72) = 0.02; p = 0.98) and bottom (F(2, 72) = 0.02; p = 0.98) quantiles regarding prior knowledge, nor for any other computed subgroup. Additionally, no significant differences in item difficulty across linguistic demand could be identified for any isolated text (Hackemann, 2022).
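The subgroup comparisons above are one-way ANOVAs of item solution frequencies across the three demand levels. A minimal sketch with made-up per-item values:

```python
# One-way ANOVA of item solution frequencies across demand levels A, B, C.
from scipy.stats import f_oneway

solution_freq_A = [0.71, 0.64, 0.80, 0.58]  # hypothetical per-item values
solution_freq_B = [0.69, 0.66, 0.77, 0.55]
solution_freq_C = [0.67, 0.61, 0.75, 0.54]

F, p = f_oneway(solution_freq_A, solution_freq_B, solution_freq_C)
print(round(F, 2), round(p, 2))  # a nonsignificant result mirrors the study
```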

Figure 3 shows the Wright map of all computed item difficulty values of our science text comprehension test, ordered by subtopic in thermodynamics. Item difficulty values corresponding to the same item under different linguistic demands are connected by black lines to accentuate differences across the three levels of linguistic demand. Some of the black lines indicate the expected influence of linguistic complexity on item difficulty across levels of linguistic demand (e.g. item A7), while others show the opposite (e.g. T7) or almost no difference (e.g. A15). Difficulty values of items referring to the text presented last were generally higher than the others (F(2, 72) = 3.80; p = 0.03). This is likely an order effect: students’ attention might have decreased across test time, resulting in higher item difficulty values.

Fig. 3

Wright map of the science text comprehension test presenting item difficulty values and the density of students’ person parameters

Note. Item difficulty values corresponding to the same item but to different levels of linguistic demand (A-B-C) are connected by black lines. Items starting with A refer to the text particle model, items starting with T to the text thermal expansion, and items starting with W to the text perception of heat.

Perceived Comprehensibility of Our Instrument (H2)

As expected, our results indicate that students’ perceived degree of comprehensibility was affected by the linguistic demands. Text comprehensibility was assessed by asking all students to rate the texts on a scale from 1 (very comprehensible) to 6 (not comprehensible). Students’ perceived comprehensibility was 2.50 (SD = 1.12) for the low linguistic demand A, 2.52 (SD = 1.11) for the medium linguistic demand B, and 2.88 (SD = 1.2) for the high linguistic demand C. A one-way within-subject ANOVA showed that the effect of linguistic demand on students’ comprehensibility rating, as an indicator of their perceived complexity, was significant (F(2, 1512) = 40.24; p < 0.001). Paired-samples t tests were applied as post hoc comparisons to investigate whether the levels of linguistic demand resulted in significant differences in perceived comprehensibility. Significant differences in perceived comprehensibility were found between variants A and C (t(754) = −7.82; p < 0.001) as well as B and C (t(754) = −7.26; p < 0.001); the difference between A and B failed to reach significance (t(754) = −0.53; p = 1.00). This indicates that students perceived variant C as harder to comprehend than the other two. Nevertheless, (H2) must be rejected as well; perceived comprehensibility does not confound with students’ actual text comprehension because the latter does not vary significantly across levels of linguistic demand (Table 4).
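The post hoc comparisons above are paired-samples t tests on the same students’ ratings of two text variants. A minimal sketch with made-up ratings:

```python
# Paired-samples t test on perceived-comprehensibility ratings
# (1 = very comprehensible ... 6 = not comprehensible).
from scipy.stats import ttest_rel

ratings_A = [2, 3, 2, 2, 3, 1, 2, 3]  # hypothetical ratings, variant A
ratings_C = [3, 4, 3, 2, 4, 2, 3, 3]  # same students, variant C

t, p = ttest_rel(ratings_A, ratings_C)
print(round(t, 2), round(p, 3))  # a negative t mirrors A being rated easier
```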

Discussion

The present experiment assessed the influence of linguistic features as well as perceived comprehensibility on students’ ability to comprehend expository science text. However, linguistic demand could not explain the variance in science text comprehension we found in our sample of secondary school students.

Challenging to Read, Easy to Comprehend?

Even though a significant effect of linguistic demands on text comprehension was not detected, our results show that students perceive a text as less comprehensible if it is linguistically demanding. Put succinctly, linguistic demands were perceived as a challenge, but in terms of text comprehension, there was none. This result appears contradictory but can be explained in plausible ways. It may be the case that if a text is perceived as linguistically demanding, students feel challenged. Consequently, they might read the text more carefully, more slowly, and with an increased level of attention and focus. Hence, an effect of linguistic demands on text comprehension might be compensated by an increased awareness of these demands, resulting in an adaptation of reading strategies. Moreover, the results presented in this paper corroborate the findings of Tolochko et al. (2019) on the interrelation of perceived complexity and actual text comprehension, thereby calling into question studies that relied on students’ perceived complexity and comprehensibility as an indicator of actual text comprehension.

Based on the research and findings presented here, we assume that the perception of high linguistic demand may lead to an adaptation of reading strategies resulting in a compensatory effect. Whether reading strategies are adapted in such a case may be mediated by students’ motivation and interest in the content presented to them. This would mean that students’ lower performance in assessments might not be related to their reading proficiency alone but also to affective traits.

Is the Influence of Linguistic Complexity on Text Comprehension Currently Overstated?

As stated by Pohl (2016), recent research takes the second step before the first if concepts we do not yet fully understand are used to inform pedagogical programs for academic language acquisition. Instead, we should first deepen our knowledge about the effects of academic language features on learning in general and on text comprehension in particular. We contribute to this body of research by showing that the systematic variation of linguistic features found to create higher cognitive load—including academic language features frequently assumed to generate difficulty in text comprehension—does not significantly affect text comprehension after all. Even at a rather demanding level of linguistic complexity, text comprehension was found not to decrease. In accordance with prior research (Härtig et al., 2019), the results of our study provide further evidence that the correlation of students’ reading proficiency and science performance found in large-scale assessments might be less related to the linguistic demand of assessment items or expository texts than expected. All this implies that a reduction of linguistic demands in science classes will most likely not support science learning for regular secondary school students in general.

Limitations

Due to limited test time as well as COVID-19 pandemic regulations, only a limited number of factors influencing text comprehension and item difficulty could be controlled. Moreover, our results are based upon a convenience sample that underrepresented the non-academic school track. The design itself has limitations as well; the generally high text comprehensibility of our instrument could partly explain the results. Perhaps drafting an expository physics text by means of a comprehensibility model like Göpferich’s (2009) results in a strong reduction of cognitive load or even the absence of any overload. If this is true, the generally high comprehensibility of an expository science text might compensate for its high linguistic demand. This limitation suggests that authors of school textbooks should apply comprehensibility models that support science text comprehension and learning.

In line with Höttecke et al. (2018), our results question the assumption of a linear relation between linguistic demand and level of text comprehension suggested by Heine et al. (2018). In contrast to ample psychological research varying single surface features, the more realistic variation of linguistic demands via multiple surface features does not explain different levels of text comprehension in our study. We assume that science text comprehension is a rather complex activity influenced by various factors usually not controlled in experiments. Further research on and development of such models need to consider additional factors influencing text comprehension, with a focus on effects mediating linguistic demand such as the above-mentioned affective and motivational traits.

Future Research

All this implies that further research is required. Since it is unclear whether the texts of our instrument are more comprehensible than those in textbooks, studies comparing our texts with expository physics textbook texts on the same content are needed. Most likely, certain specifics of our texts—such as their structure, the consideration of students’ pre-instructional ideas, and the focus on a physical phenomenon—foster students’ comprehension independently of linguistic demand. Additionally, it is still unclear how a reader’s perceived complexity affects their attention or motivation to engage with a text, leading to an adaptation of reading strategies and varying levels of text comprehension. Future studies should consider the adaptation of reading strategies as well as motivational and affective factors, with a closer focus on interaction effects.