Introduction

New reading media have undoubtedly changed the conception of what constitutes a typical reading situation (e.g., Salmerón et al., 2018). However, reading comprehension still rests on the ability to recognize words and construct meaning at the levels of words, sentences and paragraphs, and the text as a whole. Reading comprehension is not only foundational in educational contexts; it is also essential for active participation in society, the possibility of getting a job, and the capability to provide for oneself and one’s family (UNICEF, 2023).

Theoretical models of reading comprehension converge on the idea that readers need to integrate content within a text as well as text content and prior knowledge in order to build a coherent mental representation of the text (McNamara & Magliano, 2009). Arguably, the most influential of these models is the construction-integration model by Kintsch (1988, 1998), which accounts for a variety of data from numerous reading experiments, such as data on reading times, text recall, and text summarization. The construction-integration model suggests that readers form three layers of mental representation when comprehending a text. The surface code refers to the exact words and sentences in the text and can be considered to have a weak relationship to comprehension. The level of the textbase refers to the structure of the content presented in the text and, as such, captures the text-internal gist meaning. A situation model is formed when content in the textbase is integrated with relevant prior knowledge used to draw inferences that go beyond what is explicitly stated in the text, with the integration of textbase content and prior knowledge presumably taking place in working memory and thus being dependent on the working memory capacity of the reader. Of course, situation model construction is also dependent on word-level reading processes (i.e., word recognition and vocabulary) to create the surface code and the textbase from which the situation model arises. According to Kintsch (1998), the ability to form a coherent mental representation of text at the level of the situation model is particularly important because such a representation can be applied in new contexts. However, applauding deeper-level situational understanding of text is one thing; measuring it in an effective and, not least, efficient way is another, and it represents a great challenge to both researchers and educators.

A critique against many measures of reading comprehension is their lack of any explicit grounding in theories of reading comprehension (Kendeou, 2020; McMaster & Kendeou, 2023). When this complex construct is measured, the aspect of reading comprehension that is assessed and thus the cognitive processes involved in answering questions and responding to items will depend on the type of measure that is used and the theory on which that measure rests (Cho et al., 2023; Hua & Keenan, 2017). Consequently, a reading comprehension measure built on the construction-integration model should go beyond readers’ textbase and try to capture their ability to draw inferences (i.e., connecting the textbase to prior knowledge) and achieve a deeper understanding of the situation described in the text (Leslie & Caldwell, 2017).

Measures of reading comprehension are often divided into the main formats of constructed response and selected response (Haladyna & Rodriguez, 2013; Leslie & Caldwell, 2017). The constructed response format requires that readers produce (construct) responses (typically written or oral) based on prompts, for example, in the form of open-ended questions, essay tasks, or instructions for free recall. The selected response format includes items presenting readers with a set of possible responses and requires that they select the best ones, with multiple-choice, re-ordering of paragraphs, and different types of fill-in-the-blanks (i.e., cloze tests) being the most common implementations of this format (Becker & Nekrasova-Beker, 2018). Major benefits of the selected response format are efficient administration, easy dichotomous scoring procedures, and few to no scoring errors (Haladyna & Rodriguez, 2013). The question is, however, whether tests using a selected response format, such as cloze tests, can also capture deeper levels of reading comprehension.

Elbro and his colleagues have been at the forefront of research to develop cloze tests that target inferential or situational level understanding of text (Gellert & Elbro, 2013; Jensen & Elbro, 2022). Thus, Gellert and Elbro (2013) designed a cloze test that required local bridging inferences to select the correct alternatives (i.e., words) for 41 gaps included in five narrative and five expository texts that ranged in length from 40 to 330 words. Specifically, the local bridging inferences were required to generate connections between adjacent sentences, for example: Your skin may also become dry during the flight. Therefore bring [water—ear plugs—medicine—cream] on longer flights. In a validation study of the original, Danish version of this test, Gellert and Elbro (2013) showed that the scores of adult and young adult readers were positively correlated with their scores on word decoding, vocabulary, and standardized question-answering reading comprehension measures, and both word decoding and vocabulary were found to be unique predictors of readers’ scores on this cloze test. A Norwegian version of this cloze test was used in a series of studies by Latini and colleagues in which undergraduate readers’ scores on this measure correlated positively with their deeper level comprehension of both single (Latini & Bråten, 2022; Latini et al., 2020) and multiple texts (Latini et al., 2019). Likewise, Haverkamp et al. (2023) found that the scores of Norwegian undergraduate and postgraduate students on this measure correlated positively with their working memory as well as with their coherent mental representation of an expository text. Strømsø (2023) found that performance on the Norwegian version of this cloze test was uniquely predicted by undergraduates’ print exposure, unless students also were highly exposed to websites, especially social media sites.

Recently, Jensen and Elbro (2022) designed and validated a “deep cloze test” that targeted global situational rather than local bridging inferences. Thus, for each of the narrative text passages included in the test, readers had to generate and use a prior knowledge-based interpretation of the situation described in the passage (i.e., situation model) to achieve a coherent mental representation that could drive their selection of the correct alternative (i.e., word). This means that unlike typical selected response tests, this test seems to require that readers construct a coherent mental representation of the situation described in the passage before they can complete the task by filling in the missing detail. As an example, an English translation of one passage from this cloze test read:

She had to be ready in two hours so she was in a bit of a rush. The bag was already in the car and the ticket, keys, and wallet were in her pocket. Her husband ran after her with her [passport, packed lunch, shopping list, USB key]. It was lucky, otherwise she would not have got very far.

(Jensen & Elbro, 2022, p. 1233)

Of note is that readers could be assumed to be able to generate situation models for the narrative passages included in the test by drawing on common world knowledge. Jensen and Elbro (2022) provided some validation data for the original, Danish version of the deep cloze test, reporting that the scores of adult readers were highly correlated with their scores on a standardized reading comprehension test as well as with their scores on other reading-relevant variables (vocabulary, sentence comprehension, topic identification). In a validation study of a Spanish adaptation of the deep cloze test designed by Jensen and Elbro (2022), Salmerón and colleagues (2022) found that although undergraduate readers’ scores on the deep cloze test were not statistically significantly correlated with their scores on a standardized reading comprehension test targeting inferential understanding (Study 2a), they uniquely predicted factual and inferential understanding of a researcher-generated comprehension test after perceived prior knowledge was controlled for (Study 2b). Further, undergraduates’ cloze test scores uniquely predicted their course grades after variance associated with non-verbal intelligence was accounted for. In the current research, we aimed to provide a deeper understanding of the deep cloze test by examining not only potential antecedents of students’ performance on this measure but also the potential of this measure to predict both distal and more proximal outcomes when other relevant individual differences are controlled for.

Thus, across two studies, we examined to what extent undergraduate readers’ scores on the Norwegian version of the deep cloze reading comprehension test designed by Jensen and Elbro (2022) could be predicted by their language background as well as by cognitive individual difference variables and basic reading skills. Readers’ language background was included in both studies because it may indicate differences in vocabulary relevant for the performance on this test (Jensen & Elbro, 2022). Other predictors deemed relevant were readers’ working memory (Studies 1 and 2) and word recognition skills (Study 1) because they, in accordance with Kintsch’s (1988, 1998) theory, may underlie situation level understanding of text. Finally, we explored whether readers’ cognitive reflection, as an indication of rational thinking (Toplak et al., 2014), might be related to their ability to evaluate the response alternatives and select the correct one in the context of the described situation (Study 1). However, further understanding of the deep cloze reading comprehension test seems to require not only new knowledge about potential contributors to test performance but also new knowledge about whether and how test performance may be related to external criteria. In the current research, we therefore related readers’ scores on this test to a more distal outcome in the form of course achievement (Study 1) and to a more proximal outcome in the form of integrated understanding of multiple expository texts as assessed with an open-ended written comprehension measure (Study 2).

In summary, we addressed the following two research questions:

  1. Do readers’ language background, cognitive individual differences (working memory, cognitive reflection), and basic reading skills (word recognition) explain variance in their scores on the deep cloze reading comprehension test?

  2. Do readers’ scores on the deep cloze reading comprehension test explain variance in their course achievement and their integrated understanding of expository texts?

Study 1

The purpose of this study was to examine whether students’ language background, relevant cognitive individual differences (i.e., cognitive reflection and working memory capacity), and basic reading skills (i.e., word recognition) would contribute to their scores on the deep cloze reading comprehension measure. Additionally, we examined whether scores on the reading comprehension measure would be positively related to students’ achievement in the course (i.e., their course grades). Based on the construction-integration model of text comprehension (Kintsch, 1988, 1998), the nature of the deep cloze test, and prior research on individual differences and components of reading comprehension (Afflerbach, 2016; Israel, 2017), we expected that language background as well as working memory and word recognition would explain (possibly unique) variance in students’ deep cloze test scores. Further, based on the assumption that responding to the deep cloze test would require reflection on the response alternatives (i.e., words) in order to select the one that fits the inferred discourse context, as well as prior research linking cognitive reflection to scores on an inference-based cloze test (Latini et al., 2019), we considered it likely that students’ scores on the Cognitive Reflection Test (Frederick, 2005) would be positively related to their scores on the deep cloze reading comprehension measure. Finally, based on prior research linking reading comprehension to college achievement (Clinton-Lisell et al., 2022), as well as a Spanish validation study of the deep cloze test (Salmerón et al., 2022), we expected that students’ scores on the deep cloze test would be positively related to their course grades.

Participants

Participants were 90 students at the University of Oslo who were enrolled in a bachelor program in special education. Their mean age was 23.42 years (SD = 3.92), and 86.7% identified as female, 8.9% as male. Four participants did not report on gender identification. Most (58.9%) had Norwegian as their sole language background, while 13.3% had another language background, and 23.3% had a mixed language background (i.e., Norwegian and another language). Four participants did not report on language background. The Norwegian Social Science Data Services approved collection and handling of the data.

Materials

Demographic survey

We used a brief demographic survey to gather information about participants’ age, gender identification (“with which gender do you identify the most?”), and language background. With respect to language background, we asked participants in which language their parents talked to them when they grew up, using the three response categories of Norwegian, another language, or another language and Norwegian. For the statistical analyses, the two latter categories were collapsed, with only Norwegian coded as 1 and another language with or without Norwegian coded as 0. Finally, participants reported their final grades in a developmental psychology course that was central to their bachelor program (see Course Achievement below).

Cognitive reflection test

The Cognitive Reflection Test (CRT) was designed by Frederick (2005). This test focuses on learners’ ability to solve problems by overriding a prepotent intuitive, yet incorrect, response alternative and engaging in further reflection and rational thinking that can lead to the correct response (Toplak et al., 2014). The CRT can therefore be said to demand deeper-level self-regulatory processing. In terms of validity, prior research has demonstrated positive relations between scores on the CRT and scores on other rational thinking measures, and has also shown that scores on the CRT can predict rational thinking skills over and above cognitive ability and various thinking dispositions (Kahneman, 2011; Kahneman & Frederick, 2007; Toplak et al., 2011). Strømsø and Bråten (2017) developed and validated a Norwegian adaptation of the CRT, which was further validated by Bråten et al. (2017) and Latini et al. (2019). The test includes three numerical problems (sample item: “A nut and a bolt cost NOK 1.10 together. The bolt costs NOK 1.00 more than the nut. How much does the nut cost?”). One point was awarded for each correct answer, such that the possible maximum score was 3. The reliability of participants’ scores (Cronbach’s α) was 0.75.
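To illustrate why the sample item invites an intuitive but incorrect response, the arithmetic can be written out. The symbol n (the price of the nut in NOK) is introduced here for illustration only and does not appear in the test itself.

```latex
\begin{align*}
n + (n + 1.00) &= 1.10 \\
2n &= 0.10 \\
n &= 0.05
\end{align*}
```

The intuitive response of NOK 0.10 fails this check, because it would make the bolt cost NOK 1.10 and the two items NOK 1.20 together.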

Word recognition measure

We used the Word Chain Test, which was standardized by Strømsø et al. (1997), to assess participants’ ability to recognize whole words rapidly. Participants were presented with 360 words organized in 30 rows. Four clusters of words, each consisting of three high frequency words, were included in each row. These words were printed without any spaces between the words (e.g., stopsleepyellow). Participants were given 180 s to indicate as many words as possible by drawing vertical lines between them. Scoring was done by counting the number of correct word clusters, such that the possible maximum score was 120. Prior research has shown high reliability for scores on this measure. For example, Strømsø et al. (1997) reported an α of 0.95 in a sample of university students, and Braasch et al. (2014) reported an α of 0.93 in a sample of secondary school students. In terms of validity, Bråten et al. (1999) presented evidence that the Word Chain Test favors readers with well-established and accessible orthographic representations in the lexicon. Further, validity has been demonstrated by correlations between scores on this measure and both silent and oral word reading skills (Jacobson, 1995). Finally, Andresen et al. (2019) showed that scores on this measure clearly differed between lower secondary school students with and without dyslexia.

Working memory measure

A Norwegian adaptation of Swanson and Trahan’s (1992) Working Memory Span Task was used to assess participants’ working memory. This approach to measuring working memory was originally developed by Daneman and Carpenter (1980), and the adaptation we used in this study has been validated in several previous studies with Norwegian postsecondary students (e.g., Bråten et al., 2022; Delgado et al., 2020; Haverkamp & Bråten, 2024). The materials used in this task consisted of 42 unrelated declarative sentences ranging from five to 12 words that were organized into 12 sets of sentences. An increasing number of sentences (from two to five) were included in each set, with the sentences in each set read aloud to participants with two seconds between each sentence. Participants were instructed to comprehend each sentence in the set, such that they could write down the answer to a question about the content of one of these sentences after the final sentence in the set was read aloud. Afterwards, using the same response form, they were to write down the last word of each sentence in the set. The scoring of the working memory task was done by counting the total number of words recalled across the 12 sets, such that the possible maximum score was 42. No points were given for correctly recalled words if the comprehension question for the set was answered incorrectly. The reliability of participants’ scores (Cronbach’s α) was 0.80.
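The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the authors’ scoring script: the function name, the data layout (one pair per set holding whether the comprehension question was answered correctly and how many final words were correctly recalled), and the example responses are all hypothetical.

```python
from typing import List, Tuple

# One entry per sentence set: (comprehension question correct?, final words recalled).
SetResponse = Tuple[bool, int]


def score_working_memory(sets: List[SetResponse]) -> int:
    """Sum recalled final words, counting a set only if its question was answered correctly."""
    return sum(recalled for question_correct, recalled in sets if question_correct)


# Hypothetical responses for four sets; the third set earns no points
# because its comprehension question was answered incorrectly.
example_sets = [(True, 2), (True, 2), (False, 3), (True, 4)]
example_score = score_working_memory(example_sets)
```

Under this rule, recall in a set counts only when the set’s comprehension question is answered correctly, which is how the task ties storage to processing of sentence content.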

Reading comprehension measure

A Norwegian adaptation of the deep cloze test developed by Jensen and Elbro (2022) was used to measure reading comprehension. The deep cloze test requires that readers draw global, situation level inferences (Kintsch, 1988, 1998) in order to fill in the gaps in short narrative passages. It consists of 34 narrative passages with 2–4 sentences in each passage. In each passage, there is one gap that readers have to refill by choosing among four alternative words provided for that gap. Importantly, correct refilling of each gap can only be based on inferences drawn about the global situation described in the passage, that is, on situation model construction (Kintsch, 1988, 1998). Participants were given 10 min to read the passages and refill as many gaps as possible. Scoring involved counting the number of correctly refilled gaps, such that the possible maximum score was 34. The reliability of participants’ scores (Cronbach’s α) was 0.81. For further description of this measure and discussion of preliminary validation efforts, see the Introduction above.
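The reliability coefficient reported for the deep cloze scores can be illustrated with a minimal sketch of Cronbach’s α computed from dichotomously scored gaps (1 = correctly refilled, 0 = otherwise). The function name and the small response matrix are hypothetical, and the software the authors used is not stated in the text.

```python
from statistics import variance
from typing import List


def cronbach_alpha(item_scores: List[List[int]]) -> float:
    """Cronbach's alpha for a participants-by-items matrix of item scores."""
    k = len(item_scores[0])  # number of items (gaps)
    # Sample variance of each item across participants.
    item_vars = [variance([row[j] for row in item_scores]) for j in range(k)]
    # Sample variance of the participants' total scores.
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: five participants, four gaps.
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
total_scores = [sum(row) for row in responses]  # number of correctly refilled gaps
alpha = cronbach_alpha(responses)
```

With the actual test, each row would hold 34 item scores and the total score would range from 0 to 34.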

Course achievement

Participants in this study attended a semester-long course in Developmental Psychology (30 ECTS). According to course evaluations, they considered this course the most challenging in their bachelor program due to the amount of reading as well as the difficulty level of the reading materials (textbooks and original research papers). The course examination had the form of a six-hour home exam with all written aids (e.g., course materials, personal course notes, and so forth) allowed. Students were given three essay questions and had to select and answer two of them. The course exam was graded on a scale from A (Excellent) to F (Fail), with A scored as 5, B as 4, and so forth, in the current study.

Procedure

The data were collected in class by the second and third authors during regular seminars. Participants completed the measures on paper in the following order: the working memory measure, the word recognition measure, the deep cloze reading comprehension measure, the cognitive reflection test, and the demographic survey.

Results

Descriptive information and correlations for all variables are included in Table 1. As can be seen, language background (r = 0.29, p = 0.007), cognitive reflection (r = 0.24, p = 0.022), working memory (r = 0.47, p < 0.001), and word recognition (r = 0.44, p < 0.001) were all positively and statistically significantly correlated with scores on the reading comprehension measure. Further, working memory (r = 0.35, p = 0.001) and word recognition (r = 0.36, p = 0.001), as well as reading comprehension (r = 0.33, p = 0.003), were positively and statistically significantly correlated with achievement in the course (i.e., course grades).

Table 1 Descriptive information and zero-order correlations in Study 1

We first performed a simultaneous multiple regression analysis using language background, cognitive reflection, working memory, and word recognition as predictors and scores on the reading comprehension measure as the dependent variable. Table 2 shows that the four predictors together explained 41% of the variance in reading comprehension, F(4, 78) = 13.63, p < 0.001. This can be considered a large effect in multiple regression analysis (Cohen, 1988). Language background (β = 0.18, p = 0.049), working memory (β = 0.33, p = 0.001), and word recognition (β = 0.34, p < 0.001), but not cognitive reflection (β = 0.16, p = 0.079), were statistically significant unique positive predictors of reading comprehension.

Table 2 Results of simultaneous multiple regression analysis predicting deep cloze reading comprehension in Study 1

Next, we explored the possibility that there might be an interactive effect of working memory and word recognition on reading comprehension, for example, because a combination of higher working memory and higher word recognition might be particularly beneficial, or because higher working memory might compensate for poorer word recognition (or vice versa). To explore this possibility, we created a variable representing the cross-product multiplicative term between working memory and word recognition and entered this variable in the second step of a hierarchical multiple regression analysis, after controlling for language background, cognitive reflection, working memory, and word recognition. Following Aiken and West (1991), we created the interaction term and performed the regression analysis after having centered the working memory and word recognition variables in order to prevent multicollinearity between the first-order terms (i.e., working memory and word recognition) and the interaction term (i.e., working memory × word recognition). The other variables were left in their original metrics. This analysis showed that the addition of the interaction term in step two did not result in any increment in the explained variance for reading comprehension, Fchange(1, 77) = 0.38, p = 0.54 (β = −0.06, p = 0.54, for the interaction term). This suggests that the effects of working memory and word recognition on reading comprehension were additive rather than interactive.
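The centering and two-step logic described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors’ analysis: it uses only two first-order predictors rather than four, the data are invented, and all function names are hypothetical. The least-squares fit is done with plain normal equations so that the sketch has no dependencies.

```python
from typing import List, Sequence


def center(values: Sequence[float]) -> List[float]:
    """Subtract the mean so that the variable has mean zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(y: Sequence[float], predictors: List[Sequence[float]]) -> float:
    """R^2 of an ordinary least-squares regression of y on the predictors plus an intercept."""
    rows = [[1.0] + list(p) for p in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def f_change(r2_full: float, r2_reduced: float, n_added: int, df_error: int) -> float:
    """F statistic for the increment in R^2 when n_added predictors are entered."""
    return ((r2_full - r2_reduced) / n_added) / ((1.0 - r2_full) / df_error)


# Invented scores for six participants (illustration only).
working_memory = [1, 2, 3, 4, 5, 6]
word_recognition = [2, 1, 4, 3, 6, 5]
comprehension = [10, 9, 14, 12, 19, 16]

wm_c = center(working_memory)
wr_c = center(word_recognition)
interaction = [a * b for a, b in zip(wm_c, wr_c)]  # product of the centered variables

r2_step1 = r_squared(comprehension, [wm_c, wr_c])
r2_step2 = r_squared(comprehension, [wm_c, wr_c, interaction])
# Error df of the full model: n - number of predictors - 1.
f_step2 = f_change(r2_step2, r2_step1, 1, len(comprehension) - 3 - 1)
```

The reported degrees of freedom, Fchange(1, 77), follow the same pattern: with one added term and five predictors in the full model, 77 error degrees of freedom correspond to 83 complete cases.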

Finally, we conducted a simultaneous multiple regression analysis using language background, cognitive reflection, working memory, word recognition, and the reading comprehension measure as predictors and participants’ course grades as the outcome variable. Table 3 shows the results of this analysis. The five predictors together explained 20% of the variance in course achievement, F(5, 73) = 3.60, p = 0.006, which can be considered a medium effect (Cohen, 1988). However, none of the predictors explained a statistically significant unique portion of the variance in course achievement, likely due to intercorrelations among the predictors (see Table 1).
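The effect-size labels used for the two simultaneous regression models can be checked against Cohen’s (1988) f² benchmarks (0.02 small, 0.15 medium, 0.35 large), and the reported F values can be approximated from the rounded R² values. The sketch below is only a consistency check with hypothetical function names; small discrepancies from the reported F statistics are to be expected because R² is rounded to two decimals in the text.

```python
def cohens_f2(r_squared: float) -> float:
    """Cohen's f^2 effect size for a multiple regression model."""
    return r_squared / (1.0 - r_squared)


def f_from_r2(r_squared: float, n_predictors: int, df_error: int) -> float:
    """Overall F statistic implied by R^2 and the model's degrees of freedom."""
    return (r_squared / n_predictors) / ((1.0 - r_squared) / df_error)


# Model predicting deep cloze scores: R^2 = .41, F(4, 78) reported as 13.63.
f2_cloze = cohens_f2(0.41)        # roughly 0.69, above the 0.35 benchmark for a large effect
f_cloze = f_from_r2(0.41, 4, 78)  # roughly 13.55 from the rounded R^2

# Model predicting course achievement: R^2 = .20, F(5, 73) reported as 3.60.
f2_course = cohens_f2(0.20)        # 0.25, between the 0.15 and 0.35 benchmarks
f_course = f_from_r2(0.20, 5, 73)  # roughly 3.65 from the rounded R^2
```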

Table 3 Results of simultaneous multiple regression analysis predicting course achievement in Study 1

Discussion

This study validated the Norwegian version of the deep cloze reading comprehension measure by showing that language background, cognitive reflection, working memory, and word recognition, as expected, were positively related to participants’ scores on this measure, and that three of these variables (i.e., language background, working memory, and word recognition) also explained unique portions of the variance in reading comprehension as assessed with the deep cloze test. Further, the results suggested that the potential effects of working memory and word recognition on scores on the deep cloze test may be additive rather than interactive. Finally, although scores on the deep cloze test, as expected, were positively correlated with participants’ course achievement, the measure did not emerge as a unique predictor of course achievement when language background, cognitive reflection, working memory, and word recognition were controlled for. It should be noted, however, that course achievement can be regarded as a distal outcome measure in relation to reading comprehension. In Study 2, we therefore included a more proximal outcome measure targeting participants’ understanding of an issue discussed across multiple texts that participants read just before their understanding was assessed. Also, participants from several programs in addition to special education were included in Study 2, with all data collected individually instead of in class.

Study 2

The purpose of this study was to replicate some of the findings from Study 1 in a different sample and a different setting for data collection. In particular, we asked whether the findings suggesting that language background and working memory may underlie students’ performance on the deep cloze reading comprehension measure may be supported when students are recruited from a broader student population and when data are collected individually rather than in class. Additionally, we examined whether scores on the reading comprehension measure may uniquely predict understanding of multiple texts as evidenced by an open-ended written assessment when language background, working memory, and prior knowledge are controlled for. Based on Study 1, we expected that both language background and working memory would explain a unique portion of the variance in students’ reading comprehension scores. Further, based on the assumption that situational level comprehension of each single text is needed to construct an integrated understanding of an issue discussed in multiple texts, as well as prior research linking single text comprehension to the understanding of multiple texts (e.g., Florit et al., 2019; Mahlow et al., 2020), we expected that scores on the deep cloze reading comprehension measure would uniquely predict the understanding of multiple texts when the other relevant predictors were controlled for.

Participants

One hundred and thirty-four students enrolled in different programs at the University of Oslo participated in this study. Their programs included arts and humanities, education, social sciences, informatics and mathematics, and special education. None of these students participated in Study 1. The vast majority (97%) were bachelor students and the rest (3%) were master students. Their mean age was 24.03 years (SD = 6.40), and 76.9% identified as female, 18.7% as male, and 3.0% as other (two participants did not report on gender identification). The language background of most participants (67.2%) was only Norwegian, but 17.9% had another language background, and the language background of 14.9% of the participants was Norwegian and another language (i.e., a mixed language background). All these participants completed a demographic survey as well as measures of working memory and reading comprehension (i.e., the deep cloze test), allowing us to compare scores on the reading comprehension measure with scores on relevant background (i.e., language background) and cognitive (i.e., working memory) variables. The collection and handling of the data were approved by the Norwegian Social Science Data Services.

A subsample of 67 participants also completed a prior knowledge measure, read four texts on issues concerning sun exposure and health, and afterwards completed a written assessment of their understanding of the issues discussed in the texts. In this subsample, we examined whether the deep cloze test uniquely predicted understanding of the texts after controlling for language background, working memory capacity, and prior knowledge. Students in this subsample were enrolled in all the programs mentioned previously, with 19.2% enrolled in arts and humanities, 22.1% in education, 26.4% in social sciences, 1.5% in informatics and mathematics, and 30.8% in special education. Thirty-one participants were first-year bachelor students, 16 were second-year bachelor students, 18 were third-year bachelor students, and 2 were enrolled in master level programs. Their overall mean age was 23.65 years (SD = 5.35), and 77.6% identified as female, 22.4% as male. The language background of most participants (63.2%) was only Norwegian, but 20.6% had another language background, and the language background of 16.2% of the participants was Norwegian and another language (i.e., a mixed language background).

Participants in this study also contributed to data analyzed in a study by Haverkamp et al. (2024) on the effects of multitasking on multiple text comprehension. Research questions, analyses, and findings reported in the present article are unique to this study, however.

Materials

Demographic survey

We used a brief demographic survey to gather information about participants’ age, gender identification (“with which gender do you identify the most?”), study program, and language background. With respect to language background, we asked participants in which language their parents talked to them when they grew up, using the three response categories of Norwegian, another language, or another language and Norwegian. For the statistical analyses, the two latter categories were collapsed, with only Norwegian coded as 1 and another language with or without Norwegian coded as 0.

Working memory measure

The working memory measure used in this study was identical to the measure used in Study 1. Cronbach’s α was 0.87 for the entire sample; for the subsample, it was also 0.87.

Reading comprehension measure

We used the same Norwegian adaptation of the deep cloze test developed by Jensen and Elbro (2022) as in Study 1. Cronbach’s α was 0.84 for the entire sample; for the subsample, it was 0.86.

Prior knowledge measure

In the subsample, prior knowledge about the topics discussed in the text materials was assessed with a 17-item multiple-choice measure that has been used and validated in prior research (e.g., Stang Lund et al., 2019). This measure targeted knowledge about the broader issue of sun exposure and health, including items referring to information and concepts that were relevant to the content of the four texts. Thus, the items covered sun exposure in relation to both physical and mental health (e.g., skin cancer, production of vitamin D, depression, and sleeplessness). Participants’ scores were the number of correct responses (maximum score = 17). The internal consistency reliability for participants’ scores (Cronbach’s α) was 0.72.

Text materials

Participants in the subsample read four texts on the issue of sun exposure and health. Two of the texts presented different perspectives on the issue of sun exposure and physical health, and the other two presented different perspectives on the issue of sun exposure and mental health. The different perspectives were based on authentic reading materials about this controversial socio-scientific issue (Moan et al., 2012). The four texts were adapted, longer versions of texts used in prior research on multiple text comprehension (Delgado et al., 2020; Stang Lund et al., 2019). The length of the texts ranged from 600 to 612 words (M = 606.50, SD = 5.00). With respect to readability scores, which were computed using Björnsson’s (1968) formula, the texts ranged from 40 to 49 (M = 45.25, SD = 4.11). This indicates that the difficulty level of the texts was similar to the difficulty level of informational texts published by the Norwegian government (Vinje, 1982). Source information was presented at the top of each text, including the author’s name, credentials, and affiliation, as well as the date of publication and the publication venue.
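Björnsson’s readability formula (often abbreviated LIX) is usually given as the average number of words per sentence plus the percentage of words longer than six letters. The following is a minimal sketch under simplifying assumptions: sentences are split on full stops, exclamation marks, and question marks only, words are runs of letters, and the function name and sample sentences are hypothetical. The authors’ exact counting rules are not described in the text.

```python
import re


def lix(text: str) -> float:
    """Readability index: mean sentence length plus percentage of long words."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, including non-ASCII
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    long_words = [w for w in words if len(w) > 6]  # more than six letters
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


# Hypothetical two-sentence sample: 8 words, 2 sentences, 3 long words.
sample = "The sun shines. Vitamin production happens in skin."
sample_lix = lix(sample)  # 8 / 2 + 100 * 3 / 8 = 41.5
```

Higher values indicate more difficult text, so scores between 40 and 49 place the four texts in a similar, moderately demanding range.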

Regarding the two texts on sun exposure and physical health, one text presented and elaborated the perspective that sun exposure is harmful because it may cause skin cancer, whereas the other text presented and elaborated the perspective that sun exposure is healthy because it may protect against all forms of cancer through the production of vitamin D. For example, the perspective that sun exposure may lead to skin cancer was elaborated in terms of the underlying mechanism (UV-radiation can damage DNA) and the types of skin cancer that may occur (basal-cell carcinoma and melanoma), and the perspective that sun exposure may protect against all forms of cancer was elaborated in terms of the underlying mechanism (cells use vitamin D to stay normal) and types of cancer and other illnesses vitamin D may protect against (e.g., colon cancer, osteoporosis). Regarding the two texts on sun exposure and mental health, one text presented and elaborated the perspective that lack of sun exposure may lead to depression, whereas the other text presented and elaborated the perspective that lack of sunlight may lead to sleeplessness (but not depression). For example, the perspective that lack of sun exposure may cause depression was elaborated in terms of the underlying mechanisms (decrease in serotonin and increase in melatonin) and the type of depression that may occur (seasonal affective disorder), and the perspective that lack of sunlight may lead to sleeplessness was elaborated in terms of underlying mechanisms (changes in the secretion of melatonin and disturbance of the diurnal rhythm) and evidence against the theory that lack of sunlight causes depression.

Text understanding measure

We measured participants’ understanding of the texts by means of post-reading written reports explaining the relationship between sun exposure, health, and illness. For each of the four perspectives presented and elaborated in the texts, two regarding sun exposure and physical health and two regarding sun exposure and mental health (see description of the texts), participants were awarded 0–2 points. A score of 0 was given if a perspective (e.g., that sun exposure is harmful because it may lead to skin cancer) was not represented in the report, a score of 1 was given if a perspective was represented but not elaborated, and a score of 2 was given if a perspective was both represented and elaborated (e.g., by referring to the mechanism by which sun exposure may lead to skin cancer).

In addition, participants were awarded 0–2 points for integration of the two perspectives concerning sun exposure and physical health and 0–2 points for integration of the two perspectives concerning sun exposure and mental health. A score of 0 was given if there was no attempt to compare, contrast, or connect two of these perspectives, a score of 1 was given if two of these perspectives were compared, contrasted, or connected (e.g., by acknowledging that sun exposure may have both negative and positive effects on physical health) but such integration was not explicitly elaborated, and a score of 2 was given if attempts to compare, contrast, or connect two of these perspectives were explicitly elaborated (e.g., by trying to reconcile the two perspectives by weighing the risk of skin cancer against the need to obtain sufficient vitamin D).

Finally, participants were awarded 0–2 points for integration of perspectives across the two issues discussed in the texts (i.e., the issue of sun exposure and physical health and the issue of sun exposure and mental health). A score of 0 was given if there was no attempt to integrate perspectives across the two issues, a score of 1 was given if perspectives across the two issues were compared, contrasted, or connected (e.g., by acknowledging that both physical and mental health may be influenced by UV-radiation from the sun) but such integration across issues was not elaborated, and a score of 2 was given if integration of perspectives across the two issues was explicitly elaborated (e.g., by explaining how lack of sunlight during winter might involve a risk of physical illness due to less vitamin D and a risk of mental illness due to less serotonin).

In summary, the total scores on our written comprehension assessment could range from 0 to 14, with 0–8 points awarded for representing and elaborating the four perspectives discussed in the texts and 0–6 additional points awarded for integrating information across the perspectives and issues discussed in the text set. High scores on this measure can thus be said to reflect an elaborated and integrated understanding of the texts’ content. Only the total scores on this measure were used in the statistical analyses.
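
The composition of the total score can be summarized in a short sketch. The function and argument names are hypothetical; only the point ranges (four perspective scores, two within-issue integration scores, and one across-issue integration score, each 0–2) are taken from the description above.

```python
def total_report_score(perspectives, within_issue, across_issues):
    """Sum the rubric components for one written report (range 0-14).

    perspectives:  four scores (0-2), one per perspective in the texts
    within_issue:  two scores (0-2), integration within each health issue
    across_issues: one score (0-2), integration across the two issues
    """
    if len(perspectives) != 4 or len(within_issue) != 2:
        raise ValueError("expected 4 perspective and 2 within-issue scores")
    components = list(perspectives) + list(within_issue) + [across_issues]
    if any(points not in (0, 1, 2) for points in components):
        raise ValueError("each component must be scored 0, 1, or 2")
    return sum(components)


# Hypothetical report: two perspectives elaborated, one merely represented,
# one missing, and one unelaborated within-issue integration.
example_total = total_report_score([2, 2, 1, 0], [1, 0], 0)
```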

The first and second authors scored all the written reports. After having scored 10 reports collaboratively to develop the scoring system, they scored a random selection of 13 reports independently. The independent scoring resulted in a somewhat lower than desirable, yet acceptable (Landis & Koch, 1977) Cohen’s kappa of 0.65, and a high correlation between the two raters’ total scores (Pearson’s r = 0.89). All disagreements were resolved in a thorough discussion in which the requirements for receiving 2 points on the perspective regarding sun exposure and skin cancer and for receiving 1 and 2 points, respectively, on the perspective regarding sun exposure and vitamin D were further specified (these points were responsible for most of the disagreements). The two raters then scored the remaining reports separately.
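
For readers unfamiliar with the agreement statistic, the following is a minimal sketch of unweighted Cohen’s kappa, which corrects the observed proportion of agreement for the agreement expected by chance. It assumes that agreement is computed over individual 0–2 rubric scores, which the description above does not specify; the ratings shown are invented for illustration and are not the study data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters' categorical scores."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: the raters agree on 8 of 10 scores.
ratings_a = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]
ratings_b = [0, 1, 2, 1, 1, 0, 2, 2, 0, 2]
kappa = cohens_kappa(ratings_a, ratings_b)  # (0.80 - 0.34) / (1 - 0.34)
```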

Procedure

Data for this study were collected individually by the second author and a trained assistant in a quiet room at the university. All participants first completed the working memory measure and then the reading comprehension measure on paper before they completed the demographic survey. Participants in the subsample also completed the prior knowledge measure in a web-based questionnaire before the four texts were introduced in the questionnaire as follows:

You are now going to read four texts about sun exposure, health, and illness that altogether consist of 2500 words. Afterwards, you are going to write a brief report based on these texts in which you explain the relationship between sun exposure, health, and illness. You cannot look back to the texts when writing your report.

The issue of sun exposure and physical health and the issue of sun exposure and mental health were presented in counterbalanced order, as were the two texts discussing each issue. After finishing reading the texts, participants read the following writing instruction:

There are different points of view on the relationship between the amount of sun exposure, health, and illness. You are now going to write a report in which you explain important similarities and differences between these points of view. Base your report on the texts you just read and try to express yourself as clearly and completely as possible, preferably in your own words.

Participants completed their report in a textbox with no word limit that was located right below this writing instruction and submitted it to a server when finished (see Footnote 1).

Results

Table 4 displays descriptive and correlational data for the entire sample. Both language background (r = 0.43, p < 0.001) and working memory (r = 0.40, p < 0.001) were positively and statistically significantly correlated with scores on the reading comprehension measure. Also, language background and working memory were positively correlated (r = 0.20, p = 0.018).

Table 4 Descriptive information and zero-order correlations for the entire sample in Study 2

We conducted a simultaneous multiple regression analysis using language background and working memory as predictors and scores on the reading comprehension measure as the dependent variable. Table 5 shows that the two predictors together explained 29% of the variance in reading comprehension, F(2, 130) = 26.06, p < 0.001, which can be considered a large effect (Cohen, 1988). Both language background (β = 0.36, p < 0.001) and working memory (β = 0.32, p < 0.001) uniquely and positively predicted reading comprehension.

Table 5 Results of simultaneous multiple regression analysis predicting deep cloze reading comprehension for the entire sample in Study 2

The results of descriptive and correlational analyses for the subsample are displayed in Table 6. As can be seen, working memory (r = 0.30, p = 0.016), prior knowledge (r = 0.47, p < 0.001), and reading comprehension (r = 0.49, p < 0.001) were positively and statistically significantly correlated with understanding of the text materials, whereas the correlation between language background and text understanding did not reach a conventional level of statistical significance when a two-tailed test was used with this sample size (r = 0.20, p = 0.10). However, language background was statistically significantly correlated with both prior knowledge (r = 0.27, p = 0.030) and the reading comprehension measure (r = 0.45, p < 0.001).

Table 6 Descriptive information and zero-order correlations for the subsample in Study 2

We performed a hierarchical multiple regression analysis for the subsample, using understanding of the textual materials as the dependent variable. Language background, working memory, and prior knowledge were entered into the equation in step one. In step two, we included participants’ scores on the reading comprehension measure (i.e., the deep cloze test). The results of this regression analysis are shown in Table 7. The variables entered in step one explained a statistically significant amount of variance in text understanding, with R² = 0.25, F(3, 62) = 7.00, p < 0.001. In this step, prior knowledge (β = 0.40, p = 0.001), but not language background (β = 0.06, p = 0.583) or working memory (β = 0.18, p = 0.12), was a unique and statistically significant positive predictor of text understanding. After step two, with scores on the reading comprehension measure also included in the equation, R² = 0.32, F change(1, 61) = 6.17, p = 0.016. Thus, the addition of this measure resulted in a statistically significant 7% increment in the explained variance. In this step, prior knowledge remained a unique and statistically significant positive predictor of text understanding (β = 0.30, p = 0.016) in addition to reading comprehension (β = 0.34, p = 0.016). An R² of 0.32 can be considered a large effect in multiple regression analysis (Cohen, 1988).
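
The test of the increment in explained variance between the two steps can be sketched as follows. The function name is hypothetical, and the sample size of 66 is inferred from the reported degrees of freedom, F(3, 62), rather than stated in this section. Because the inputs are the R² values rounded to two decimals, the result is only approximate.

```python
def f_change(r2_reduced, r2_full, n, k_full, k_added):
    """F statistic for the R-squared increment between nested regressions.

    r2_reduced: R-squared of the model without the added predictors
    r2_full:    R-squared of the model with the added predictors
    n:          sample size (assumed 66 here, inferred from F(3, 62))
    k_full:     number of predictors in the full model
    k_added:    number of predictors added in the final step
    """
    df_residual = n - k_full - 1
    increment = (r2_full - r2_reduced) / k_added
    return increment / ((1 - r2_full) / df_residual)


# Rounded values from the text: R-squared of 0.25 in step one, 0.32 in step two.
f_step_two = f_change(0.25, 0.32, n=66, k_full=4, k_added=1)
```

With these rounded inputs the sketch gives a value of about 6.28 on 1 and 61 degrees of freedom, close to the reported 6.17; the small gap is plausibly due to rounding of the R² values.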

Table 7 Results of hierarchical multiple regression analysis predicting text understanding for the subsample in Study 2

Discussion

This study replicated the unique predictability of language background and working memory for scores on the deep cloze reading comprehension measure that we found in Study 1. Further, the results of this study showed that scores on this measure contributed substantially to participants’ understanding of complex text materials over and above language background, working memory, and even prior knowledge about the issue discussed in the texts. These results further support the validity of the deep cloze test as a measure of situation level comprehension drawing on language skills and working memory resources, and contributing uniquely to an integrated understanding of more complex text materials. Importantly, these results were obtained with participants recruited from a broader student population than in Study 1 and with the data collected individually rather than in class, which also speaks to the generalizability of our findings.

General discussion

Given the essential role of reading comprehension both in and out of school (UNICEF, 2023), researchers and educators are constantly looking for effective and efficient ways to measure this complex construct. Specifically, measuring reading comprehension at a level that goes beyond what is explicitly stated in the text (i.e., at the level of the situation model; Kintsch, 1998) may be quite labor-intensive when constructed response formats such as open-ended questions and essay tasks are used. At the same time, however, it has been a somewhat open question to what extent deeper level reading comprehension actually can be measured in the more efficient, easy, and unambiguous ways represented by selected response formats, including cloze tests. Arguably, the pioneering work conducted by Elbro and colleagues (Gellert & Elbro, 2013; Jensen & Elbro, 2022) to develop and provide preliminary validation data for cloze tests targeting deeper level, inferential understanding has brought the research community closer to a positive answer to that question. In particular, this goes for their efforts to develop a deep cloze test targeting global situational understanding, which corresponds to situation model representation within the construction-integration model proposed by Kintsch (1988, 1998). However, despite the promising results initially obtained with the deep cloze reading comprehension test (Jensen & Elbro, 2022; Salmerón et al., 2022), more research is obviously needed before researchers and educators can draw inferences regarding deeper level understanding based on readers’ scores on this measure. In this study, we therefore set out to gain further insights into the nomological network within which the construct measured by the deep cloze test works.

Our first research question concerned potential contributors to readers’ performance on the deep cloze test. In two independent studies, we found that readers’ language background and working memory explained unique portions of the variance in their scores on this test, and in one of these studies, we showed that both language background and working memory contributed uniquely to test performance when word recognition also was included in the analysis. As expected, word recognition also was a unique predictor in the latter study. The finding that both word level reading processes and working memory resources seemed to be involved in readers’ performance on the deep cloze reading comprehension test is consistent with the construction-integration model by Kintsch (1988, 1998), with which the design of this measure is well aligned. Finally, the positive correlation between scores on the cognitive reflection test and performance on the deep cloze test is consistent with the view that the items of the deep cloze test may require some reflective judgment and decision-making (Kahneman, 2011) on the part of the readers.

Our second research question concerned the predictability of deep cloze test performance for students’ course achievement and integrated text understanding as assessed with an open-ended written comprehension assessment. First, although we found a medium large positive correlation (Cohen, 1988) between scores on the deep cloze test and students’ course achievement, this relationship did not survive control for the other variables included in Study 1 (i.e., language background, cognitive reflection, working memory, and word recognition). However, when a more proximal, open-ended measure of integrated text understanding was used as the outcome in Study 2, scores on the deep cloze test improved prediction of expository-text understanding beyond that offered by language background, working memory, and prior knowledge about the topic discussed in the reading materials. This finding is also consistent with theory (List & Alexander, 2019) and prior research (Florit et al., 2019; Mahlow et al., 2020) on multiple text comprehension, highlighting the importance of creating a coherent representation of each single text for the ability to draw inferences and construct meaning across multiple texts.

Taken together, our findings support an interpretation of the deep cloze reading comprehension test as an effective and efficient measure of situation level understanding that draws on language skills, word level processes, and working memory resources and also can be used to predict students’ performance on important criterial tasks requiring deeper level understanding. A limitation of our findings is, however, that questions regarding causal relationships cannot be answered by the correlational data that we used in these studies. Thus, although the relationships that we observed, in accordance with theoretical assumptions and prior research, can be interpreted in terms of potential contributors to and consequences of performance on the deep cloze reading comprehension test, firmer conclusions regarding the direction of these relationships require other (i.e., longitudinal or experimental) approaches. Further, although participants were recruited from a somewhat broader student population in Study 2 than in Study 1, with data also collected individually rather than in class in Study 2, other populations of readers and other reading contexts should be included in future research on the deep cloze reading comprehension test. Because our samples were rather small convenience samples with a clear majority identifying as female, further research using samples that are larger and more representative of the university community is desirable, as is further research on non-university populations. Importantly, research conducted in other cultural contexts is also needed to probe generalizability because quite a few of the situations described in the test paragraphs may be culture sensitive if not culture specific.

Despite such limitations and the need for further research, however, the current studies strongly indicate that the deep cloze test is a theoretically well grounded, effective, and efficient assessment tool that can be employed by researchers and educators alike to better understand the reading comprehension skills of young adult readers.