Introduction

The use of Generative Artificial Intelligence (GenAI) in education is proliferating. Built on large language models, GenAI mimics human intelligence and creates new outputs such as images and texts according to users’ input (Chiu, 2023; Dwivedi et al. 2023). Large language models like GPT-3 are trained on large amounts of data to create human-like texts that are seemingly accurate (Floridi & Chiriatti, 2020). GenAI tools have been changing education in terms of teaching, learning, assessment, and administration (Chiu, 2023). When GenAI tools are incorporated into the teaching and learning of science, students need to critically comprehend ChatGPT-generated texts (Tang, 2024), as there are biases in the data on which large language models are trained (Krist & Kubsch, 2023).

Science classrooms have begun to use GenAI-generated texts (e.g., from ChatGPT or Google Bard) to facilitate the learning of science (Cooper, 2023; Bitzenbauer, 2023), especially about climate change. ChatGPT, a GenAI tool that adopts natural language processing models to generate seemingly high-quality texts based on human input (Qin et al. 2023), is capable of communicating complex scientific information in simplified texts (Herget & Alegre, 2023). As educational materials, these simplified scientific texts help individuals understand environmental risks to their health by analysing big data on pollution, exposure to hazardous substances and climate change (Agathokleous et al. 2023; Biswas, 2023). However, ChatGPT has shortcomings in educating students about scientific research on climate change. For example, ChatGPT’s outputs are based on calculating the probability of outcomes in natural language processing rather than on logical reasoning (Dwivedi et al. 2023).

A majority of science education studies have focused on how to embed ChatGPT into assessing students’ scientific understanding (Zhai et al. 2022; Zhai & Nehm, 2023), but only a few studies have revealed how students critically draw on different domains to read outputs generated by artificial intelligence tools such as ChatGPT. Science education should prepare students to become “critical consumers” of arguments and evidence around climate change (Osborne & Pimentel, 2023), as this influences their collective actions to mitigate climate change (Van der Linden et al. 2017). In a similar vein, students should not simply take up scientific information from ChatGPT outputs; rather, they should critically engage with these scientific texts by interpreting the scientific elements, developing awareness of the linguistic features of ChatGPT outputs, and drawing on their epistemic beliefs about science and GenAI. As Yore and Tang (2022) argue, students’ reading of scientific texts is not limited to extracting and reasoning about information; students also develop an awareness of the type of text and draw on their epistemologies of science to comprehend these texts. The scientific texts students read are not limited to those in textbooks (Tang, Lin et al. 2022); they can also be digital texts or ChatGPT outputs. In the context of reading ChatGPT outputs, students draw on their epistemic knowledge of both science and artificial intelligence to reason about this scientific information (Billingsley et al. 2023; Cheung, Pun and Fu, 2023). Holistic reading, a term defined by Cheung et al. (2024), refers to students interpreting disciplinary text segments while also developing genre and epistemic reasoning about these text segments.

Although ChatGPT’s outputs differ according to individuals’ inputs, using an instrument with the same set of questions to consistently investigate students’ real-time holistic reading of ChatGPT outputs requires further work. Despite this limitation, our study is the first attempt to situate students in reading scientific texts in a ChatGPT setting by screen-capturing two pairs of an input command and an output response in the ChatGPT interface, accompanied by a set of questions targeting the three domains; the input commands concern controversial conceptions about climate change. As epistemologies of science are situated in both cognitive-epistemic and social-institutional systems (Erduran & Dagher, 2014), the instrument comprises two ChatGPT conversations: one related to the cognitive-epistemic debate of climate science and the other related to the social-institutional debate of climate science.

Notwithstanding this limitation, this study contributes to the science education literature by theorizing essential dimensions of holistic reading of scientific texts in a ChatGPT setting and by investigating students’ holistic reading of such texts related to climate change. A construct was established to represent three domains of reading scientific texts generated by ChatGPT: the content-interpretation (CI) domain, the genre-reasoning (GR) domain and the epistemic-evaluation (EE) domain. As an initial attempt, such a construct can potentially contribute to the science curriculum by laying out a set of performance domains that support the incorporation of GenAI technologies into reading in science. Using the Rasch partial-credit model, we adopt a construct modelling approach (Wilson, 2023) and describe how we developed a construct map, designed items, set outcome spaces and ran measurement models. The research questions below guide the present study:

RQ1

What is students’ performance in reading scientific texts on climate change in a ChatGPT scenario, specifically in the content-interpretation, genre-reasoning and epistemic-evaluation domains?

RQ2

How did students’ performance in holistic reading socio-scientific texts in a ChatGPT scenario change after a reading-science intervention?

Framing this Study

Holistic Reading of Socio-Scientific Texts on Climate Change Debates

Students’ reading of socio-scientific texts is a critical part of science communication during the climate crisis. Socio-scientific issues can be defined as encompassing two elements: connections to science in conceptual and/or procedural domains, and social significance (Sadler, 2004). Climate change is considered a socio-scientific issue because it is ill-structured, multidimensional, and complex in relation to both science and society (Peel et al. 2017).

Previous science education studies focused on how students detected and reasoned about information from socio-scientific texts on climate change (e.g., Strømsø et al. 2010); only recent studies have shifted this focus to students’ disciplinary and epistemic reading of socio-scientific texts on climate change (e.g., Fazio et al. 2022). For disciplinary reading, there are specific conventions within a discipline, and readers need to critically interpret the content and habits of mind of that discipline (Fang & Coatoam, 2013; Shanahan & Shanahan, 2012). In contrast with the grammatical structures of applied linguistics (e.g., clauses and verbs), when teachers and students engage in science communication, they use specialized terms in science, namely laws, theories and hypotheses (Tang, 2021; Tang & Rappa, 2021). For epistemic reading, readers need to draw on their understanding of how knowledge within disciplines is generated to interpret and evaluate socio-scientific texts (Yore & Tang, 2022). More specifically, students need to evaluate claims in socio-scientific texts by considering the source, certainty, development, and justification of the knowledge presented (Chan, Cheung & Erduran, 2023; Cheung et al. 2023a, 2024). Engaging students in reading disciplinary scientific texts has also been demonstrated to be associated with (Cheung, 2024) and to improve (Chen et al. 2022) their epistemic beliefs about science.

Reading Scientific Texts in a ChatGPT Scenario

The potential of AI-generated (e.g., ChatGPT) narratives on climate change to promote public awareness and engagement has triggered two contrasting sides of argument (Gursesli et al. 2023; Rane et al. 2023). On one hand, ChatGPT can provide a personalized learning experience (Rane et al. 2023) for students to acquire more knowledge about socio-scientific issues such as climate change. On the other hand, several concerns have been raised regarding ChatGPT’s drawbacks in promoting public engagement with the climate crisis. Firstly, using large language models, ChatGPT can provide some references but does not provide accurate in-text citations (Agathokleous et al. 2023). This can be attributed to the fact that ChatGPT paraphrases sentences from large language models and does not formulate ideas on its own (Agathokleous et al. 2023). ChatGPT can provide satisfactory but not totally correct answers in communicating scientific information (Deiana et al. 2023; Salas et al. 2023). More specifically, ChatGPT lacks contextual awareness of how questions regarding climate change are asked, while its data inherit bias because ChatGPT was trained on large datasets available in the market (Biswas, 2023). If students read ChatGPT-generated scientific texts about climate change, they should be critical about how ChatGPT obtains this scientific information, based on their own epistemic understanding of GenAI.

Secondly, as ChatGPT outputs comprise a mix of genres and human-like responses (Fui-Hoon Nah et al. 2023; Kuzman et al. 2023), students need to interpret whether ChatGPT is explaining a scientific phenomenon or providing a human-like argument. Such a differentiation is important for learners to decide what to trust, and which parts of the texts originate from perspectives on climate change issues. According to AlAfnan and MohdZuki (2023), when ChatGPT-4 was asked to produce academic writing and business correspondence, the sentences it formed were mostly imperative in mood and dominated by simple present tense and third-person pronouns. Also, the lexical density and the reading difficulty of ChatGPT-4 outputs are low (AlAfnan & MohdZuki, 2023). Compared to human writing, ChatGPT provides texts with fewer modals, epistemic markers and discourse markers (Herbold et al. 2023). Texts based on large language models might comprise both explanation and argumentation, which might lead to confusion over whether ChatGPT is communicating why things happen or providing a human-like scientific opinion.

Conceptual Framework

Reading scientific texts is holistic in nature (Cheung et al. 2024), encompassing different domains. Three domains underpin this study: (1) the content-interpretation (CI) domain; (2) the genre-reasoning (GR) domain; and (3) the epistemic-evaluation (EE) domain (Fig. 1). A detailed explanation of each level within the construct map is available as Appendix 1 Table S1.

Fig. 1

Construct Map of students’ reading of scientific texts in ChatGPT output

Content-Interpretation (CI) Domain

This domain refers to students’ identification of relationships between elements of the text, drawing on their prior scientific knowledge (Bernholt et al. 2023; Van den Broek, 2010). At a lower level, students identify pieces of key ideas from the scientific texts; at a higher level, students can express the most important information in their own words, drawing on their prior scientific knowledge (Oliveras et al. 2013, 2014).

Genre-Reasoning (GR) Domain

This domain refers to students’ ability to identify the type of scientific genre present in the ChatGPT outputs. Genre “is not merely a collection of similar texts, but it is a “higher-level patterning” of language use that both arises from and realizes the repetitive stages and actions in a communicative event” (Tang, Park et al. 2022, p. 756). In the discipline of science, there are four types of genre, namely explanation, argumentation, information report and experimental account (Tang, 2023): explanatory texts account for the causes and effects of a scientific phenomenon; argumentative texts present evidence to justify a debatable claim; information-reporting texts provide organized information about events and things happening in the natural world; experimental-account texts list the steps for conducting scientific investigations. Each type of genre is characterized by its own specialized linguistic features; for example, argumentative texts are explicit in addressing human subjects (Tang, 2023).

Determining the type of genre in ChatGPT-generated scientific texts is challenging for students and members of the public, as ChatGPT provides human-like responses (Fui-Hoon Nah et al. 2023) that can easily mask scientific genres. Depending on the version of ChatGPT or the prompt initiated by users, there are differences in lexical complexity, nominalization and modals between human-generated argumentative texts and ChatGPT-generated argumentative texts (Herbold et al. 2023). A ChatGPT-generated text can contain a mix of different genres, such as a mix of human-like argumentation and reporting of scientific information.

Epistemic-Evaluation (EE) Domain

This domain refers to how students draw on epistemologies of the fields of AI and science to evaluate scientific claims in ChatGPT outputs. As ChatGPT is based on large language models, it is excellent at answering closed-ended questions that require factual recall (Li et al. 2022). However, when ChatGPT is asked challenging questions such as “Is it still possible to limit warming to 1.5°C?”, it tends to give general answers without an accurate judgement (Vaghefi et al. 2023). On the other hand, ChatGPT has been demonstrated to detect fake claims with 100% accuracy (Caramancion, 2023). Users can input claims about climate change and prompt ChatGPT to evaluate whether the claims are justified by scientific evidence. Even if ChatGPT gives a seemingly true evaluation of misinformation about climate change, we argue that students need to possess the ability to “evaluate what ChatGPT evaluates”, given the neutral stance adopted by ChatGPT on certain occasions in response to climate claims (e.g., Vaghefi et al. 2023).

When students read how ChatGPT evaluates a scientific claim, they need a deeper epistemic understanding of how GenAI and scientific texts are generated. As argued in the seminal review by Cheung et al. (2024) and the exploratory study by Billingsley et al. (2023), students need to understand the interaction between science and GenAI to reason about scientific claims in an informed manner. On one side, students need to be aware that “hallucinated” (non-existent) texts can be generated by ChatGPT, as its generation is based on large language models (Inojosa et al. 2023) rather than merely retrieving information from the Internet. Apart from acknowledging how knowledge is generated by GenAI, students should also evaluate ChatGPT-generated scientific claims by drawing on their epistemic understanding of science. Although ChatGPT may provide scientific evidence justified by multiple sources of experiments, scientific knowledge is subject to change and the authority of scientific knowledge can be challenged (Conley et al. 2004; Lederman et al. 2002).

Methodology

Adopting a construct-driven methodology (Wilson, 2023), this study investigates students’ performance in reading ChatGPT-generated outputs by activating the content-interpretation, genre-reasoning and epistemic-evaluation domains. This methodology allows us to design items that align with the domains of holistic reading, as well as to evaluate students’ performance in each domain by specifying the expected level of performance for each item of the domain. A test (Appendix 1 Document 1) was designed to investigate students’ performance in comprehending ChatGPT-generated texts, and it was administered to a group of students before and after an intervention focusing on disciplinary and epistemic reading of scientific texts.

Context and Participants of this Study

The context of this study is a large-scale government-funded project which aims to improve Hong Kong junior secondary students’ (grades 8 and 9) disciplinary reading of scientific texts from a range of digital sources. The data reported in this study come from two Band 1 state schools in Hong Kong, with 55 students from a girls’ school (School A) and 62 students from a mixed school (School B) (Table 1). Both schools use English as the medium of instruction to teach science. The instrument was administered before and after the programme, and students had 40 min to complete it. However, owing to school activities, students in School B skipped the post-test session. The data presented in this paper comprise pre- and post-test data from School A, as well as pre-test data from School B.

Table 1 Demographics of participants

The Reading-Science Intervention Programme

The reading-science intervention programme aims to improve students’ disciplinary and epistemic reading of science (Cheung et al. 2024). The design of this programme is anchored by three major components: (a) understanding scientific texts as a discipline; (b) reasoning about the nature of scientific texts; and (c) noticing the epistemic nature of science within scientific texts. In the first component, students detected and reasoned about key scientific knowledge, learned scientific vocabulary and metalanguage, identified scientific procedures, and evaluated the quality of evidence presented in scientific texts. In the second component, students identified different types of scientific genres, including explanation, information report, experimental account and argument (Tang, 2023; Tang, Park et al. 2022). In the third component, students were explicitly taught to develop awareness of the source, justification, certainty, and development of knowledge (Conley et al. 2004; Tsai et al. 2011) presented in scientific texts.

The intervention consists of six 40-minute lessons spread over six weeks, integrated into supplementary sessions specifically designed to enhance students’ English reading skills for scientific texts. Although the primary purpose of the programme is to improve students’ disciplinary and epistemic reading of scientific texts, the programme also includes parts that explicitly teach students to notice the interaction between the epistemic dimensions of AI and those of science. For example, students were asked to discuss an extract from Scientific American (Blades, 2021) regarding whether AI can generate hypotheses that humans never think of. The motive for incorporating such materials tangentially in the programme is the emergence of AI technologies interacting with scientific knowledge (Cheung et al. 2024). The reading-science intervention programme was co-delivered by four teachers whose first language is Mandarin or Cantonese (3 females and 1 male), each holding a master’s degree in applied linguistics. They ranged in age from 25 to 27 years (M = 25.5), and each received training in enhancing students’ disciplinary and epistemic reading of scientific texts. As one of the four teachers, the third author first took a series of online training workshops provided by the first author, who is a researcher in science education. These workshops covered topics such as the unique features of scientific texts, different types of scientific genres, and fostering critical evaluation of scientific texts. Afterwards, the third author provided training to the other three teachers in a similar manner, depending on their available schedules.

The Instrument

The instrument provides a scenario in which a simulated student reads claims about climate change and questions ChatGPT about them. To mimic the actions of the simulated student, the research team input popular and debatable claims about climate change into ChatGPT, which generated responses to the claims; the entire conversation between the simulated student and ChatGPT was screen-captured and pasted into the instrument. In the first draft of the instrument, there were screen captures of four ChatGPT conversations, with two conversations focusing on cognitive-epistemic issues related to climate change and two focusing on socio-institutional issues related to climate change. As Erduran and Dagher (2014) argued, holistic science in society comprises a cognitive-epistemic system as well as a social-institutional system. Hence, our instrument targets students’ reading of socio-scientific texts related to both systems. The instrument, with screen captures of four ChatGPT conversations (Conversations A, B, C and D), was piloted twice with 66 participants (male: 35; female: 28; not reported: 3); the pilot findings are not reported in this study. As students could not finish reading four ChatGPT conversations about climate change within 40 min, the research team decided to remove two conversations from the instrument. Further justification of the instrument can be found in Appendix 1.

Interrater and Instrument Reliability

To ensure interrater reliability, the first and the third authors undertook one month of training and rated 50% of students’ responses to the pre-test items according to the stipulated levels in the outcome space. During the process, the authors undertook several rounds of iteration to revise the outcome spaces so that they could capture a wide range of students’ holistic reading ability for ChatGPT-generated socio-scientific texts related to climate change. The inter-rater reliabilities of the items range from 0.84 to 0.97 (Appendix 2 Table S1), indicating good reliability (Cheung & Tai, 2023).
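The paper does not name the agreement coefficient behind the 0.84–0.97 figures. Assuming a chance-corrected statistic such as Cohen’s kappa for two raters assigning ordinal levels, a minimal sketch with hypothetical ratings looks like this:

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters assigning categorical level codes."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    cats = np.union1d(r1, r2)
    n = len(r1)
    # Confusion matrix: rows are rater 1's codes, columns are rater 2's
    conf = np.array([[np.sum((r1 == a) & (r2 == b)) for b in cats] for a in cats])
    po = np.trace(conf) / n                         # observed agreement
    pe = np.sum(conf.sum(0) * conf.sum(1)) / n**2   # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical ratings of 10 student responses on Levels 0-4
a = [0, 1, 2, 2, 3, 4, 1, 0, 2, 3]
b = [0, 1, 2, 3, 3, 4, 1, 0, 2, 3]
print(round(cohens_kappa(a, b), 3))  # → 0.873
```

The choice of coefficient is an assumption for illustration; percent agreement or an intraclass correlation would follow the same two-rater workflow.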

Rasch analysis was conducted as it reveals the psychometric properties of the tasks and aligns students’ ability with item difficulty (Boone et al. 2013). In Rasch analysis, the validity and reliability of the instrument were evaluated with five indicators: unidimensionality, item fit statistics, separation reliability, item thresholds at each level, and the Wright map. Unidimensionality of the instrument, a key assumption of Rasch analysis, was investigated using Rasch Principal Components Analysis (PCA) of residuals (Boone et al. 2013). If the eigenvalue of the first contrast in the Rasch PCA is lower than 2.0, the reading instrument can be said to measure a single construct. Students’ pre-test data were input into RStudio (version 2023.06.0) and the ‘eRm’ package (Mair & Hatzinger, 2007) was used to calculate the PCA of Rasch residuals. The results showed a first-contrast eigenvalue of 1.739, which demonstrates a strong one-dimensional structure for the instrument. Hence, students’ reading of ChatGPT-generated socio-scientific texts related to climate change, as measured by the instrument, is a single construct. Apart from unidimensionality, unweighted and weighted MNSQ values were used to determine whether individual items align with the latent trait under the Rasch model (Bond et al. 2020). Unweighted MNSQ values between 0.7 and 1.3 and weighted MNSQ values between 0.8 and 1.2 were considered acceptable fits (Tesio, 2003). All items’ unweighted and weighted MNSQ values fall within the acceptable ranges (Table 2).
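The study ran this check in R with the ‘eRm’ package. As a language-agnostic illustration of the criterion only (simulated data, not the study’s), the first-contrast eigenvalue can be sketched as the largest eigenvalue of the inter-item correlation matrix of standardized Rasch residuals:

```python
import numpy as np

rng = np.random.default_rng(0)

def first_contrast_eigenvalue(observed, expected, variance):
    """Largest eigenvalue of the inter-item correlation matrix of
    standardized Rasch residuals (the 'first contrast'); values below
    ~2.0 are commonly read as supporting unidimensionality."""
    z = (observed - expected) / np.sqrt(variance)  # standardized residuals
    corr = np.corrcoef(z, rowvar=False)            # item-by-item correlations
    return np.linalg.eigvalsh(corr)[-1]            # eigvalsh sorts ascending

# Toy illustration: simulate dichotomous responses from a true Rasch model,
# so the residuals should carry no second dimension.
ability = rng.normal(size=(500, 1))                # 500 simulated persons
difficulty = np.linspace(-1.5, 1.5, 8)             # 8 simulated items
p = 1 / (1 + np.exp(-(ability - difficulty)))      # Rasch success probabilities
x = rng.binomial(1, p)                             # observed 0/1 responses
ev = first_contrast_eigenvalue(x, p, p * (1 - p))
print(ev < 2.0)  # unidimensional data stays under the 2.0 criterion
```

In practice the expected scores and variances come from the fitted partial-credit model rather than being simulated.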

Table 2 Rasch items and statistic measures for students’ pre-test on holistic reading of ChatGPT-generated socio-scientific texts

Item and person separation reliabilities are also above 0.70, which indicates acceptable values (Chang et al. 2014). In addition, the item thresholds of each item progress along increasing levels of reading competence (Table 2). These threshold statistics indicate that each level correctly distinguishes students with increasing levels of competence on each item. More importantly, the Wright map (Appendix S2 Fig. 1) shows that the instrument aligns well with the ability of the target population, as the performance distribution spreads across the different levels of the items.

Results

Students’ Reading ChatGPT-generated Scientific Texts on Climate Change (RQ1)

A total of 117 students responded to the instrument. Table 2 shows the item measures based on item difficulty parameters on the logit scale representing the latent reading ability, with a higher measure indicating a more difficult item for this population of students. A similar trend, with the content-interpretation domain being the easiest and the epistemic-evaluation domain the most difficult, was observed across Conversations A (Is global warming real?) and C (scientists do research because of getting government grant money). For example, in Conversation A, the item measures for CA-CI1 and CA-CI2 in the content-interpretation domain are −1.481 and −0.198 logits; that for CA-GR1 in the genre-reasoning domain is 0.273 logits; and those for CA-EE1 and CA-EE2 in the epistemic-evaluation domain are 0.296 and 0.851 logits.

Items within the same domain were further investigated and visualized using a stacked bar chart (Fig. 2). For the content-interpretation domain, more students achieved Level 4 in Conversation A (CA-CI1: 35% and CA-CI2: 12%) than in Conversation C (CC-CI1: 0%). In other words, more students could detect all key elements of the scientific concepts in ChatGPT-generated socio-scientific texts and provide full reasoning about these concepts in Conversation A compared to Conversation C. For the genre-reasoning domain, the distribution of students across levels is similar across Conversations A and C, supported by item measures of about 0.27 logits for items CA-GR1 and CC-GR1. This further suggests that, regardless of the text’s focus on cognitive-epistemic or socio-political dimensions, students perform similarly in reasoning about the scientific genres in ChatGPT-generated socio-scientific texts. Most importantly, for the epistemic-evaluation domain, relatively few students achieved Level 4 (CA-EE1: 3.4%; CA-EE2: 0.9%; CC-EE1: 0.9%), where students need to justify their belief about a scientific claim based on their understanding of the nature of science and the nature of AI, as well as drawing connections to information in the socio-scientific texts in the ChatGPT output.

Fig. 2

Students’ pre-test on holistic reading of ChatGPT-generated socio-scientific texts

Owing to the low percentage of students achieving Level 4 in the epistemic-evaluation domain in both types of socio-scientific texts, we conducted further qualitative inductive coding of which aspects of the nature of science and AI students drew on when evaluating information in a ChatGPT setting (Fig. 3). According to Fig. 3(a), most students considered the tentative (CA-EE1: 8.5%; CA-EE2: 4.3%; CC-EE1: 13.7%) and empirical nature of science (CA-EE1: 11.1%; CA-EE2: 7.7%; CC-EE1: 11.1%) when they evaluated the texts generated by ChatGPT. In particular, students expressed an informed understanding that scientific knowledge is subject to change and is based on evidence when they evaluated texts related to climate change in a ChatGPT setting.

Moreover, according to Fig. 3(b), students tended to evaluate texts related to climate change generated by ChatGPT with the reason that “AI cannot know everything” (CA-EE1: 4.3%; CA-EE2: 7.7%; CC-EE1: 4.3%). Although this reason is rather generic and could equally be applied to the domain of science, students also expressed some AI-specific considerations when evaluating the texts. These considerations include “AI has societal influence” (CA-EE1: 0.9%), “AI searches information from the Internet” (CA-EE1: 1.7%; CC-EE1: 2.6%), and “AI is not up to date” (CA-EE1: 1.7%; CA-EE2: 1.7%; CC-EE1: 6.8%). More importantly, students considered the extent to which AI resembles humans when they evaluated the texts in a ChatGPT setting. For example, they expressed that “AI is not smart enough” (CC-EE1: 0.9%) and “not a human” (CA-EE1: 1.7%; CC-EE1: 1.7%). Interestingly, compared to their evaluation of the cognitive-epistemic text on climate change (CA-EE1: 1.7%; CA-EE2: 1.7%), more students (CC-EE1: 6.8%) thought that ChatGPT was not up to date in communicating the socio-institutional text on climate change. However, these features of AI considered by students do not take full account of ChatGPT, including the fact that the texts generated are influenced by the large language models set by the developers.

Fig. 3

Students’ consideration of (a) nature of science and (b) nature of AI in evaluating socio-scientific texts in ChatGPT settings (n = 117)

Change in Students’ Holistic Reading of ChatGPT-Generated Socio-Scientific Texts after a Reading-Science Intervention (RQ2)

To examine whether students’ holistic reading of ChatGPT-generated socio-scientific texts changed after the reading-science intervention, we conducted Wilcoxon signed-rank tests at α = 0.05 on each item in a sample of the 55 students who completed both the pre- and post-tests. Paired-sample t-tests were not conducted because the assumption of normality was not satisfied (Schucany & Tony Ng, 2006). To conduct paired-sample t-tests, the skewness and kurtosis of items need to fall within the threshold range of −2 to +2 suggested by Brown (2015). However, two pre-test items violated this range, with the kurtosis of CA-CI1 being 4.021 and the skewness of CA-EE1 being 2.852. Hence, we used Wilcoxon signed-rank tests to compare the change in students’ reading performance on each item, as the Wilcoxon signed-rank test can be applied to non-parametric data (Woolson, 2007).
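The decision procedure above can be sketched with SciPy, using hypothetical paired Level 0–4 scores rather than the study’s data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical Level 0-4 scores on one item for 55 students, pre and post
pre = rng.integers(0, 5, size=55).astype(float)
post = np.clip(pre + rng.integers(0, 2, size=55), 0, 4).astype(float)

# Screen for normality: skewness and excess kurtosis should lie in [-2, +2]
skew, kurt = stats.skew(pre), stats.kurtosis(pre)
meets_t_test_assumption = (-2 <= skew <= 2) and (-2 <= kurt <= 2)

# Non-parametric alternative: Wilcoxon signed-rank test on the paired scores
# (SciPy drops zero differences by default, matching the classic procedure)
res = stats.wilcoxon(pre, post)
print(f"skew={skew:.2f}, kurtosis={kurt:.2f}, p={res.pvalue:.4f}")
```

In the study this screen-then-test logic was applied item by item; the simulated post-test scores here are constructed to improve, so the test detects a significant gain.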

Table 3 shows the means of students’ pre- and post-tests, as well as the significance values at the α = 0.05 level. After the reading-science intervention, participants made a substantial gain in reading comprehension of socio-scientific texts generated by ChatGPT. The differences in participants’ performance on all items were statistically significant, with p-values smaller than 0.001 for seven items and smaller than 0.003 for Item CC-CI1. This suggests that a reading-science intervention can potentially improve all domains of students’ reading of socio-scientific texts about climate change in a ChatGPT setting, namely the content-interpretation, genre-reasoning and epistemic-evaluation domains.

Table 3 Statistical significance for students’ pre- and post-test results for each item using the Wilcoxon signed ranks tests

We also further explored changes in students’ consideration of different aspects of science and AI when they evaluated socio-scientific texts related to climate change in a ChatGPT setting (Table 4). Regarding the cognitive-epistemic text (Conversation A), an increasing percentage of students considered the tentative nature of science after the reading-science intervention, which focuses on developing students’ disciplinary and epistemic reading of scientific texts (CA-EE1: +24%; CA-EE2: +20%). Such an increase was not observed for the socio-political text (Conversation C). More importantly, compared to the nature of science, there was only a small increase in the percentage of students considering different aspects of AI when they evaluated texts in both conversations. For example, more students considered “AI is not reliable” (CA-EE1: +2%; CC-EE1: +2%) and “AI is non-empirical” (CA-EE1: +4%; CC-EE1: +4%) when they evaluated the conversations about climate change in the ChatGPT setting. CA-EE1 concerns ChatGPT’s claim about whether “global warming is real”, and CC-EE1 concerns ChatGPT’s claim that “While government grants do support climate research, these grants are awarded to advance scientific knowledge and inform evidence-based policy decisions, not to promote a particular agenda”. As these claims might be more debatable than CA-EE2 (“human-induced global warming is a significant threat to our planet, ecosystems and future generations.”), the intervention likely improved students’ ability to discuss the reliability and empirical nature of AI when evaluating the claims. Notably, a small percentage of students also came to consider that “AI searches information from the Internet” when evaluating ChatGPT’s claims in CA-EE1 (+5%) and CA-EE2 (+2%), but not in CC-EE1.

Table 4 Epistemic considerations when students evaluate texts generated by ChatGPT

Discussion

This study is an exploratory investigation of how students drew on different domains of reading to comprehend socio-scientific texts related to climate change in a ChatGPT scenario. In particular, we explored three domains of students’ reading performance, namely the content-interpretation, genre-reasoning and epistemic-evaluation domains (RQ1). We also investigated how a reading-science intervention, which focuses on disciplinary and epistemic reading of scientific texts, improved these domains of holistic reading of ChatGPT outputs (RQ2).

Different Domains of Holistic Reading of Socio-Scientific Texts Related to Climate Change in a ChatGPT Scenario

The content-interpretation domain was found to be the easiest, while the epistemic-evaluation domain was the most difficult, across both socio-scientific texts (Conversations A and C). Hence, we proposed a hierarchical structure (Fig. 4) that delineates students’ varied performance in the different domains of reading socio-scientific texts in a ChatGPT scenario. These results were consistent with Cheung et al.’s (2024) study, in which the content domain was the easiest while the epistemic domain was the most difficult. However, the novel contribution of this study is the finding that students’ reading performance in the genre-reasoning domain was higher than in the epistemic-evaluation domain but lower than in the content-interpretation domain. As ChatGPT generates socio-scientific texts based on large language models (Rocha et al. 2023), it is difficult for ChatGPT to communicate science with a specific purpose (Inojosa et al. 2023). Hence, it is more difficult for students to identify and reason about the type of scientific genre in socio-scientific texts related to climate change in the ChatGPT scenario. More importantly, students found it difficult to draw on both the nature of science and the nature of AI to evaluate ChatGPT outputs. This might be because school science offers few explicit opportunities to develop students’ epistemological understanding of AI relative to scientific knowledge (Cheung et al. 2024). For developing students’ epistemological understanding of science, scholars have advocated an explicit-reflective approach that provides structured opportunities for students to reflect on the characteristics of science within classroom contexts (Khishfe, 2013; Khishfe & Abd-El-Khalick, 2002). As the discussion of the nature of AI in relation to the nature of science is still emerging in the science education literature (Cheung et al. 2024; Billingsley et al. 2023), there is little research exploring how students draw on epistemic domains to holistically read scientific texts generated by ChatGPT. We argue that more empirical research is needed on incorporating an explicit-reflective approach into school science and literacy lessons, allowing students to reflect on the interaction between the nature of science and the nature of AI when they read socio-scientific texts generated by ChatGPT.

Fig. 4

A hierarchical structure delineating different domains of holistic reading of socio-scientific texts related to climate change in a ChatGPT scenario

Secondly, students expressed some uninformed understandings of how ChatGPT generates texts. For example, the dominant epistemic understanding of AI that students expressed was “AI cannot know everything”, without further justification of why AI cannot know everything. More importantly, some students equated ChatGPT with Google search by expressing that “AI search information from the Internet” (CA-EE1 and CC-EE1). In fact, ChatGPT uses a transformer architecture coupled with a deep learning approach to generate responses to the prompts input by users (AlAfnan & MohdZuki, 2023; Inojosa et al. 2023). Despite these uninformed understandings of the nature of AI, students also expressed some informed understandings, such as the subjective nature of AI, the non-up-to-date nature of ChatGPT, and the non-empirical nature of AI. Therefore, the results of this study are fruitful in helping science teachers understand what students know about the nature of AI when they read socio-scientific texts generated by ChatGPT. Teachers and science education researchers can then design interventions to address students’ misconceptions about AI when students use ChatGPT in reading to learn science.

The Reading-Science Intervention

The reading-science intervention, which focuses on disciplinary and epistemic reading of scientific texts and partly included the development of students’ epistemological understanding of AI, significantly improved all domains of reading socio-scientific texts in the ChatGPT scenario. Nevertheless, the post-test means of students’ performance in the genre-reasoning and epistemic-evaluation domains stayed between Levels 1 and 2 in both Conversations A and C. In the genre-reasoning domain, students could identify the name and description of the scientific genres present in the ChatGPT-generated socio-scientific texts. However, even after the disciplinary and epistemic reading-science intervention, most students could not justify these scientific genres with reference to the corresponding linguistic features or relevant examples. In the science education literature, studies have focused on exploring scientific genres present in curriculum materials (Tang, 2023) and classroom discourse (Tang, 2022). The findings of this study call for teachers to guide students to identify the linguistic features of socio-scientific texts generated by ChatGPT and to determine the scientific genre of each section within the texts.

It is also interesting to note that when students were exposed to more debatable claims on climate change in a ChatGPT scenario, slightly more students considered the reliability and empirical nature of AI to evaluate these claims (Tang and Cooper, 2024). For example, in their post-instruction responses to Conversation C, when students read claims regarding the authenticity of climate change and scientists’ financial motives for conducting climate research, more students viewed such claims portrayed by AI as unreliable and not supported by empirical evidence generated by AI itself. This might be because the debatable nature of these claims on climate change is more conducive to improving students’ epistemic understanding of how AI works to generate claims. To be effective, a reading-science intervention might therefore target students’ epistemic reading of debatable claims in a ChatGPT scenario. Apart from the nature of AI, more than 20% of students could draw on the tentative nature of science to evaluate Conversation A, which shows that the disciplinary and epistemic reading-science intervention can successfully improve students’ views of how scientific evidence on climate science changes over time.

Limitations and Challenges

The major limitations of this study should be acknowledged. Although this study situates students in reading socio-scientific texts regarding climate change, it does not engage students in interacting with ChatGPT. This is because, as a deep learning model, ChatGPT generates varied texts according to users’ prompts, which makes consistent measurement of students’ holistic reading of socio-scientific texts difficult. We counterbalanced this limitation by screen-capturing a hypothetical interaction between a student and ChatGPT regarding common claims about climate change. Our research team has been working to design an app that incorporates ChatGPT interfaces to consistently measure students’ holistic reading of scientific texts. Another limitation is that only students from a single-sex school participated in the reading-science intervention, so the findings might apply only to one gender. We envisage that future research will generate a pedagogical framework that develops students’ understanding of the nature of AI and the nature of science when they read scientific texts. More importantly, as the reading intervention involved only an experimental group, future research can compare students’ holistic reading of socio-scientific texts in ChatGPT outputs by setting up both an experimental group and a control group.