An empirical analysis of the relationship between nature of science and critical thinking through science definitions and thinking skills

Critical thinking (CRT) skills transversally pervade education and nature of science (NOS) knowledge is a key component of science literacy. Some science education researchers advocate that CRT skills and NOS knowledge have a mutual impact and relationship. However, few research studies have undertaken the empirical confirmation of this relationship and most fail to match the two terms of the relationship adequately. This paper aims to test the relationship by applying correlation, regression and ANOVA procedures to the students’ answers to two tests that measure thinking skills and science definitions. The results partly confirm the hypothesised relationship, which displays some complex features: on the one hand, the relationship is positive and significant for the NOS variables that express adequate ideas about science. However, it is non-significant when the NOS variables depict misinformed ideas about science. Furthermore, the comparison of the two student cohorts reveals that two years of science instruction do not seem to contribute to advancing students’ NOS conceptions. Finally, some interpretations and consequences of these results for scientific literacy, teaching NOS (paying attention both to informed and misinformed ideas), for connecting NOS with general epistemic knowledge, and assessing CRT skills are discussed.


Introduction
Among other objectives, school science education perennially aims to improve scientific literacy for all, which should be useful and functional for making adequate and sound decisions in personal and social daily life. An essential component of scientific literacy is the knowledge "about" science, that is, knowledge about how science works, validates its knowledge and intervenes in the world (along with technology). This study focuses on the knowledge about science, which is often referred to in the literature as nature of science (NOS), scientific practice, ideas about science, etc., and which is related to a continuous innovative teaching tradition (Vesterinen et al., 2014; Khishfe, 2012; Lederman, 2007; Matthews, 2012; McComas, 1996; Olson, 2018; among others).
On the other hand, some international reports and experts state that critical thinking (CRT) skills are key and transversal competencies for all educational levels, subjects and jobs in the 21st century. For instance, the European Union (2014) proposes seven key competencies that require developing a set of transversal skills, namely CRT, creativity, initiative, problem-solving, risk assessment, decision-making, communication and constructive management of emotions. In the same vein, the National Research Council (2012) proposes the transferable knowledge and skills for life and work, which explicitly details the following skills: argumentation, problem-solving, decision-making, analysis, interpretation, creativity, and others. In short, these and many other proposals converge in pointing out that teaching students to think and educating in CRT skills is an innovative and significant challenge for 21st century education and, of course, for science education. The CRT construct has been widely developed within psychological research. Yet, the field is complex, and terminologically bewildering (i.e., higher-order skills, cognitive skills, thinking skills, CRT, and other terms are used interchangeably), and some controversies are still unresolved. For instance, scholars do not agree on a common definition of CRT, and the most appropriate set of skills and dispositions to depict CRT is also disputed. As the differences among scholars still persist, the term CRT will be adopted hereafter to generally describe the variety of higher-order thinking skills that are usually associated in the CRT literature.
Further, some science education research currently suggests connections between NOS and CRT, arguing that CRT skills and NOS knowledge are related. Some claim that thinking skills are key to learning NOS (Erduran & Kaya, 2018; Ford & Yore, 2014; García-Mila & Andersen, 2008; Simonneaux, 2014), and specifically, that argumentation skills may enhance NOS understanding (Khishfe et al., 2017). In contrast, as argumentation skills are a key competence for the construction and validation of scientific knowledge, other studies claim that NOS knowledge (e.g., understanding the differences between data and claims) is also key to learning CRT skills such as argumentation (Allchin & Zemplén, 2020; Greene et al., 2016; Settlage & Southerland, 2020). Both directions of this intuitive relationship between CRT skills and NOS are fruitful ways to enhance scientific literacy and general learning. Hence, this study aims to empirically explore the NOS-CRT relationship, as the prior literature is somewhat unclear and its contributions are limited, as will be shown below.

Theoretical contextualization
This study copes with two different, vast and rich realms of research, namely NOS and CRT, and their theoretical frameworks: the interdisciplinary context of philosophy, sociology, and history of science and science education for NOS; and psychology and general education for CRT skills. Both frameworks are summarized below to meet the journal space limitations.
Under the NOS label, science education has developed a fertile and vast realm of "knowledge about scientific knowledge and knowing", which is obviously a particular case of human thinking, and probably the most developed to date. NOS represents the meta-cognitive, multifaceted and dynamic knowledge about what science is and how science works as a social way of knowing and explaining the natural world (knowledge construction and validation). This knowledge has been elaborated in an interdisciplinary way from history, philosophy, sociology of science and technology, and other disciplines. Scholars have raised many and varied NOS issues (Matthews, 2012), which are relevant to scientific research and widely surpass the reduced consensus view (Lederman, 2007). Despite NOS complexity, it has been systematized across two broad dimensions: epistemological and social (Erduran & Dagher, 2014; Manassero-Mas & Vázquez-Alonso, 2019). The epistemological dimension refers to the principles and values underlying knowledge construction and validation, which are often described as the scientific method, empirical basis, observation, data and inference, tentativeness, theory and law, creativity, subjectivity, demarcation, and many others. The social dimension refers to the social construction of scientific knowledge and its social impact. It often deals with the scientific community and institutions, social influences, and general science-technology-society interactions (peer evaluation, communication, gender, innovation, development, funding, technology, psychology, etc.).
From its beginning, NOS research has agreed that students (and teachers) hold inadequate and misinformed beliefs on NOS issues across different educational levels and contexts. Further, researchers agree that effective NOS teaching requires explicit and reflective methods to overcome the many learning barriers (Bennássar et al., 2010; García et al., 2011; Cofré et al., 2019; Deng et al., 2011). These barriers relate to the basic processes of gathering (observation) and elaborating (analysis) data, decision-making in science, and specifically, the inability to differentiate facts and explanations and adequately coordinate evidence, justifications, arguments and conclusions; the lack of elementary meta-cognitive and self-regulation skills (e.g., the quick jump to conclusions as self-evident); the introduction of personal opinions, inferences, and reinterpretations and the dismissal of the counter-arguments or evidence that may contradict personal ideas (García-Mila & Andersen, 2008; McDonald & McRobbie, 2012).
As these barriers point directly to the general abilities involved in thinking (observation, analysis, answering questions, solving problems, decision-making and the like), researchers attribute those difficulties to the lack of the cognitive skills involved in the adequate management of the barriers, whose higher-order cognitive nature corresponds to many CRT skills (Kolstø, 2001; Zeidler et al., 2002). Thus, the solutions to overcome the barriers imply mastering the CRT skills, and, consequently, achieving successful NOS learning (Ford & Yore, 2014; McDonald & McRobbie, 2012; Simonneaux, 2014). Erduran and Kaya (2018) argue that the perennial aim of developing students' and teachers' NOS epistemic insights still remains a challenge for science education, despite decades of NOS research, due to the many aspects involved. They conclude that NOS knowledge critically demands higher-order cognitive skills. The paragraphs below elaborate on these higher-order cognitive skills or CRT skills.

Critical thinking
As previously stated, scholars in the CRT field differ considerably on the conceptualization and composition of CRT. Ennis' (1996) simple definition of CRT as reasonable reflective thinking focused on deciding what to believe or do is likely the most celebrated definition among many others. A Delphi panel of experts defined CRT as an intentional and self-regulated judgment, which results in interpretation, analysis, evaluation and inference, as well as the explanation of the evidentiary, conceptual, methodological, criterial or contextual considerations on which that judgment is based (American Psychological Association, 1990).
However, the varied set of skills associated with CRT is controversial (Fisher, 2009). For instance, Ennis (2019) developed an extensive conception of CRT through a broad set of dispositions and abilities. Similarly, Madison (2004) proposed an extensive and comprehensive list of skills (Table 1).
The development of CRT tests has contributed to clarifying the relevance of the many CRT skills, as a test's functionality requires concentrating on a few skills. For instance, Halpern's (2010) questionnaire assesses, through everyday situations, problem-solving, verbal reasoning, probability and uncertainty, hypothesis-testing, argument analysis and decision-making. Watson and Glaser's (2002) instrument assesses deduction, recognition of assumptions, interpretation, inference, and evaluation of arguments. The California Critical Thinking Skills Test assesses analysis, evaluation, inference, deduction and induction (Facione et al., 1998). It is also worth mentioning that most CRT tests target adults, although the Cornell Critical Thinking Tests (Ennis & Millman, 2005) were developed for a variety of young people and address several CRT skills (X test, induction, deduction, credibility, and identification of assumptions; Class Test, classical logical reasoning from premises to conclusion, etc.). The large number of CRT skills led scholars to undertake efforts of synthesis and refinement, which are summarized through some exemplary proposals (Table 1).
The CRT psychological framework presented above places the complex set of skills within the high-level cognitive constructs whose practice involves a self-directed, self-disciplined, self-supervised, and self-corrective way of thinking that presupposes conscious mastery of skills and conformity with rigorous quality standards. In addition to skills, CRT also involves effective communication and attitudinal commitment to intellectual standards to overcome the natural tendencies to fallacy and bias (self-centeredness and socio-centrism).
Table 1 Exemplary categories that synthesize the critical thinking skills from seminal authors: Madison (2004); Ennis (2019); Paul & Nosich (2008); Manassero-Mas and Vázquez-Alonso (2019); Fisher (2021). Example category: recognizing and clarifying problems, claims, arguments, and explanations.

Science education and thinking skills
CRT skills mirror the scientific reasoning skills of scientific practice, and vice versa, based on their similar contents. This intuitive resemblance may raise expectations of their mutual relationship. Science education research has paid increasing attention to CRT skills as promoters of meaningful learning, especially when involving NOS and understanding of socio-scientific issues (Vieira et al., 2011; Torres & Solbes, 2016; Vázquez-Alonso & Manassero-Mas, 2018; Yacoubian & Khishfe, 2018, among others). Furthermore, Yacoubian (2015) elaborated several reasons to consider CRT a fundamental pillar for NOS learning. Some authors stress the convergence between science and CRT based on the word critical, as thinking and science are both critical. Critical approaches have always been considered consubstantial to science (and likely a key factor of its success), as their range spreads from specific critical social issues (e.g., scientific controversies, social acceptance of scientific knowledge, social coping with a virus pandemic) to the socially organized scepticism of science (e.g., peer evaluation, scientific communication). The latter is considered a universal value of scientific practice to guarantee the validity of knowledge (Merton, 1968; Osborne, 2014). In the context of CRT research, the term critical involves normative ways to ensure the quality of good thinking, such as open-minded abilities and a disposition for relentless scrutiny of ideas, criteria for evaluating the goodness of thinking, adherence to the norms, standards of excellence, and avoidance of errors and fallacies (traits of poor thinking). These obviously also apply to scientific knowledge through peer evaluation practice, which represents a superlative form of good normative thinking (Bailin, 2002; Paul & Elder, 2008).
Another important feature of the convergence of CRT and science is the broad set of common skills sharing the same semantic content in both fields, even though their names may seem different. Induction, deduction, abduction, and, in general, all kinds of argumentation skills, as well as problem-solving and decision-making, exemplify key tools of scientific practice to validate and defend ideas and develop controversies, discussions, and debates. Concurrently, they, too, are CRT skills (Sprod, 2014; Vieira et al., 2011; Yacoubian & Khishfe, 2018). In addition, Santos' (2017) review suggests the following tentative list of skills: observation, exploration, research, problem-solving, decision-making, information-gathering, critical questions, reliable knowledge-building, evaluation, rigorous checks, acceptance and rejection of hypotheses, clarification of meanings, and true conclusions. Beyond skill names and focusing on their semantic content, Manassero-Mas and Vázquez-Alonso (2020a) developed a deeper analysis of the skills usually attributed to scientific thinking and critical thinking, concluding that their constituent skills are deeply intertwined and much more coincident than different. This suggests that scientific and critical thinking may be considered equivalent concepts across the many shared skills they put into practice. However, equivalence does not mean identity, as important differences may still exist. For instance, the evaluation and judgment of ideas involved in organized scientific skepticism (e.g., peer evaluation) are much more demanding and deeper in scientific practice than in daily life thinking realms.
In sum, research on the CRT and NOS constructs is plural, as they draw from two different fields and traditions, general education and cognitive psychology, and science education, respectively. However, CRT and NOS share many skills, processes, and thinking strategies, as they both pursue the same general goal, namely, to establish the true value of knowledge claims. These shared features provide further reasons to investigate the possible relationships between NOS and CRT skills.

Research involving nature of science and thinking skills
The research involving both constructs is heterogeneous, as the operationalisations and methods are quite varied, given the pluralized nature of NOS and thinking. For example, Yang and Tsai (2012) reviewed 37 empirical studies on the relationship between personal epistemologies and science learning, concluding that research was heterogeneous along different NOS orientations: applications of Kuhn's (2012) evolutionary epistemic categories, use of general epistemic knowledge categories, studies on epistemological beliefs about science (empiricism, tentativeness, etc.), and applications of other epistemic frameworks. The studies dealing with the epistemological beliefs about science were a minority. Another example of heterogeneity comes from Koray and Köksal's (2009) study about the effect of laboratory instruction versus traditional teaching on creativity and logical thinking in prospective primary school teachers, where the laboratory group showed a significant effect in comparison to the traditional group. However, the NOS contents involved in laboratory instruction are still unclear. Dowd et al. (2018) examined the relationship between written scientific reasoning and eight specific CRT skills, finding that only three aspects of reasoning were significantly related to one skill (inference) and negatively to argument.
A series of studies suggest implicit relationships between NOS and thinking skills. Yang and Tsai (2010) interviewed sixth-graders to examine two uncertain science-related issues, finding that children who developed more complex (multiplistic) NOS knowledge displayed better reflective thinking and coordination of theory and evidence. Dogan et al. (2020) compared the impact of two epistemic-based methodologies (problem-based and history of science) on the creativity skills of prospective primary school teachers, finding that the problem-solving approach was more effective in increasing students' creative thinking. Khishfe (2012) and Khishfe et al. (2017) found no differences in decision-making and argumentation in socioscientific issues regarding NOS knowledge, but more participants in the treatment groups referred their post-decision-making factors to NOS than the other groups. Other studies found relationships between NOS understanding and variables that do not match CRT skills precisely. For instance, Bogdan (2020) found that inference and tentativeness relate to attitudes toward the role of science in social progress, but creativity does not, and the same applies to the acceptance of the evolution theory (Cofré et al., 2017; Sinatra et al., 2003).
Another set of studies comes from science education research on argumentation, which is based on the rationale that argumentation is a key scientific skill for validating knowledge in scientific practice. Thus, reasoning skills should be related to NOS understanding. Students who viewed science as dynamic and changeable were likely to develop more complex arguments (Stathopoulou & Vosniadou, 2007). In a flotation experience, Zeineddin and Abd-El-Khalick (2010) found that the stronger the epistemic commitments, the greater the quality of the scientific reasoning produced by the individuals. Accordingly, the term epistemic cognition of scientific argumentation has been coined, although specific research on argumentation and epistemic cognition is still relatively scarce (He et al., 2020).
Weinstock's (2006) review suggested that people's argumentation skills develop in proportion to their epistemic development, which Noroozi (2016) also confirmed. Further, Mason and Scirica (2006) studied the contribution of general epistemological comprehension to argumentation skills in two readings, finding that participants at the highest level of epistemic comprehension (evaluative) generated better quality arguments than participants at the previous multiplistic stage (Kuhn, 2012). In addition, the review of Rapanta et al. (2013) on argumentative competence proposed a three-dimensional hierarchical framework, where the highest level is epistemological (the ability to evaluate the relevance, sufficiency, and acceptability of arguments). Again, Henderson et al. (2018) discussed the key challenges of argumentation research and pointed to students' shifting epistemologies about what might count as a claim or evidence or what might make an argument persuasive or convincing, as well as developing valid and reliable assessments of argumentation. On the contrary, Yang et al. (2019) found no significant associations between general epistemic knowledge and the performance of scientific reasoning in a controversial case with undergraduates.
From science education, González-Howard and McNeill (2020) analysed middle-school classroom interactions in critique argumentation when an epistemic agency is incorporated, indicating that the development of students' epistemic agency shows multiple and conflating approaches to address the tensions inherent to critiquing practices and to fostering equitable learning environments. This idea is further developed in the special section on epistemic tools of Science Education (2020), which highlights the continual need to accommodate and adapt the epistemic tools and agencies of scientific practices within classrooms while taking into account teaching, engineering, sustainability, equity and justice (González-Howard & McNeill, 2020; Settlage & Southerland, 2020).
Finally, some of the above-mentioned research used a noteworthy concept of epistemic knowledge (EK) as "knowledge about knowledge and knowing" (Hofer & Pintrich, 1997), which has been developed in mainstream general education research and involves some meta-cognitions about human knowledge that research has largely connected to general learning and CRT skills (Greene et al., 2016). Obviously, EK and NOS knowledge share many common aspects (epistemic), suggesting a considerable overlap between them. However, it is noteworthy that NOS research is oriented toward CRT skills impacting NOS learning, while EK research orientates toward EK impacting CRT skills and general learning.
Regarding the Likert formats for research tools, test makers are concerned about the control of response biases that cause a lack of true reflection on the statement content and may damage the fidelity of data and correlations. Respondents' tendency to agree with statements (acquiescence bias) is widespread. Further, neutrality bias and polarity bias reflect respondents' propensity to choose fixed score points of the scale, either the midpoints (neutrality) or the extreme scores (polarity), the latter being either extreme high scores (positive bias) or extreme low scores (negative bias). To mitigate biases, experts recommend avoiding the exclusive use of positively worded statements within the instruments and combining positive and reversed items. This recommendation has been implemented here using three categories for NOS phrases that operationalize positive, intermediate and reversed statements (Vázquez et al., 2006; Kreitchmann et al., 2019; Suárez-Alvarez et al., 2018; Vergara & Balluerka, 2000). However, the use of varied styles for phrases harms the instrument's reliability and validity, and reliability is underestimated (Suárez-Alvarez et al., 2018).
All in all, the theoretical framework is twofold: CRT and NOS research. The above-mentioned research shares the hypothesis that the relationship between NOS and CRT skills matters. However, it displays a broad heterogeneity of research methods, variables and instruments, as well as mixed results on the NOS-CRT relationship, which do not allow a common methodological standpoint. Further, mainstream research focuses on college students and argumentation skills. In this regard, this study first aims to empirically research the NOS-CRT relationship by applying standardized assessment tools for both constructs. This promotes comparability among researchers and provides quick diagnostic tools for teachers. Secondly, this study addresses younger students, which involves the creation of NOS and CRT tools adapted to young participants, for which some test validity and reliability data are provided. The research questions within this framework are: Do NOS knowledge and CRT skills correlate? What are the traits and limiting conditions of this relationship, if any?

Materials and methods
The data gathering took place in Spain in the year 2018. At this time, the enacted school curriculum lacked the international standards and specific curriculum proposals about CRT and NOS issues, so NOS issues could be related only implicitly to some curricular contents about scientific research. Despite this lack of curricular emphasis, the principals of the participant Spanish schools expressed interest in diagnosing students' thinking skills and NOS knowledge and agreed with the authors on the specific CRT and NOS skills to be tested. As the Spanish school curriculum does not emphasize CRT and NOS issues, the students are expected to be equally trained, and this context conditioned the design of tentative tests through simple contents and an open-ended format, as they are cheap and easy to administer and interpret.

Participants
The participant schools (17) included some public (4) and state-funded private schools (13) that spread across mixed socio-cultural contexts and large, medium, and small Spanish townships. The participant students were tested in their natural school classes (29) of the two target grades. The valid convenience samples are two cohorts of students, each representing students of 6th grade of Primary Education (PE6) (n = 434; 54.8% girls and 45.2% boys; mean age 11.3 years) and 8th grade of Secondary Compulsory Education (SCE8) (n = 347; 48.5% girls and 51.5% boys; mean age 13.3 years). In Spain, 6th grade is the last year of the primary stage (11-12-year-old students), and the 8th grade is the second year of the lower secondary compulsory stage (13-14-year-old students).

Instruments
Two assessment tools were tailored by researchers (a CRT skill test and a NOS scenario) to operationalise CRT and NOS to empirically check their relationships. As the Spanish school curriculum lacks CRT standards, the specific thinking skills that represent the CRT construct were agreed upon between principals and researchers. The design of the tool to assess NOS knowledge took into account that NOS was not explicitly taught in Spanish schools. Both tools were designed to match the schools' interests and the students' developmental level; the latter particularly led to choosing a simple NOS issue (definition of science) to match the primary students' capabilities better.

Thinking challenge tests
Two CRT thinking skill tests were developed for the two participant cohorts (PE6 and SCE8). The design aligns with the tradition of most CRT standardised tests that concentrate assessment on a few selected thinking skills (e.g., Ennis & Millman, 2005; Halpern, 2010). The test for the 6th-graders (PE6) assesses five skills: prediction, comparison and contrast, classification, problem-solving and logical reasoning. The test for the 8th-graders (SCE8) assesses causal explanation, decision-making, parts-all relationships, sequence and logical reasoning.
As most CRT tests are designed for adults, many tests and item pools were reviewed to select suitable items for younger students. The selection criteria were the fit of the items' cognitive demand with students' age, the addressed skill and the motivational challenge for students. Moreover, items must be readable, understandable, adequate, and interesting for the participant students. Then, two tests (45 items and 38 items) were agreed on and piloted. Their results are described elsewhere (Manassero-Mas & Vázquez-Alonso, 2020b). The items were examined by the authors according to their reliability, correlation and factor analysis to eliminate unfair items. Again, the former criteria were used to add new items to form the two new 35-item Thinking Challenge Tests (TCT) to assess the CRT skills of this study.
The items of the first two skills were drawn from the Cornell (Nicoma) test, which evaluates four CRT skills through the information provided by a fictional story about some explorers of the Nicoma planet and asks questions about the story. Some items from prediction and comparison skills were drawn for the 6th-grade TCT (PE6), and some items from causal explanation and decision-making skills were drawn for the 8th-grade TCT (SCE8). The two TCT include three additional items on logical reasoning that were selected from the 78-item Class-Reasoning Cornell Test (Ennis & Millman, 2005). One item was also drawn from the 25-situation Halpern CRT test (Halpern, 2010) for the problem-solving skill of the PE6 test. The authors adapted the remaining figurative items (Table 2) to enhance students' challenge, understanding, and motivation and make the TCT free of school knowledge (Appendix).
Overall, the TCT items pose authentic culture-free challenges, as their contents and cognitive demands are not related to or anchored in any prior school curricular knowledge, especially language and mathematics. Therefore, the TCT are intended to assess culture-free thinking skills.
The item formats involve multiple-choice and Likert scales with appropriate ranges and rubrics that facilitate quick and objective scoring and the elaboration of increasing adjustment between items' cognitive demand and their corresponding skill, thereby leading to further revision based on validity and reliability improvement. This format also allows setting standardised baselines for hypothesis-testing through comparisons of research, educational programs, and teaching methodologies.

Nature of science assessment
A scenario on science definitions is used to assess the participants' NOS understanding because this simple issue may better fit the lack of explicit NOS teaching and the developmental stage of the young students, especially the youngest 6th-graders.
The scenario provides nine phrases that convey an epistemic, plural and varied range of science definitions, and respondents rate their agreement-disagreement with the phrases on a 9-point Likert scale (1 = strongly disagree, 9 = strongly agree) to allow better nuancing of their NOS beliefs and avoid psychometric objections to the scale intervals. The scenario is drawn from the "Views on Science-Technology-Society" (VOSTS) pool that Aikenhead and Ryan (1992) developed empirically by synthesizing many students' interviews and open answers into some scenarios, written in simple, understandable, and non-technical language. They consider that VOSTS items have intrinsic validity due to their empirical development, as the scenario phrases come from students, not from researchers or a particular philosophy, thus avoiding the immaculate perception bias and ensuring students' understanding. Lederman et al. (1998) also consider VOSTS a valid and reliable tool for investigating NOS conceptions. Manassero et al. (2003) adapted the scenarios into the Spanish language and contexts, and developed a multiple-rating assessment rubric, based on the phrase scaling achieved through expert judges' consensus. The rubric assigns indices whose empirical reliability has been presented elsewhere (Vázquez et al., 2006; Bennássar et al., 2010).

Procedure
The students completed the two tests on digital devices, guided by their teachers, within their natural school classroom groups during 2018-19. To enhance students' effort and motivation, the applications were infused into curricular learning activities, where students were encouraged to ask about problems and difficulties. During the applications, students did not ask their teachers any questions that might reflect difficulty in understanding the tests. The database was processed with SPSS 25 and the Factor program (Baglin, 2014) for exploratory and confirmatory factor analysis through polychoric correlations and the Robust Unweighted Least Squares (RULS) method, which relaxes the conditions on the score distribution of variables. Effect size statistics use a cut-off point (d = 0.30) to discriminate relevant differences.
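The effect size criterion above can be illustrated with a minimal sketch. It assumes that d denotes Cohen's d computed with the pooled sample standard deviation, and that a difference counts as relevant when |d| reaches the cut-off; the text states neither detail, and the function names are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardised mean difference (assumed: Cohen's d with pooled sample SD)."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def is_relevant(d, cutoff=0.30):
    """Assumed decision rule: a difference is relevant when |d| >= cutoff."""
    return abs(d) >= cutoff


# Made-up scores for two small groups, for illustration only.
d = cohens_d([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(round(d, 3), is_relevant(d))
```

The sign of d only reflects which group is listed first, so the decision rule uses its absolute value.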

Thinking challenge tests
There was no time limit for students to complete the tests, and the applications took between 25 and 50 min. Correct answers score one point, incorrect answers zero points, and no correction for random guessing was applied. The skill scores were computed by adding the scores of the items that belong to each skill, which are independent. The addition of the five skill scores makes up a test score (thinking total) that estimates students' global CRT competence and is dependent on the skill scores (Table 2). The different types of validity maintain a reciprocal influence and represent the various parts of a whole, so they are not mutually independent. The Thinking Challenge tests' validity relies on the quality of the CRT pools and tests examined by the authors, their agreement to choose the items that best matched the criteria, and the reviewed pilot results (Manassero-Mas & Vázquez-Alonso, 2020b). The Factor program computes several reliability statistics (Cronbach alpha, EAP, Omega, etc.).
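The scoring rule described above can be sketched in a few lines. This is only an illustration: the item names, the item-to-skill map and the answer key are hypothetical (the actual test items are not reproduced here), only three skills are shown for brevity, and treating an unanswered item as zero points is an assumption.

```python
from collections import defaultdict

# Hypothetical item-to-skill map and answer key (not the real test content).
ITEM_SKILL = {
    "q1": "classification",
    "q2": "classification",
    "q3": "sequence",
    "q4": "logical reasoning",
}
ANSWER_KEY = {"q1": "b", "q2": "a", "q3": "d", "q4": "c"}


def score_test(responses):
    """Return (skill_scores, thinking_total) for one student's responses."""
    skill_scores = defaultdict(int)
    for item, skill in ITEM_SKILL.items():
        # One point for a correct answer, zero otherwise; no correction is applied.
        # Assumption: a missing answer counts as incorrect.
        skill_scores[skill] += int(responses.get(item) == ANSWER_KEY[item])
    # The thinking total is the sum of the skill scores.
    return dict(skill_scores), sum(skill_scores.values())


student = {"q1": "b", "q2": "c", "q3": "d"}  # q4 left unanswered
skills, total = score_test(student)
print(skills, total)
```

Because the total is a plain sum of the skill scores, it is by construction dependent on them, which is why the skill scores and the total are treated differently in the later analyses.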

Nature of science scenario
The nine phrases describe different science definitions, and students rated each one on a 1-9 agreement scale. According to the experts' current views on NOS, a panel of qualified judges reached a 2/3-consensus to categorize each phrase within a 3-level scheme (Adequate, Plausible, Naive), which has been widely used in NOS assessment (Khishfe, 2012; Liang et al., 2008; Rubba et al., 1996). The scheme means the phrases express informed (Adequate), partially informed (Plausible), or uninformed (Naive) NOS knowledge (see Appendix). According to this scheme, an evaluation rubric transforms the students' direct ratings (1-9) into an index [−1 to +1], which is proportionally higher when the person agrees with an Adequate phrase, partially agrees with a Plausible phrase, or disagrees with a Naive phrase. All the rubric indices balance positive and negative scores, which are symmetrical for Adequate and Naïve phrases, but plausible indices are somewhat loaded toward agreement, as higher agreement would be expected. The index unifies the NOS measurements to make them homogeneous (positive indices mean informed conceptions), invariant (measurement independent of scenario/phrase/category), and standardised (all measures within the same interval [−1, +1]). The index proportionally values the adjustment of students' NOS knowledge to the current views of science: the higher (or lower) the index, the better (or worse) informed is their NOS knowledge (Vázquez et al., 2006).
Three category variables (Adequate, Plausible, and Naïve) are computed by averaging their phrase indices, which are mutually independent. The average of the three category variables computes a global NOS index representing the student's overall NOS knowledge (Global). The use of three categories aligns with test makers' recommendations to avoid using only positively worded phrases in order to elude the acquiescence bias, which harms reliability and validity (Suárez-Alvarez et al., 2018).
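As an illustration of how direct ratings become indices, category variables and a global index, consider the following minimal sketch. The mapping functions are assumptions made for the example, not the published rubric: Adequate and Naive phrases use a symmetric linear mapping, and Plausible phrases use a symmetric tent that peaks at the scale midpoint, whereas the actual rubric is described as somewhat loaded toward agreement. The phrase labels and category assignments are also hypothetical.

```python
from statistics import mean


def phrase_index(rating, category):
    """Map a 1-9 agreement rating to an index in [-1, +1] (illustrative mapping only)."""
    if not 1 <= rating <= 9:
        raise ValueError("rating must be between 1 and 9")
    if category == "Adequate":   # agreement is rewarded
        return (rating - 5) / 4
    if category == "Naive":      # disagreement is rewarded
        return (5 - rating) / 4
    if category == "Plausible":  # partial agreement is rewarded (assumed symmetric tent)
        return 1 - abs(rating - 5) / 2
    raise ValueError(f"unknown category: {category}")


def nos_indices(ratings, categories):
    """Average phrase indices per category, then average the three categories."""
    by_category = {"Adequate": [], "Plausible": [], "Naive": []}
    for phrase, rating in ratings.items():
        category = categories[phrase]
        by_category[category].append(phrase_index(rating, category))
    result = {cat: mean(values) for cat, values in by_category.items()}
    result["Global"] = mean(result[cat] for cat in ("Adequate", "Plausible", "Naive"))
    return result


# Hypothetical three-phrase example (the real scenario has nine phrases).
categories = {"P1": "Adequate", "P2": "Naive", "P3": "Plausible"}
ratings = {"P1": 9, "P2": 7, "P3": 5}
indices = nos_indices(ratings, categories)
print(indices)
```

In this example the student fully agrees with the Adequate phrase (index +1), tends to agree with the Naive phrase (index −0.5) and rates the Plausible phrase at the midpoint (index +1 under the assumed mapping), so positive indices always mean informed conceptions regardless of the category.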
The links between thinking skills and NOS are empirically explored through correlational methods and one-way ANOVA procedures of the variables of the Thinking Challenge test and science definitions.

Results
The results include the descriptive statistics of the target variables, twelve thinking variables (five skills plus thinking total for each group) and four variables of the science definitions (adequate, plausible, naive, and global), the analysis of the correlations, a linear regression analysis among these variables, and a comparison of thinking skills between NOS categorical groups through a one-way ANOVA.

Descriptive statistics
Most mean scores of the thinking variables fell near the midpoint of the scale range. Four skills (classification, problem-solving, causal explanation and sequence) scored above the midpoints of their ranges, whereas two variables (logical reasoning and decision) scored slightly below their midpoints. Overall, these results indicate the medium difficulty of the tests for the students, neither easy nor difficult, which means the CRT tests can be acceptable to assess young students' thinking skills (Table 3).
The EAP reliability indices of classification, problem-solving, sequence, parts (mainly figurative items) and thinking scales were excellent, good for the remaining scales, but poor for logical reasoning. Low reliability indicates a need for item revision and limited applicability (i.e., inappropriate for individual diagnosis), but is insufficient to reject the test for research purposes (U.S. Department of Labor, 1999). As test reliability critically depends on the number of items, increasing the length of the logical reasoning scale beyond its three current items should improve its reliability.
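The dependence of reliability on the number of items can be illustrated with the Spearman-Brown prophecy formula from classical test theory, which predicts the reliability of a scale lengthened by a factor k as k·r / (1 + (k − 1)·r). The sketch below is only a rough guide: it assumes the added items are parallel to the existing ones, the starting reliability of 0.40 is a hypothetical value (not a figure from this study), and the EAP reliabilities reported here come from a different estimation framework.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a scale is lengthened by `length_factor`.

    Assumes the added items are parallel to the existing ones.
    """
    if not 0 <= reliability < 1:
        raise ValueError("reliability must be in [0, 1)")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


current_items = 3           # length of the logical reasoning scale
current_reliability = 0.40  # hypothetical starting value
for n_items in (3, 6, 9):
    predicted = spearman_brown(current_reliability, n_items / current_items)
    print(n_items, round(predicted, 3))
```

Under these assumptions, the predicted reliability rises monotonically with the number of items, with diminishing gains for each added item.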
The descriptive results for the direct scores of the NOS variables (Table 4) showed a biased pattern toward agreement (phrase averages between 4.9 and 7.4), which suggests some acquiescence bias in spite of presenting varied phrases. The average indices obtained positive scores for the adequate category, slightly negative ones for the naïve category, and close-to-zero for the plausible phrases (the effect size of the differences concerning a zero score was low). The overall weighted average index for the whole sample (global variable) was close-to-zero and slightly positive, meaning that the students' overall epistemic conception of science definition was not significantly informed. The overall average index of Adequate phrases obtained the highest positive score for both samples of students, which means that most students agreed with the Adequate phrases (expressing informed beliefs about science). In contrast, the Naïve overall average index obtained the lowest negative mean score, indicating that the students agreed instead of disagreeing with phrases expressing uninformed views about science. The Plausible variable (phrases expressing partially informed beliefs, neither adequate nor naive) obtained a close-to-zero average score, meaning that the students' beliefs about these variables were far from informed. Overall, the students presented slightly informed views on Adequate phrases, close-to-zero average index scores (not informed views) for Plausible phrases and slightly uninformed views on Naive statements. Polychoric correlations among NOS direct scores computed through Factor attained good scores on all NOS items, indicating a unidimensional structure (except for Phrase I). The exploratory factor analysis (EFA) applied to phrase scores displayed a dominant eigenvalue, whose general factor had acceptable loadings for all phrases (only Phrase I had a low loading). The unidimensional model obtained fair statistics in the confirmatory factor analysis.
These results suggest one general factor underlying students' scores and justify a global score representing the variance of all the NOS phrases. The expected a posteriori (EAP) reliability scores for the entire NOS scale were good (Table 4).
The comparison of NOS scores between primary and secondary grades highlights that the four NOS variable scores on science definitions did not differ significantly between the two cohorts of students, despite the two-year separation. So, the educational impact of the two-year period on NOS seems almost null, given the close-to-zero differences in science definitions. This result could be expected, as NOS is not explicitly planned in Spanish science curricula and is not usually taught in the classroom.
Both cohorts answered the same anchoring CRT item (see Appendix), whose correct answer rate (27% primary; 33% secondary) suggests a slight improvement in CRT skills that sharply contrasts with the former NOS comparison. Summing up, although neither CRT nor NOS has been taught to Spanish students, developmental learning may increase CRT skills but not improve NOS knowledge. This reinforces the claim for explicit and reflective teaching of NOS, as implicit developmental maturation alone seems ineffective.

Correlations between nature of science and thinking skills
The empirical analysis of the hypothesised relationships between thinking skills and NOS epistemic variables (Adequate, Plausible, Naive) was performed through correlational methods (Pearson's bivariate correlation coefficients and linear regression analysis) and one-way analysis of variance.
The Pearson correlation coefficients revealed a pattern of the relationships between NOS and thinking skills (Table 5): all thinking skills positively correlated with the Adequate variable, and most were significant, except for prediction and logical reasoning in EP6, which were non-significant. By contrast, the correlations with the Naive and Plausible variables were overall non-significant, with some exceptions: first, the Plausible/problem-solving correlation in EP6 was significant (and negative); second, the correlations between Naïve and logical reasoning (positive in EP6) and also between Naïve and decision-making, logical reasoning and the thinking total score (negative in SCE8) were significant.
Thus, the noteworthy pattern for the NOS-CRT relationship showed that the Adequate variable positively correlated with all the thinking variables and was mostly statistically significant (83%); the highest positive correlations corresponded to problem-solving (EP6), sequence and parts-all (ES8), and the thinking total skills for both groups (p < 0.01). This pattern means that students with higher (lower) thinking skill scores expressed higher (lower) agreement with Adequate phrases.
The correlation pattern between thinking skills and the Plausible and Naive variables was mainly non-significant (75%). Only two correlations were significant in the EP6 group; the Plausible-problem-solving correlation was negative (higher scorers on problem-solving did not recognize the intermediate value of Plausible science definitions), whereas the Naïve-logical reasoning correlation was positive (higher scorers on logical reasoning tended to disagree with Naive science definitions). Three Naïve correlations were significant and negative in the secondary group (SCE8): parts-all, logical reasoning skills and thinking total.
Overall, the positive and significant correlation pattern of the Adequate variable was stronger than the mainly non-significant and somewhat negative Naive and Plausible correlation pattern.
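For readers who wish to reproduce this kind of analysis, the Pearson coefficient and its associated t statistic (with n − 2 degrees of freedom) can be computed as in the minimal sketch below. The data are invented for illustration and do not come from this study; the function names are hypothetical helpers, and the p-value lookup against the t distribution is left out.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


# Invented scores: thinking total and Adequate index for six students.
thinking_total = [10, 12, 15, 18, 20, 22]
adequate_index = [-0.2, 0.1, 0.0, 0.4, 0.3, 0.5]
r = pearson_r(thinking_total, adequate_index)
print(round(r, 3), round(t_statistic(r, len(thinking_total)), 3))
```

With large samples, even small correlations yield large t values, so a coefficient can be statistically significant while sharing only a few percent of variance, which is relevant when reading the significance pattern described above.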

Linear regression analysis between nature of science and thinking skills
Regression analysis (RA) compares the power of a set of variables to predict a dependent variable and the common variance. Two linear regression analyses were carried out to test the mutual contribution of the CRT and NOS variables. The first RA uses the NOS variables (Adequate, Plausible, Naive and Global) as the dependent variables, and the five independent thinking skills as predictors (Table 6). The second RA (Table 7) reversed the roles of the variables, thus establishing the thinking skills as the dependent variables and the three independent NOS variables (Adequate, Plausible and Naive) as the predictors. Collinearity tests were negative for all RAs through tolerance, variance inflation factor and condition index statistics.
The first RA (Table 6) showed that the NOS Adequate variable achieved the highest proportion of common variance with thinking skill predictors at both educational levels (4.2% in PE6 and 9.2% in SCE8), whereas the other two NOS variables achieved much lower levels of explained variance. In PE6, the most significant predictor skill of NOS was problem-solving, whereas the other predictor skills did not reach statistical significance in any case. In SCE8, the most significant predictors were three skills (sequencing, reasoning, and parts-all), whereas the remaining skills did not reach statistical significance (the predictors of the Plausible variable were negative).
The second RA (Table 7) showed that the Adequate variable achieved the greatest predictive power, as most thinking skills displayed statistically significant standardised beta coefficients at the two educational levels, while Plausible and Naïve variables had a much lower predictive power, and Plausible standardised coefficients were non-significant for any skill predictor. The common variance displayed a similar amount to the first analysis; the thinking total variable displayed the largest variance at both educational levels (4.8% PE6; 9.6% SCE8), and the problem-solving skills at PE6 (5.3%) and parts-all at SCE8 (7.1%).
In summary, the Adequate variable and the classification and problem-solving skills (PE6) and sequencing and parts-all skills (SCE8) were the variables that presented the largest standardised coefficients and statistical significance regarding the research question raised in this study about the positive relationship between NOS and thinking skills.
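The standardised beta coefficients and the proportion of common variance (R²) reported in these analyses can be obtained from correlations alone: the betas solve the linear system R_xx · β = r_xy, where R_xx holds the correlations among predictors and r_xy the correlations of each predictor with the dependent variable, and R² is the sum of β_j · r_xy,j. The following sketch uses a small pure-Python solver and invented correlation values; it is not the SPSS procedure used in the study, and the function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix (perfectly collinear predictors)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def standardized_regression(r_xx, r_xy):
    """Return (standardised betas, R squared) from predictor and criterion correlations."""
    betas = solve(r_xx, r_xy)
    r_squared = sum(beta * r for beta, r in zip(betas, r_xy))
    return betas, r_squared


# Invented values: two thinking skills predicting the Adequate index.
r_xx = [[1.0, 0.3],
        [0.3, 1.0]]   # correlation between the two predictors
r_xy = [0.25, 0.10]   # correlation of each predictor with the dependent variable
betas, r_squared = standardized_regression(r_xx, r_xy)
print([round(b, 3) for b in betas], round(r_squared, 3))
```

When predictors are correlated, a standardised beta can differ noticeably from the corresponding bivariate correlation, which is why the collinearity checks (tolerance, variance inflation factor, condition index) mentioned above matter for interpreting the coefficients.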

Analysis of variance between nature of science and thinking skills
Further exploration of the NOS-skills relationship was conducted through one-way between-groups analysis of variance. According to performance on the Adequate, Plausible and Naive variables, the participants were allocated to four percentile groups (low group: 0-25%; medium-low: 25-50%; medium-high: 50-75%; high: 75-100%), which made up the independent variable of the ANOVA for testing the differences in thinking skills (dependent variable) among these four groups.
The Adequate groups yielded a statistically significant main effect for the thinking total in primary [F(3, 429) = 7.745, p < 0.001] and secondary education [F(3, 343) = 2.607, p = 0.052]. The effect size of the differences in the thinking total scores between the high and low groups was large for the primary (d = 0.69) and secondary (d = 0.86) cohorts. Furthermore, comparison, classification, and problem-solving skills also replicated this pattern of large differences between high-low groups that supports the NOS/CRT positive relationship. However, prediction (p = 0.069) and logical reasoning (p = 0.504) did not display differences among the Adequate groups.
Post-hoc comparisons (Scheffé test) showed that the low group achieved significantly lower scores than the other three Adequate groups. The Adequate low group scores on thinking total, comparison, classification, and problem-solving skills were significantly lower than the scores of the other three groups, whereas the differences among the Adequate groups on prediction and logical reasoning scores were non-significant.
The main effect of the Plausible groups on the thinking total variable did not reach statistical significance for the primary [F(3, 430) = 1.805, p = 0.145] and secondary groups [F(3, 343) = 2.607, p = 0.052]. The effect size was small (d = −0.31 primary; d = −0.32 secondary) and negative (the thinking total mean score of the low group was higher than that of the high group). Post-hoc comparisons (Scheffé test) confirmed the trend, as they did not yield significant differences among the Plausible groups, although the mean score of the Plausible high group was lower than the other three groups. Exceptionally, problem-solving skill (primary) displayed a statistically significant difference between the Plausible high group (the lowest mean score) and the remaining three groups.
The main effect of Naive groups on the thinking total variable did not reach statistical significance [F(3, 430) = 1.075, p = 0.367 primary; F(3, 343) = 1.642, p = 0.179 secondary] and the effect size of the differences was small (d = 0.32 primary; d = −0.31 secondary). The opposite direction of the differences in primary (positive) and secondary education (negative) is noteworthy, as it means that the highest mean score corresponded to the Naive high group in primary (positive) or the Naive low group in secondary (negative). Post-hoc comparisons (Scheffé test) showed that there were no significant differences among the Naive groups. However, the ranking of mean scores across the Naive groups revealed differences between primary and secondary cohorts. Overall, the primary Naive groups followed the pattern of the Adequate variable (the low group displayed the lowest score), whereas the secondary Naive groups followed the pattern of the Plausible variable (the high group tended to display the lowest score).
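The two statistics used throughout this section, the one-way ANOVA F ratio across the four percentile groups and Cohen's d between the high and low groups, can be sketched as follows. The scores are invented, the groups are assumed to be already allocated by quartile, Cohen's d is computed with the pooled standard deviation (an assumption, as the variant used in the study is not stated here), and the p-value lookup and Scheffé post-hoc test are omitted.

```python
from statistics import mean, variance


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way between-groups ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


def cohens_d(a, b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Invented thinking total scores for the four Adequate percentile groups.
low, medium_low, medium_high, high = [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]
f_ratio, df_b, df_w = one_way_anova([low, medium_low, medium_high, high])
d_high_low = cohens_d(high, low)
print(f_ratio, (df_b, df_w), d_high_low)
```

With four groups the between-groups degrees of freedom are always 3, and the within-groups degrees of freedom equal the sample size minus 4, which matches the F(3, ...) notation used in the results above.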

Discussion
The empirical findings of this study quantify through correlations some significant and positive relationships between thinking skills and NOS beliefs about science definitions, as the main answer to the research question. However, the analysis shows a complex pattern of the relationship, which depends on the kind of the NOS variable under consideration: the NOS Adequate variable, which represents phrases expressing informed views on science, is positively and significantly related to most thinking skills, whereas the uninformed Naive and intermediate Plausible variables show a lower predictive power of thinking skills. Summing up, the positive significant CRT-NOS relationship is not displayed by all NOS variables, as it is limited to those NOS variables that express an Adequate view of science, while the other NOS variables do not significantly correlate with CRT skills.
The implications of this study for research are twofold. On the one hand, the variables of this study specifically operationalise the two constructs under investigation, namely, CRT skills and NOS knowledge, which has been a challenge throughout their mixed operationalisation in the reviewed research. On the other hand, via Pearson correlations and regression analysis, this study quantifies the amount of the common variance between specific CRT skills and specific NOS knowledge, which is significant in many cases. Both contributions improve the features of previous studies, as most of them investigated the relationship from varied methodological frameworks: some reported group comparison, fewer analysed correlations, and most of the latter used a diversity of variables, which often did not match either CRT skills or NOS variables. For instance, Vieira et al. (2011) correlated thinking skills with science literacy (not NOS) and reported Pearson correlations that were lower than the correlations obtained herein, even though they used a smaller sample, which favours higher correlations.
The findings reveal the complexity of the NOS-CRT relationship, which limits the positive and relevant relationship to the NOS Adequate variables about science definitions, but not to the Plausible or Naive conceptualizations, which mainly display non-significant and somewhat negative correlations. The positive relationship between thinking and Adequate science definitions is a remarkable finding, which empirically supports the hypothesis that better thinking skills involve better NOS knowledge and confirms the concomitant intuitions and claims of some studies about the importance of thinking skills for learning NOS epistemic topics (Erduran & Kaya, 2018; Ford & Yore, 2014; Simonneaux, 2014; Torres & Solbes, 2016; Yacoubian, 2015). The findings also contribute to establishing the limit of the significant relationship, which applies when the NOS is conveyed by informed statements (Adequate phrases) and does not apply to non-adequate NOS statements, which are a minority in the NOS literature, as most of it conveys informed statements on NOS (Cofré et al., 2019).
The implications of the collateral finding on the lack of differences in science definitions between primary and secondary cohorts deserve further comments. Obviously, the finding confirms that two educational years have little impact on improving Spanish students' understanding of science definitions; that is, NOS teaching seems ineffective and stagnant, probably due to poor curriculum development and the lack of teacher training and educational resources. Besides, the students' higher performance on adequate phrases than on plausible and naïve phrases also suggests that Spanish students may achieve some mild knowledge about the informed traits of science because they are implicitly displayed in teaching, textbooks and media. However, plausible and naïve knowledge is not usually available from those sources, as it requires explicit and reflective teaching, which Spanish students usually lack. Both findings suggest the need for further attention to misinformed NOS knowledge to invigorate explicit and reflective NOS teaching (Cofré et al., 2019; McDonald & McRobbie, 2012).
The unexpected non-significant/negative relationships between thinking and Plausible and Naive variables may need some elaboration due to the complexity of students' NOS conceptions. For instance, Bennássar et al. (2010) described the students' inconsistent agreements when rating opposite statements. Bogdan (2020) found that epistemic conceptions of science creativity did not relate to attitudes to science, and Khishfe (2012) reported complex relationships between epistemic aspects of science and decision-making about genetically modified organisms or the acceptance of the evolution theory (Cofré et al., 2017; Sinatra et al., 2003). Thus, a tentative interpretation of those paradoxical relationships is elaborated below.
Higher-thinking-skill students might develop better quality reflections that elicit more confident and higher scores on NOS phrases than lower-thinking-skill students. The latter tend toward less confident and low-quality reflection, which may elicit intermediate, less polarized scores. On average, this differential pattern explains the complex pattern of relationships between CRT and NOS variables. For the Adequate phrases (where the rubric assigns the best indices to the highest scores), higher-thinking students will achieve higher NOS indices than lower-thinking students, explaining the observed positive CRT-NOS correlations in the Adequate variables and the ANOVA results. On the other hand, when Naive and, especially, Plausible phrases are involved (which obtain their highest indices at low and intermediate scores, respectively), the differential response pattern would lead the lower-thinking students to achieve higher NOS indices than the higher-thinking students, thus shifting to the observed non-significant or negative correlations for Naive and Plausible phrases. In short, unconfident/confident and lower/higher quality reflection on NOS knowledge of the lower-/higher-thinking students would explain the shift from the positive and significant relationship of CRT-Adequate phrases to the non-significant correlations of Plausible and Naive phrases. This interpretation agrees with the striking finding of O'Brien et al. (2021) about a similar unexpected higher adherence to pseudoscientific claims in students with higher trust in science, which the authors attributed to the uncritical acceptance of any scientific content. Similarly, mastery of CRT skills is a desirable learning outcome, but it may make the students who master them vulnerable to positive polarization in science definitions. However, further research is needed to confirm the non-significant correlations and the interpretation of the differential response pattern.
As the previous reference suggests, the findings about the complex CRT-NOS relationship connect with some pending controversies about NOS teaching, namely, the marginalized attention paid to misinformed ideas or myths about science, in favour of the informed ideas, which reveal implicit and non-reflective NOS teaching, as obviously misinformed ideas contribute to triggering more reflection than informed ideas (Acevedo et al., 2007; McComas, 1996). The effect of this underexposure is students' under-training about misinformed NOS ideas, which may act as obstacles to authentic NOS epistemic learning, explaining the differences presented herein. The remedy to this situation and the unconfident bias may lie in devoting more time and explicit attention to uninformed or incomplete NOS claims through reflective teaching.
This study is determined and limited by the contextual conditions of its correlational methodology. First, the research question implied measurements of thinking skills and NOS knowledge; second, the young participants (12-14-year-olds) required measurement tools appropriate to this age; third, the thinking skill tests had to match the thinking skills demanded by the participant school; fourth, the selected NOS tool was conditioned by the students' age and the lack of appropriate NOS assessment tools. Thus, further suggestions to overcome these limitations are focused on expanding empirical support for the NOS-CRT relationship. On the one hand, some new NOS issues, such as additional epistemological and social aspects of science, should be explored to extend the representativeness of NOS knowledge. Similar reflections apply to including new skills to expand the scope of the CRT tool. Furthermore, the number of items of the logical reasoning scale should be increased to improve its reliability. Overall, the perennial debate between open-ended and closed formats is also noteworthy for future research, as quantitative methods could be complemented with qualitative methods (such as students' interviews and the like).
Finally, the main educational implication of this study is that students may need to master some competence in CRT skills to learn NOS knowledge or general epistemic knowledge. Conversely, mastery of CRT skills may foster learning NOS knowledge. Although this study focuses on epistemic NOS knowledge drawn from science education, educational research has elaborated in parallel the epistemic knowledge (EK) construct for general education (Hofer & Pintrich, 1997), which opens further prospective research developments for NOS comprehension and CRT skills. On the one hand, the study of the NOS-EK relationship may shed light on convergent epistemic teaching and learning, both in science and in general education. On the other hand, the importance of CRT skills for NOS, and vice versa, may help coordinate teaching NOS-EK issues (Erduran & Kaya, 2018; Ford & Yore, 2014; McDonald & McRobbie, 2012; Simonneaux, 2014). This joint perspective on NOS-EK elaboration may also provide new answers to two aspects: the mutual connections between CRT skills and NOS-EK issues and the EK assessment tools that may also contribute to advancing the evaluation of CRT skills and NOS.