1 Introduction

Experimental research indicates that intuitive but often misleading conceptions from childhood persist after schooling and continue to co-exist with scientifically correct concepts within individuals (Shtulman & Legare, 2020; Vosniadou et al., 2018). This co-existence becomes visible in increased reaction times and error rates when individuals evaluate the truth value of statements for which scientific concepts are incongruent with intuitive conceptions (Shtulman & Valcarcel, 2012).

A prominent theoretical explanation for observed increases in reaction times and error rates is that interference between co-activated intuitive conceptions and scientific concepts is resolved through inhibition (Vosniadou et al., 2018). Individuals presumably draw on their inhibition ability to suppress intuitive conceptions and thereby resolve the interference (Shtulman & Valcarcel, 2012). Consequently, the magnitude of reaction time and error rate differences in the paradigm by Shtulman & Valcarcel (2012) and similar ones (e.g., Babai et al., 2014; Vosniadou et al., 2018) should be inversely related to inhibition ability: The better an individual’s inhibition, the better they should be at resolving the occurring interference quickly and without errors (although we will note later that the exact relation with response times is a matter of debate).

A few studies that have examined this relation have found some positive results but also inconsistencies (Stricker et al., 2021; Vosniadou et al., 2018). Most relations appear moderate at best (Babai et al., 2014; Stricker et al., 2021); they are usually found with error rates but not with reaction times (Stricker et al., 2021), and they differ depending on the inhibition task (Stricker et al., 2021).

In the present study, we examine a prerequisite for establishing correlations between individual differences on tasks evoking interference between intuitive conceptions and scientific concepts and individual differences on tasks evoking inhibition. Namely, to produce reliable correlations, both kinds of tasks have to evoke stable (i.e., reliable) individual differences. This has recently been questioned for tasks evoking inhibition, with studies finding that many tasks assessing this construct produce few stable individual differences between persons (e.g., Enkavi et al., 2019).

In this study, we examine whether the same is true for tasks evoking interference between intuitive conceptions and scientific concepts. Specifically, we examine whether, and to which extent, the statement-verification task, one paradigmatic approach that is used to evoke conceptual interference (Shtulman & Valcarcel, 2012; Shtulman & Legare, 2020), produces stable individual differences.

The statement-verification task usually asks individuals to evaluate statements from different scientific topics (e.g., mechanics and genetics) and domains (e.g., Physics and Biology). We employ internal consistency analysis, and confirmatory as well as exploratory factor analysis, on data from the statement-verification task by Shtulman & Valcarcel (2012). We use this approach to examine whether the statement-verification paradigm induces stable individual differences, which might be a prerequisite for finding correlations with other tasks. In addition, such analysis might help inform theories about the processes that are triggered by tasks activating multiple concepts that co-exist within learners.

1.1 The Co-existence of Intuitive Conceptions and Scientific Concepts

Some views on conceptual change in Science learning portrayed conceptual change as a process during which the initial conceptions that learners bring into the classroom are replaced with new knowledge structures that represent the scientifically accepted concept (Duit & Treagust, 2012). For example, Thagard (1993) emphasized conceptual development during which a whole system of concepts and rules is replaced by a new system. Posner et al. (1982) described conceptual change as replacement or reorganization of central concepts (see also McCloskey, 1982). Carey (1988) described conceptual change either as a process during which a novice misconception is replaced by more expert beliefs or as change in core concepts themselves, in which these core concepts are replaced by a newer version (see also Gentner et al., 1997). Based on these views, instructional principles such as the induction of cognitive conflict have been proposed to make learners aware of the limitations of their intuitive conceptions and trigger them to replace their conceptions (Potvin, 2023).

Other theories described conceptual change as enrichment or as a reinterpretation of one’s knowledge (Duit & Treagust, 2012). For example, Vosniadou et al. (2008) described conceptual change as enrichment of one’s knowledge structure during which new aspects are typically added to the initial conception, yielding synthetic conceptions that evolve more and more towards the scientifically accepted concept. Ohlsson (2009) portrayed conceptual change as a process during which new conceptions are added to one’s knowledge base and learners evaluate and compare the utility of the initial and newly acquired conceptions. These views acknowledge that the full replacement and extinction of intuitive conceptions are not only impossible but also undesirable, as children’s ideas serve well, for example, for explaining phenomena in everyday life (Duit & Treagust, 2012). Based on these perspectives, the focus of science education shifts from eradicating intuitive conceptions to emphasizing the utility of different conceptions depending on the context (Duit & Treagust, 2012).

One issue for the latter views used to be the difficulty of showing convincingly that intuitive conceptions persist generally (not just in some cases) and can be re-activated after scientific conceptions have been acquired. This changed in the later 2000s to early 2010s, when experimental paradigms were introduced that arguably allowed triggering and demonstrating the persistent presence of intuitive conceptions (Babai & Amsterdamer, 2008; Babai et al., 2010; Shtulman & Valcarcel, 2012). These paradigms indicated that intuitive but often misleading conceptions from childhood persist after we have acquired scientifically correct concepts in science education. Seminal evidence in this regard was gathered via a statement-verification paradigm requiring learners to evaluate the truth value of statements relating to scientific phenomena under moderate time pressure (Shtulman & Valcarcel, 2012). For example, when an adult is asked whether coats produce heat, they will think about it and usually answer “No! Coats conserve body heat!” Children commonly hold the intuitive conception that coats do indeed produce heat, since putting on a coat appears to produce warmth that has not been there before. In a seminal experiment, Shtulman & Valcarcel (2012) showed that for this kind of question, for which the intuitive, childhood conception would say “yes” but the scientific concept would say “no” (another example: “being cold can make a person sick”), reaction times are longer and more errors occur than for questions where the intuitive and the scientific concept agree (e.g., “ovens produce heat”; “being sneezed on can make a person sick”). Specifically, within each concept (e.g., thermodynamics), reaction time and error rate differences were computed between congruent and incongruent statements; these differences were generally positive, indicating increased reaction times and error rates under incongruency. They showed this phenomenon across many domains and topics that are part of regular high school science curricula. This phenomenon has been replicated with this and similar paradigms across many topics and domains (e.g., Barlev et al., 2017; DeWolf & Vosniadou, 2015; Potvin & Cyr, 2017; Stricker et al., 2021).

This evidence purportedly supports the assumption that intuitive conceptions continue to exist after we have acquired scientific concepts during schooling. By comparing response times and error rates of congruent and incongruent statements, the resulting differences presumably indicate the effects of interference between the intuitive conception and the scientific concept on the incongruent statements. The implied presumption of cognitive-representational pluralism, that is, the co-existence of intuitive conceptions and scientific concepts relating to the same scientific phenomena within learners, has been readily integrated into theories of conceptual development in learners (Bélanger et al., 2023; Vosniadou, 2019).

1.2 Relations of Conceptual Interference with Inhibition Ability

A key question that arises when accepting the co-existence view is how learners deal with multiple conceptions that may be in conflict and thus interfere in learning situations or decision making. Some educational researchers believe that a key to handling interference is inhibition, the ability to suppress or ignore ongoing irrelevant thoughts and actions to achieve current goals (Brault Foisy et al., 2021; Coulanges et al., 2021; Dempster & Corkill, 1999; Mason & Zaccoletti, 2021; Vosniadou, 2014; Vosniadou et al., 2018). The underlying assumption is that individuals with better inhibition ability are generally better at managing cognitive interference, resulting in a higher likelihood that they can suppress their intuitive conceptions and go with the correct answer indicated by the scientific concept, as well as being quicker in handling the interference (Shtulman & Valcarcel, 2012; Vosniadou et al., 2018). Identifying such a key process in handling interference would be a major step in understanding how individuals manage to overcome interference between intuitive conceptions and scientific concepts. Some researchers appear to take the role of inhibition in managing multiple conflicting conceptions for granted, discussing evidence in favor of this presupposition and describing it as a central ability in this regard (e.g., Potvin, 2023; Potvin & Cyr, 2017; Vosniadou et al., 2018). In our view, the evidence has been overstated and we need to re-evaluate the theoretical and empirical basis for such claims.

Some studies have tried to test the involvement of inhibition in handling conceptual interference by relating individual differences in the statement-verification task and similar tasks to individual differences in tasks requiring inhibition (Babai et al., 2014; Stricker et al., 2021; Vosniadou et al., 2018). So far, a few studies indicate a moderate statistical relationship of individuals’ level of inhibition ability with their ability to handle tasks in which interference between intuitive conceptions and scientific concepts occurs (e.g., Vosniadou et al., 2018). However, the evidence is inconsistent.

Babai et al. (2014) administered a digit cancellation test to ninth-graders and related achievement on this task to achievement on a perimeters task with intuitive and counterintuitive trials. Although the authors found a correlation indicating that effects of interference were diminished in learners with better achievement on digit cancellation, the digit cancellation test is not a test of inhibition but of sustained attention, undermining strong interpretations (Lezak et al., 2004). Vosniadou et al. (2018) found a similar correlation, indicating diminished effects of interference for third- and fifth-graders with better performance on a Stroop task. The authors only analyzed incongruent Stroop trials instead of the typical approach of analyzing differences between congruent and incongruent trials, again undermining interpretations (Draheim et al., 2019). Stricker et al. (2021) examined the relation of inhibition as measured by a picture-word task with reaction time and error rate differences in the statement-verification task by Shtulman & Valcarcel (2012) employing mathematics statements. The authors found no relation with reaction times and only a small, non-significant relation with error rates.

Overall, only one study appears to be reliably interpretable (Stricker et al., 2021), and that study found the relation between inhibition and interference effects on the statement-verification task to be negligible. In stark contrast to further neurological (Brault Foisy et al., 2015) and experimental (Babai et al., 2012) literature documenting a role of inhibition in handling conceptual interference, the overall correlational evidence using measures of individual differences appears weak and inconclusive. This raises the question of which aspects of inhibition tasks and tasks evoking conceptual interference might contribute to this inconsistent picture.

1.3 Individual Differences in Tasks of Attentional Control

Inhibition is sometimes subsumed under the broader umbrella term of attentional control (Schubert et al., 2022). Although the label inhibition is often used as if it referred to a unitary psychological process or skill (e.g., Brault Foisy et al., 2015; Vosniadou et al., 2018), psychometric research has found that modeling inhibition as a unitary construct, or even as multiple correlated constructs, that explains variation across multiple inhibition-related tasks is often impossible (Rey-Mermet et al., 2018). Relatedly, two recent findings have raised issues and spurred advances in research on attentional control: (a) a lack of correlations between different tasks meant to measure inhibition and broader attentional control and (b) a lack of reliable individual differences in tasks of inhibition and broader attentional control (Enkavi et al., 2019; Rey-Mermet et al., 2018; Rouder & Haaf, 2019).

The first result stems from a study by Rey-Mermet et al. (2018) in which the authors were unable to find a strong psychometric structure across 11 measures of inhibition in younger and older adults. The tasks showed little correlation with one another, resulting in weak explanatory value of latent factor models that try to trace correlations between measures back to a smaller number of latent constructs. Similarly, Rouder & Haaf (2019) found a lack of correlation between a Stroop task and a Flanker task, two paradigmatic tasks meant to measure inhibitory control.

In line with these findings, psychological and psychometric models often differentiate between different kinds of inhibition processes (Mason & Zaccoletti, 2021). Initial taxonomic work by Nigg (2000) distinguished between inhibition regarding motor interference and cognitive inhibition. In a review, Mason & Zaccoletti (2021) acknowledge differences between tasks that require participants to inhibit a motor response (e.g., pressing a button) or a semantic response (i.e., cognitively suppressing a meaning). A well-known psychometric model of inhibition differentiated between suppression of a prepotent response, resistance to distractor interference, and resistance to proactive interference (Friedman & Miyake, 2004). However, in the study by Rey-Mermet et al. (2018), assuming a similar psychometric structure explained little variance in the different inhibition tasks. Overall, there are various taxonomic models of inhibition, and despite using the same label, the manifold tasks that are often subsumed under this label should not be seen as measuring a unitary process or individual differences therein.

Looking back at the inhibition tasks used by Babai et al. (2014), Vosniadou et al. (2018), and Stricker et al. (2021), one can quickly see that these probably measured very different kinds of inhibition or attentional processes. In the digit cancellation test used by Babai et al. (2014), participants have to cross out specific digits from a list of digits. This requires sustained attention, but it is not clear whether this task really requires inhibition or just sustained recognition of a target stimulus. In the Stroop tasks used by Vosniadou et al. (2018), participants were for example required to indicate a color word’s ink color instead of the color that the word referred to. This task requires inhibition of a motor response (not pressing the button referring to the word, but instead the one that refers to the ink color) according to Nigg (2000), or, in the taxonomy of Friedman and Miyake (2004), inhibition of the prepotent response. In the picture-word task used by Stricker et al. (2021), participants had to indicate whether semantically unrelated words (congruent condition) or related lures (incongruent condition) referred to the same concept as images shown next to the words. This task likely required cognitive inhibition according to Nigg (2000) or resistance to distractor interference according to Friedman and Miyake (2004). Thus, inconsistent effects across these studies might indicate that some of the tasks tapped into an ability that is also activated in the statement-verification task, whereas others might not have done so.

Apart from theoretical considerations such as a lack of unity of inhibitory cognitive functions (Rouder & Haaf, 2019), one explanation for the apparent lack of correlations between tasks is that many inhibition tasks do not produce reliable individual differences. Although inhibition tasks have been used in many correlational and experimental studies for decades and have produced various reproducible effects, recent research has repeatedly found that individual differences on many inhibition tasks lack internal consistency and stability over time (e.g., Enkavi et al., 2019; Rouder & Haaf, 2019). Across the different trials of a task, there is often little internal consistency in individual differences, meaning that individuals who show better inhibition on some trials might be among those showing worse inhibition on later trials within the same testing session (Borgmann et al., 2007). Over longer periods of time, too, many measures related to inhibition show a lack of retest reliability (Enkavi et al., 2019). In other words, individuals who yield strong achievement on a measure of inhibition on one day might be among those yielding rather weak achievement a few months later (Enkavi et al., 2019).

As these results from research on inhibition and broader attentional control show, one reason for a lack of correlations between measures can be a lack of stable individual differences within the measures. This raises the question of the extent to which the statement-verification task actually produces stable individual differences and, in case it does so only to a limited degree, what this implies for research concerned with associations of this task with other measures, as well as for its theoretical interpretation.

1.4 The Present Study

In this study, we examine the degree to which the statement-verification task evokes stable individual differences across trials. To this end, we estimate internal consistency and employ factor analysis on the data from one of the largest data sets that have been obtained with the statement-verification task so far, namely, the one from the study by Shtulman & Valcarcel (2012). In their study, Shtulman & Valcarcel (2012) obtained data from a rather large sample of 150 college students. They also administered a large number of congruent and incongruent statements to these students, stemming from 50 different topics across 10 different domains.

If the retrieval- and decision-making processes triggered by the statement-verification task vary strongly in their structure and more basic constituent psychological processes across items, this might result in a lack of individual differences that can be reliably modeled and explained by other variables. Put differently, the cognitive process evoked by the statement-verification task might vary across items that require for example knowledge about different concepts that come from different domains. In this case, individuals who manage to work precisely (i.e., give the correct answer) or efficiently (i.e., provide a quick response) on statements referring to one concept might show worse performance on statements referring to other concepts. On the other hand, if for example a specific kind of inhibition process contributes substantially and similarly to resolving interference between intuitive and scientific concepts across all statements, then this should result in rather high internal consistency and in a visible and strong factor structure across statements. To examine these possibilities, we apply psychometric modeling to examine the following research questions:

  1. What is the psychometric structure behind the statement-verification task?

To examine this question, we first estimate internal consistency across statements covering the fifty topics and the different domains. We will group the original rather narrow 10 domains labeled by Shtulman & Valcarcel (2012) into the four broader domains of Mathematics, Physics, Biology, and Astronomy. We will estimate Cronbach’s alpha, as well as the more recent Omega coefficient (Dunn et al., 2014), to see whether differences between congruent and incongruent statements in reaction times and error rates produce stable individual differences across topics and domains. If we find high internal consistencies, implying a stable order of individual differences across statements referring to different concepts, then this would support the assumption that the statement-verification task invokes similar cognitive processes across different concepts. If we find low internal consistencies, then this would support the assumption that the cognitive process triggered by the statement-verification task differs across statements referring to different concepts and does not draw on highly stable inhibition processes.
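As a formal point of reference, and using the standard textbook definitions rather than any formulas specific to the original article, Cronbach's alpha for k items with item variances σ²ᵢ and total-score variance σ²_X, and the one-factor omega coefficient with loadings λᵢ and uniquenesses θᵢ, can be written as:

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right),
\qquad
\omega = \frac{\left(\sum_{i=1}^{k}\lambda_i\right)^2}{\left(\sum_{i=1}^{k}\lambda_i\right)^2 + \sum_{i=1}^{k}\theta_i}
```

Both coefficients increase with the amount of shared variance among the item-level difference scores, which is why they speak to the stability of individual differences across topics and domains.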

Since internal consistency informs us about the amount of common individual differences across statements but not necessarily about the underlying factor structure (Sijtsma, 2009), we will also apply confirmatory and exploratory factor analysis. Although factor analysis historically has been developed for research on individual differences (Spearman, 1904), in the last decades, researchers have started applying factor analysis to gain insight into the reliability, validity, and correlates of experimental tasks (e.g., Oberauer et al., 2000; Schmiedek et al., 2007). If there is stable inter-individual variance across the different items that is caused by similar cognitive processes, this should result in a visible factor structure in this analysis. Factor structure is based on the intercorrelations of reaction time or error rate differences across the different items (i.e., concepts). If there are intercorrelations of at least moderate magnitude, then it should be possible to extract a visible factor structure from those intercorrelations. If intercorrelations are small and do not follow a theoretically logical pattern (e.g., items from the same domain showing larger correlations than those from different domains), then no factor structure might arise. Since exploratory factor analysis might not work well in the case of low internal consistency, we will also apply confirmatory factor analysis that allows examining the validity and information value (i.e., reliability) of a predetermined factor structure (Brown & Moore, 2012).

We examine four theoretically grounded possibilities of factor structures that are depicted in Fig. 1 and compare their model fit:

Fig. 1 Visual representations of four psychometric models for the statement-verification task. Squared boxes indicate participants’ measured reaction time or error rate differences between congruent and incongruent statements for a concept. Round shapes indicate latent variables (factors) underlying these data

First, the factor structure of the statement-verification task could be one-dimensional (Fig. 1, model 1). This would indicate that intercorrelations between the different items follow a homogenous pattern. In other words, correlations exist between all items and can be well-captured by a single latent variable.

Second, the factor structure could be multidimensional and depend mostly on the content domain of the item (Fig. 1, model 2). For example, individuals who have little difficulty in handling items about a specific topic in Physics (e.g., how mass affects objects’ falling) might also show little difficulty with other Physics topics (e.g., thermodynamics of everyday objects). Whether a person has a lot of knowledge within one domain might, however, not necessarily indicate a lot of knowledge in another domain. This would result in more homogenous intercorrelations across items of the same domain (i.e., Physics, Mathematics, Biology, or Astronomy) than across those from different domains. Such patterns of intercorrelations would manifest in the best fit for a model that represents each domain with its own latent variable.

Third, the factor structure could encompass even more dimensions, namely one for each topic (Fig. 1, model 3). If individuals’ cognitive process varies so much across items that this is the case, we would need one latent variable for each of the 50 concepts that are covered in the task. Consequently, no latent variable could be modeled to capture the correlations between multiple items. This is indicated in Fig. 1 (model 3) by the dashed associations between items, representing low or absent correlations between some items. This reflects an unsystematic pattern that does not result in an identifiable factor structure.

Fourth, an unknown factor structure might be in play (Fig. 1, model 4). In this case, none of the pre-defined models fits, because they do not well-represent the actual empirical correlations in the data. To examine this possibility, we apply exploratory factor analysis. This approach extracts factors in an unsupervised, that is, purely data-driven manner. No specific structure is presupposed.

We will compare these four possibilities by first examining the overall fit of each of the four models individually, and then comparing the fit of all four models.

Since the factor structure might vary between reaction time and error rate differences, we also examine the second research question:

  2. Is the factor structure similar between reaction times and error rates?

So far, few explanations have been provided for why results sometimes disagree between reaction time and error rate differences on the statement-verification task. For example, Stricker et al. (2021) found a positive correlation of mathematical competence with error rate differences on mathematical items. However, they could not find such a correlation with reaction time differences, with non-significant rs below 0.16. One explanation for this finding might be that reaction times and error rates are differentially affected in the task. Error rates, for example, might be driven more strongly by holding the correct domain-specific content knowledge, whereas reaction times might depend more on general abilities such as inhibition. In this case, model 2 might be the best-fitting model for error rates, whereas a different model such as model 1 might fit better for reaction time differences.

2 Method

2.1 Sample

The data set used for this study is the original data set by Shtulman & Valcarcel (2012). The sample encompassed N = 150 undergraduate students from science- and non-science majors. For details on the sample, please see Shtulman & Valcarcel (2012).

2.2 Procedure

Shtulman & Valcarcel (2012) administered 200 statements encompassing 50 concepts from 10 domains (five concepts per domain) to the undergraduate students. The students had to decide, as quickly and as precisely as possible, whether each of the statements was true or false. For further details on the implementation of this task, please see Shtulman & Valcarcel (2012). The domains included astronomy, evolution, fractions, genetics, germs, matter, mechanics, physiology, thermodynamics, and waves. Topics within each domain included, for example, planets, stars, and the solar system within astronomy. A full list of the topics within each domain is provided in Shtulman & Valcarcel (2012). For the present study, the 10 domains are subsumed under the four more general and more clearly defined domains of Physics, Mathematics, Astronomy, and Biology.

2.3 Analysis

For the present analysis, we first computed the differences in reaction times and error rates between congruent and incongruent statements for each individual within each of the 50 topics. These data provided the basis for the factor analyses in the present study. The respective data files, as well as all model syntaxes and outputs, are provided in the supplementary materials under https://osf.io/cfwxq/.
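To make this preprocessing step concrete, the following sketch shows one way to compute such difference scores. It assumes hypothetical long-format trial data with columns participant, topic, congruency, rt, and correct; the file and column names are illustrative and not taken from the original materials.

```python
import pandas as pd

# Hypothetical long-format trial data (names are illustrative): one row per
# participant x statement, with columns participant, topic (one of the 50
# concepts), congruency ("congruent"/"incongruent"), rt (ms), correct (0/1).
trials = pd.read_csv("statement_verification_trials.csv")

# Mean reaction time and error rate per participant, topic, and congruency condition
agg = (trials
       .assign(error=lambda d: 1 - d["correct"])
       .groupby(["participant", "topic", "congruency"])
       .agg(mean_rt=("rt", "mean"), error_rate=("error", "mean"))
       .unstack("congruency"))

# Incongruent minus congruent differences within each topic; positive values
# indicate slower and more error-prone responding under incongruency
rt_diff = agg[("mean_rt", "incongruent")] - agg[("mean_rt", "congruent")]
er_diff = agg[("error_rate", "incongruent")] - agg[("error_rate", "congruent")]

# Persons-by-topics matrices that form the basis of the psychometric analyses
rt_matrix = rt_diff.unstack("topic")
er_matrix = er_diff.unstack("topic")
```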

We then estimated internal consistency across the 50 topics via Cronbach’s alpha and omega, the latter of which has been argued to be less biased in realistic data contexts (Dunn et al., 2014). Although recent methodological research suggests reporting omega (Dunn et al., 2014; Hayes & Coutts, 2020; McNeish, 2018), we report both alpha and omega, since alpha is probably better known to many researchers and there are also defenses of its use (Raykov & Marcoulides, 2019). We interpret the absolute magnitudes of the estimated coefficients instead of using cut-offs, which is in better accordance with our research aim. We also compute the average observed correlations for reaction time and error rate differences across the 50 topics.
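A minimal sketch of how these coefficients can be computed from a persons-by-topics matrix of difference scores (such as rt_matrix from the sketch above) is given below; it uses a one-factor model for omega and is a simplified stand-in for, not a reproduction of, the actual analysis scripts in the supplementary materials.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

def cronbach_alpha(X):
    """Cronbach's alpha for a persons-by-items matrix X."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)
    total_var = X.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def omega_total(X):
    """One-factor omega: (sum of loadings)^2 / ((sum of loadings)^2 + sum of uniquenesses)."""
    fa = FactorAnalysis(n_components=1).fit(np.asarray(X, dtype=float))
    loadings = fa.components_[0]        # loadings on the single factor
    uniquenesses = fa.noise_variance_   # residual (unique) variances
    return loadings.sum() ** 2 / (loadings.sum() ** 2 + uniquenesses.sum())

def average_correlations(X):
    """Mean raw and mean absolute off-diagonal correlations between items."""
    r = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    off_diag = r[np.triu_indices_from(r, k=1)]
    return off_diag.mean(), np.abs(off_diag).mean()

# Example usage with the reaction time difference matrix:
# alpha = cronbach_alpha(rt_matrix)
# omega = omega_total(rt_matrix)
# mean_r, mean_abs_r = average_correlations(rt_matrix)
```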

The factor analyses representing models 1 to 4 in Fig. 1 were estimated in the Mplus software package, Version 8.6 (Muthén & Muthén, 2021). We used Bayesian estimation to ensure model convergence given the rather high ratio of model parameters to sample size in our analysis. For estimation, we used four Markov chain Monte Carlo estimation chains with 20,000 draws per chain, the first half of which were treated as burn-in, with no thinning (Asparouhov & Muthén, 2010). For evaluation of model convergence, we inspected trace plots and posterior distributions and ensured potential scale reduction factors below 1.05 (Asparouhov & Muthén, 2010). We used default model priors apart from factor loadings, for which we set moderately broad priors of N ~ (0, 2), implying that 95% of prior mass was within the boundaries of [− 3.92; 3.92]. For evaluating fit of the factor analyses, we inspected the common fit statistics root mean square error of approximation (RMSEA), comparative fit index (CFI), and Tucker-Lewis index (TLI; Kline, 2015), as well as the Bayesian DIC for relative model comparisons (Spiegelhalter et al., 2014). There are many and oftentimes inconsistent recommendations regarding appropriate cut-offs for these fit indices (Greiff & Heene, 2017; Kline, 2015). The cut-offs that we set here for what to consider an appropriately fitting model consider influential papers on this topic (e.g., Hu & Bentler, 1999; Heene et al., 2011; Greiff & Heene, 2017). However, since we were more interested in comparing models than in judging their individual perfect fit to data, we judged a model as fitting sufficiently to warrant careful interpretation with slightly lower cut-offs than sometimes argued for. Specifically, we considered a model as fitting sufficiently if the RMSEA was below 0.10 and the CFI as well as the TLI were above 0.90. Note that lower RMSEA values indicate better fit, whereas the CFI and TLI range from 0 to 1 with higher values indicating better fit.
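For orientation only, the classical maximum-likelihood definitions of these indices are given below, with χ²_M and df_M denoting the target model, χ²_0 and df_0 the baseline (null) model, and N the sample size; the Bayesian analogues reported by Mplus are computed somewhat differently, so these formulas serve as a conceptual reference rather than the exact computations used here.

```latex
\mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2_M - df_M,\, 0)}{df_M\,(N-1)}}, \qquad
\mathrm{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\, 0)}{\max(\chi^2_0 - df_0,\, 0)}, \qquad
\mathrm{TLI} = \frac{\chi^2_0/df_0 - \chi^2_M/df_M}{\chi^2_0/df_0 - 1}
```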

After fitting and evaluating each model individually based on these fit statistics, we compared the fit statistics for the four different models, also considering which model showed the lowest DIC. Note that the last model (Fig. 1, model 4) required estimation of an exploratory factor analysis. For this analysis, we inspected a parallel analysis (see, e.g., Haslbeck & van Bork, 2022) to examine how many factors the data suggest, evaluating the fit statistics and resulting factor structure for the respective solutions. No absolute fit statistics are available for exploratory factor analysis with Bayesian estimation, but the DIC helps in comparing the fit of the best-fitting model from this analysis to the confirmatory models.
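The logic of a parallel analysis can be illustrated with the following sketch: eigenvalues of the observed correlation matrix are retained only as long as they exceed the corresponding eigenvalues obtained from random data of the same dimensions. This is a generic implementation under simplifying assumptions (complete data, percentile criterion on principal-component eigenvalues) and not the specific routine used for the reported analysis.

```python
import numpy as np

def parallel_analysis(X, n_sim=1000, percentile=95, seed=1):
    """Suggest a number of factors by comparing observed eigenvalues of the
    correlation matrix with eigenvalues from random normal data of equal size."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    sims = np.empty((n_sim, k))
    for s in range(n_sim):
        R = np.corrcoef(rng.standard_normal((n, k)), rowvar=False)
        sims[s] = np.sort(np.linalg.eigvalsh(R))[::-1]
    benchmark = np.percentile(sims, percentile, axis=0)
    # Retain factors up to the first eigenvalue that no longer exceeds the benchmark
    exceeds = obs > benchmark
    n_factors = int(np.argmax(~exceeds)) if not exceeds.all() else k
    return n_factors, obs, benchmark

# Example usage: n_factors, observed, benchmark = parallel_analysis(rt_matrix)
```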

3 Results

Descriptive statistics are provided in Shtulman & Valcarcel (2012). Results for the internal consistency analyses were as follows. For reaction times, Cronbach’s alpha was estimated at α = 0.60 and omega at ω = 0.61. For error rates, alpha was estimated at α = 0.60 and omega at ω = 0.60. Average intercorrelations between topics were 0.01 for reaction times and 0.02 for error rates. The correlation matrices showed some negative correlations between topics that caused these near-zero estimates of average intercorrelations. Taking absolute values of all correlations, thereby transforming the negative estimates into positive ones, resulted in average intercorrelations of 0.07 for reaction times and 0.08 for error rates. Overall, these estimates indicate moderate internal consistencies and low intercorrelations across topics.

The fit statistics of the four estimated factor models for reaction times, as well as for error rates, are provided in Table 1. Two central results are visible from this table. First, whereas all three confirmatory models fitted acceptably according to the pre-specified fit criterion for the RMSEA, the CFI and the TLI indicated bad fit for all models. This was true for reaction times as well as for error rates. Second, for reaction times as well as for error rates, model 1 (the unidimensional model) yielded the best model fit indices. Inspection of this model, however, revealed various negative factor loadings and many loadings that were negligible in size.

Table 1 Model fit indices for the four models depicted in Fig. 1 for reaction times and error rates

In the exploratory factor analysis, both for reaction times and for error rates, six factors were extracted according to results from the parallel analysis depicted in Fig. 2. The results from the exploratory factor analyses did not reveal any comprehensible factor structure. Concepts that loaded onto a factor did not share any conceptual or other visible similarities (see model outputs in the supplementary materials). The sum of the factor eigenvalues was moderately higher for reaction times than for error rates but low for both, with 29% explained variance for reaction times and 27% for error rates when extracting the first six factors. The DIC estimates for the exploratory factor analyses (Table 1) indicated that there was no exploratory data structure that could fit the data better than the confirmatory models. On the contrary, the exploratory analyses mostly yielded higher DIC estimates than the other models, indicating relatively worse fit.

Fig. 2 Results from parallel analyses for reaction times (left) and error rates (right)

Overall, both for reaction times and for error rates, no theoretically reasonable factor structure could be modeled—neither in a confirmatory nor in an exploratory manner, although the unidimensional models fitted best. In addition, the fit indices appeared moderately better for error rates than for reaction times.

4 Discussion

In the present study, we analyzed the factor structure of the statement-verification task by Shtulman & Valcarcel (2012). The results show that there are few stable individual differences across items, resulting in weak intercorrelations and a lack of visible factor structure. Individual differences in error rates appear to be slightly more stable than those in reaction times. In the following, we discuss two potential explanations for the general result of the pronounced absence of internal consistency and factor structure in this task. We also discuss the implications that each of the explanations would have for theoretical interpretations of the statement-verification task and for empirical research and the statistical modeling of data within this task. Afterwards, we will outline implications for research on cognitive pluralism beyond this specific task and the limitations of the present study.

4.1 Explanation 1: Varying Cognitive Processes Across Different Topics and Domains

One potential explanation for a lack of stable individual differences is that the constituents of the cognitive process involved in answering the different items differ a lot across the concepts and domains. This might be the case if a large part of the variation in reaction time and error rate differences between congruent and incongruent statements is not caused by variation in cognitive abilities (e.g., inhibition), but by less stable factors such as individuals’ content knowledge. Content knowledge may differ strongly in its level of expertise, and in its internal structure (Edelsbrunner et al., 2022), across domains and even across topics (i.e., concepts) within domains. In this case, the cognitive resource that individuals draw upon in the answer process, namely, their content knowledge, is highly variable.

This explanation is in accordance with the finding by Stricker et al. (2021) that mathematical competence, which relies strongly on domain-specific content knowledge in Mathematics, is a good predictor of individual differences on the statement-verification task, at least regarding error rates.

This finding is also in accordance with recent literature (Edelsbrunner et al., 2022; Stadler et al., 2021; Taber, 2018) arguing that content knowledge likely does not produce stable individual differences. Rather, knowledge is a construct for which we should generally assume that internal consistency, which is based on intercorrelations between knowledge pieces, is rather moderate. Note that moderate internal consistency, be it on the side of the predictor (e.g., mathematical knowledge or competence) or on the side of the dependent variable (e.g., reaction time or error rate differences in the statement-verification task), does not fully undermine the possibility that such constructs can explain a good share of the variation in one another. However, it changes the meta-theoretical properties of the constructs and the adequacy of different kinds of statistical models. Instead of assuming that the phenomena produced within the statement-verification and similar paradigms are well-represented as a stable trait in factor models, we suggest considering that these would be better represented as a composite variable that is composed of heterogeneous parts which nonetheless have a common function (Schuberth, in press). In accordance with Taber (2018), Stadler et al. (2021), and Edelsbrunner et al. (2022), modeled as a composite variable, the individual differences involved in retrieving scientific knowledge that may disagree with earlier intuitions would represent an index of different processes across multiple contexts, rather than a unitary cognitive process.

4.2 Explanation 2: Unstable Inhibition Processes Across Topics and Domains

Another explanation, related to the first one, is that the cognitive process across topics and domains differs not because inhibition is not part of this process, but because the inhibition process evoked in the statement-verification task is itself highly variable across statements. For example, the inhibition process evoked by the statement-verification task might be similar to the process captured by the Flanker task. In the Flanker task, individuals have to suppress visual stimuli that are incongruent with the target stimulus, requiring the handling of distractor interference (Verbruggen et al., 2004). Although the Flanker task produces very reliable experimental effects, its internal consistency, particularly for reaction time differences between congruent and incongruent trials, is commonly low (Draheim et al., 2021). Consequently, if an inhibition process similar to the handling of distractor interference is evoked in the statement-verification paradigm, the paradigm might produce weak internal consistencies despite similar inhibition processes being triggered across topics and domains. Put differently, the absence of stable individual differences on the statement-verification task indicates that if inhibition or other cognitive abilities play a role in this task, these must be abilities that themselves show little internal consistency.

4.3 Limitations and Implications for Future Research

Limitations of the present study are the limited number of participants in the data set, at least for psychometric purposes, and the fact that a single data set was used for analysis. For a lab-based experiment with a large number of trials, the number of participants was nevertheless large. This allowed the factor analyses to converge by means of Bayesian estimation, which can typically handle smaller sample sizes than frequentist estimation (Smid et al., 2020). Still, we suggest replicating the present results in future research and examining how internal consistencies and factor structures might differ between samples.

Beyond the theoretical implications discussed above, one implication of our findings for future research is that the temporal stability of the cognitive process involved in the statement-verification task should be examined. Our findings are in accordance with recent literature pointing out limited reliability of individual differences in cognitive tasks capturing rather basic cognitive processes (Draheim et al., 2019, 2021). The absence of stable individual differences that we have found across contexts (i.e., concepts) does not rule out the possibility that individual differences might be more stable over time. A similar finding has been made by Neubauer & Hofer (2022) for a situational judgment test: despite low internal consistency, the test showed high retest reliability over 2 weeks. However, if the statement-verification task relies strongly on content knowledge, then retest periods should be chosen during which individual differences therein are unlikely to change, to avoid artificial attenuation. If, despite little stable factor structure, retest reliability for the statement-verification task turns out to be at least moderate, then this would speak for interpreting the task as triggering a stable cognitive process at least to some extent. In accordance with Draheim et al. (2021), who argue for error rates as a more reliable source of information, the error rates in our analyses appeared to be slightly more reliable than reaction times. This raises the question of why this should be the case. A convincing theoretical account of differences in patterns between reaction times and error rates still appears to be missing.

In our view, the assumption that an intuitive conception must be suppressed to work with the scientific concept (e.g., Vosniadou et al., 2018) does not allow a precise prediction of what happens to reaction times during conceptual interference. If inhibition really plays a role in resolving the interference, then the prediction depends on the nature of the inhibition process in play. On the one hand, inhibition on the statement-verification task might be described as a dichotomous yes/no process in which the intuitive conception is either suppressed or not. From this perspective, if the suppression process is successfully triggered, then this would increase reaction time and lead to the correct response being given. If the suppression process is not successfully triggered, this would lead to a comparatively decreased reaction time and to giving the wrong response. On the other hand, resolving the interference could be achieved via decreasing the activation strength of the intuitive conception in comparison to the scientific concept. Better inhibition in this process might be described as decreasing the activation strength of the intuitive conception until it falls below a certain threshold that enables reasoning with the scientific concept to proceed successfully and faster. This would imply shorter reaction time differences and a higher likelihood of arriving at the correct response for individuals with better inhibition. We suggest trialing different inhibition tasks that trigger both kinds of processes. For example, an antisaccade- or Stroop-like task in which one either manages to suppress the intuitive reaction or not could be used to test the first possibility, and a more semantic picture-word task in which the activation of the intuitive semantic concept has to be decreased could be used to test the second possibility. These considerations and the available evidence, with Vosniadou et al. (2018) finding relations with Stroop-like tasks but Stricker et al. (2021) finding no relations with a picture-word task, might point towards a dichotomous suppression process (either managing to suppress the intuitive conception, or not) being triggered in the statement-verification task. However, the methodological issue in the study by Vosniadou et al. (2018), in which they only analyzed data from incongruent trials instead of difference scores, undermines such conclusions, and replications with more appropriate analytic approaches are necessary. To get a more complete picture of the cognitive process taking place, cognitive models such as the diffusion model might be employed to test predictions about the role of inhibition (Vandekerckhove et al., 2011).
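To illustrate how such model-based predictions could be made quantitative, consider a simplified sketch under standard diffusion-model assumptions (unbiased starting point, drift rate v, boundary separation a, within-trial noise s; none of these values stem from the present data): the expected decision time can be written as

```latex
E[\mathrm{DT}] = \frac{a}{2v}\cdot\frac{1 - e^{-va/s^{2}}}{1 + e^{-va/s^{2}}}
```

Under this formalization, an inhibition process that lowers the effective drift rate on incongruent statements predicts longer reaction times and more errors, whereas a process that raises the boundary separation predicts longer reaction times but fewer errors; fitting such parameters separately for congruent and incongruent statements would help distinguish these possibilities.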

A statistical option for dealing with low internal consistency would be the application of latent variable models that can correct for the implied large amount of measurement error. For example, structural equation modeling provides a tool for working on the level of measurement error-free constructs instead of mean scores or reaction times that are laden with error (White et al., 2022). At the same time, we caution against relying on latent variable models because of the strong theoretical package that they carry with them. Specifically, in a latent variable model, indicator variables of the underlying measurement error-free construct are assumed to be exchangeable (Edelsbrunner, 2022; see also Robitzsch & Lüdtke, 2014 for detailed discussion of when this is the case). In the statement-verification task, this would imply the assumption that all topics and domains are affected by the same latent source of individual differences. We believe that this assumption does not hold theoretically, particularly to the degree that mastery of domain-specific content knowledge really plays a role in the task. Still, future research might try correcting model estimates for measurement error by using latent variables. Results from such models could be compared to analytic approaches employing composites (Schuberth, in press) or mixed models (Rouder & Haaf, 2019) that handle measurement error in conceptually different ways.

Future research should generalize our findings beyond the statement-verification paradigm by Shtulman & Valcarcel (2012). For example, data from the paradigms by Babai & Amsterdamer (2008), Vosniadou et al. (2018), Potvin & Cyr (2017), Allaire-Duquette et al. (2021), and Stricker et al. (2021; these authors used the same paradigm but with different concepts only within the domain of Mathematics) could be re-analyzed to establish whether a factor structure can be obtained from any paradigm meant to evoke interference between intuitive conceptions and scientific concepts. If our results hold across different paradigms, this will indicate that low internal consistency and a lack of factor structure are general phenomena of tasks evoking conceptual interference. This would also meet criticisms of the task by Shtulman & Valcarcel (2012), for example that the congruent and incongruent statements in their task in some cases vary in complexity such that incongruent statements are prone to lead to increased reaction times independently of potential conceptual interference. From our perspective, such criticisms have theoretical value, but they fail to explain why increased reaction times and error rates occur across all 50 concepts in the task. Still, comparing and generalizing results across different paradigms is likely to result in new insights regarding the robustness of the phenomenon and its relation to cognitive abilities such as inhibition. In this regard, hierarchical statistical models, which allow parameters to vary across statements from different concepts and allow explaining this variation with covariates on the level of the concept or the learner, might be an informative approach (see, e.g., Vandekerckhove et al., 2011).

Beyond the present application, we would like to point towards the more general potential of psychometric modeling for educational research involving similar tasks. Tools such as factor analysis were originally developed for research that focuses on individual differences (Spearman, 1904). For this reason, factor-analytic techniques were initially not used in research that focused on quantifying cognitive phenomena within individuals rather than between individuals. However, in the last decades, the application of factor-analytic methods has yielded novel insights into the reliability, validity, and correlates of many tasks that used to be applied mostly with general and experimental, rather than differential, questions in mind (e.g., Oberauer et al., 2000; Schmiedek et al., 2007). Similarly, we would like to encourage educational researchers concerned with cognitive or typical lab-based research to also consider employing these techniques. We see the primary theoretical value of these methods in their potential to bridge research on individuals’ cognition with research on individual differences. This might be of particular interest to researchers concerned with conceptual knowledge and its development. Conceptual change research, which used to be concerned mostly with individuals’ cognitive processes that lead to knowledge restructuring, has moved more strongly towards group-based statistical analyses (e.g., Merz et al., 2016). These analyses do not allow testing whether effects of incongruence actually occur within individuals, but only group differences therein. We suggest applying factor analysis and related psychometric methods in future research to examine whether and how group-based findings generalize to individuals, and vice versa.