Characterizing belief bias in syllogistic reasoning: A hierarchical Bayesian meta-analysis of ROC data
Abstract
The belief-bias effect is one of the most-studied biases in reasoning. A recent study of the phenomenon using the signal detection theory (SDT) model called into question all theoretical accounts of belief bias by demonstrating that belief-based differences in the ability to discriminate between valid and invalid syllogisms may be an artifact stemming from the use of inappropriate linear measurement models such as analysis of variance (Dube et al., Psychological Review, 117(3), 831–863, 2010). The discrepancy between Dube et al.’s (2010) results and the previous three decades of work, together with the former’s methodological criticisms, suggests the need to revisit earlier results, this time collecting confidence-rating responses. Using a hierarchical Bayesian meta-analysis, we reanalyzed a corpus of 22 confidence-rating studies (N = 993). The results indicated that extensive replications using confidence-rating data are unnecessary, as the observed receiver operating characteristic functions are not systematically asymmetric. These results were subsequently corroborated by a novel experimental design based on SDT’s generalized area theorem. Although the meta-analysis confirms that believability does not influence discriminability unconditionally, it also confirms previous results showing that factors such as individual differences mediate the effect. The main point is that data from previous and future studies can be safely analyzed using appropriate hierarchical methods that do not require confidence ratings. More generally, our results set a new standard for analyzing data and evaluating theories in reasoning. Important methodological and theoretical considerations for future work on belief bias and related domains are discussed.
Keywords
Deductive reasoning, Syllogisms, Belief bias, Signal detection theory, Hierarchical Bayesian, Meta-analysis
All flowers have petals.
All roses have petals.
Therefore, all roses are flowers.
This syllogism is logically invalid, as the conclusion (i.e., the sentence beginning with “Therefore”) does not necessarily follow from the two premises, assuming the premises are true (i.e., the conclusion is possible, but not necessary). However, the fact that this syllogism’s conclusion states something consistent with real-world knowledge leads many individuals to endorse it as logically valid. More generally, syllogisms with believable conclusions are more often endorsed than structurally identical syllogisms that include unbelievable conclusions instead (e.g., “no roses are flowers”). At the heart of the belief-bias effect is the interplay between individuals’ attempts to rely on the rules of logic and their general tendency to incorporate prior beliefs into their judgments and inferences (e.g., Bransford & Johnson, 1972; Cherubini et al., 1998; Schyns & Oliva, 1999). Although a reliance on prior belief is thought to be desirable and adaptive in many circumstances (Skyrms, 2000), it can be detrimental in cases where the goal is to assess the form of the arguments (e.g., in a court of law). Moreover, beliefs are often misguided, and logical reasoning is necessary to determine if and when this is the case.
These detriments are likely to be far reaching in our lives, as highlighted by early work focusing on the social-psychological implications of belief bias (e.g., Feather, 1964; Kaufmann & Goldstein, 1967). Batson (1975), for example, found that presenting evidence that contradicts stated religious belief sometimes increases the intensity of belief. Motivated reasoning effects of this sort have been reported in hundreds of studies (Kunda, 1990), including, appropriately, on the Wason selection task (Dawson et al., 2002). Indeed, one of the foundational observations in the reasoning literature is the tendency for people to confirm hypotheses rather than disconfirm them (Wason, 1960, 1968; Wason & Evans, 1974), often referred to as confirmation bias (Nickerson, 1998) or attitude polarization (Lord et al., 1979). What makes belief bias notable is that, unlike in studies of motivated reasoning or attitude polarization, the beliefs that bias syllogistic reasoning are not of particular import to the reasoner (such as the “all roses are flowers” example above). Moreover, syllogistic reasoning offers a very clear logical standard by which to contrast the effect of belief bias. Thus, in a certain sense, developing a good account of belief bias in reasoning is foundational to understanding motivated reasoning and attitude polarization.
Theoretical accounts of belief bias
In the last three decades, several theories have been proposed to describe how exactly beliefs interact with reasoning processes (e.g., Dube et al., 2010; Evans et al., 1983, 2001; Klauer et al., 2000; Markovits & Nantel, 1989; Newstead et al., 1992; Oakhill & Johnson-Laird, 1985; Quayle & Ball, 2000). For example, according to the selective scrutiny account (Evans et al., 1983), individuals uncritically accept arguments with a believable conclusion, but reason more thoroughly when conclusions are unbelievable. In contrast, proponents of a misinterpreted necessity account (Evans et al., 1983; Markovits & Nantel, 1989; Newstead et al., 1992) argue that believability only plays a role after individuals have reached conclusions that are consistent with, but not necessitated by, the premises (as in the example above).
Alternatively, mental-model theory (Johnson-Laird, 1983; Oakhill & Johnson-Laird, 1985) proposes that individuals evaluate syllogisms by generating mental representations that incorporate the premises. When the conclusion is consistent with one of these representations, the syllogism tends to be perceived as valid. However, when the conclusion is seen as unbelievable, the individual is assumed to engage in the creation of alternative mental representations that attempt to refute the conclusion (i.e., counterexamples). Only when a model is found wherein the (unbelievable) conclusion is consistent with these alternative representations is the syllogism perceived to be valid.
Another account, transitive-chain theory (Guyote & Sternberg, 1981), proposes that reasoners encode set-subset relations between the terms of the syllogism inspired by the order in which said terms are encountered when reading the syllogism. These mental representations are then combined according to a set of matching rules with different degrees of exhaustiveness. The theory predicts that unbelievable contents add an additional burden to this information processing, leading to worse performance compared to syllogisms with believable contents.
Yet another account, selective processing theory (Evans et al., 2001), proposes that individuals use a conclusion-to-premises reasoning strategy. Participants are assumed to first evaluate the believability of the conclusion, after which they conduct a search for additional evidence. Believable conclusions trigger a search for confirmatory evidence, whereas unbelievable conclusions induce a disconfirmatory search. For valid problems the conclusion is consistent with all possible representations of the premises, so believability will not have a large effect on reasoning. By contrast, for indeterminately invalid problems a representation which is inconsistent with the premises can typically be found with a disconfirmatory search, leading to increased logical reasoning accuracy for unbelievable problems. Most recently, the model has been extended to predict that individual differences in thinking ability mediate these effects, such that more able thinkers are more likely to be influenced by their prior beliefs (Stupple et al., 2011; Trippas et al., 2013).
This brief description does not exhaust the many theoretical accounts proposed in the literature, each of them postulating distinct relationships between reasoning processes and prior beliefs (e.g., Newstead et al., 1992; Quayle & Ball, 2000; Polk & Newell, 1995; Thompson et al., 2003; for reviews see Dube et al., 2010; Klauer et al., 2000). However, irrespective of the precise interplay between beliefs and reasoning processes, a constant feature of these theories is that the ability to discriminate between logically valid and invalid syllogisms is predicted to be higher when conclusions are unbelievable (although the opposite prediction has also been made by transitive-chain theory). In sum, virtually all theories propose that beliefs have some effect on reasoning ability, the latter having been operationalized in terms of the ability to discriminate between valid and invalid syllogisms. In this manuscript we test if believability affects discriminability using a mathematical model based on signal detection theory. Before describing this model in detail, it is important to consider the motivation behind this quite prevalent assumption.
Table 1 The design of Evans et al. (1983, Experiment 1), example syllogisms, and endorsement rates

Syllogism | Believable conclusion | Unbelievable conclusion
Valid | No cigarettes are inexpensive. Some addictive things are inexpensive. Therefore, some addictive things are not cigarettes. P(“valid”) = .92 | No addictive things are inexpensive. Some cigarettes are inexpensive. Therefore, some cigarettes are not addictive. P(“valid”) = .46
Invalid | No addictive things are inexpensive. Some cigarettes are inexpensive. Therefore, some addictive things are not cigarettes. P(“valid”) = .92 | No cigarettes are inexpensive. Some addictive things are inexpensive. Therefore, some cigarettes are not addictive. P(“valid”) = .08
The endorsement rates obtained with such a \(2\times 2\) design can be decomposed in terms of the contributions of logical validity (i.e., logic effect), conclusion believability (i.e., belief effect), and their interaction, as would be done with a linear model such as multiple regression. Taking Table 1 as an example, there is an effect of logical validity, with valid syllogisms being more strongly endorsed overall than their invalid counterparts ((.92 + .46)/2 − (.92 + .08)/2 > 0). There is also an effect of conclusion believability, as syllogisms with believable conclusions were endorsed at a much greater rate than syllogisms with unbelievable conclusions ((.92 + .92)/2 − (.46 + .08)/2 > 0). Finally, there is an interaction between validity and believability (Logic \(\times \) Belief interaction): the difference in endorsement rates between valid and invalid syllogisms is much smaller when conclusions are believable than when they are unbelievable ((.92 − .92) − (.46 − .08) = −.38). At face value, the negative interaction emerging from these differences suggests that individuals’ reasoning abilities are reduced when dealing with syllogisms involving believable conclusions (although the effect is typically interpreted the other way around, such that people reason better when syllogisms have unbelievable conclusions; e.g., Lord et al., 1979). Since Evans et al. (1983), the interaction found in Logic \(\times \) Belief experimental designs like the one illustrated in Table 1 has usually been referred to as the interaction index.
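The decomposition just described is simple arithmetic on the four cell means, and it can be sketched as follows. The snippet uses only the endorsement rates reported in Table 1; the function and variable names are illustrative and not part of any published analysis script.

```python
# Endorsement rates P("valid") from Table 1, keyed by (validity, believability).
rates = {
    ("valid", "believable"): 0.92,
    ("valid", "unbelievable"): 0.46,
    ("invalid", "believable"): 0.92,
    ("invalid", "unbelievable"): 0.08,
}

def decompose(r):
    """Return (logic effect, belief effect, interaction index) for a 2x2 design."""
    vb, vu = r[("valid", "believable")], r[("valid", "unbelievable")]
    ib, iu = r[("invalid", "believable")], r[("invalid", "unbelievable")]
    logic = (vb + vu) / 2 - (ib + iu) / 2    # valid minus invalid
    belief = (vb + ib) / 2 - (vu + iu) / 2   # believable minus unbelievable
    interaction = (vb - ib) - (vu - iu)      # Logic x Belief interaction index
    return logic, belief, interaction

logic_effect, belief_effect, interaction_index = decompose(rates)
print(round(logic_effect, 2), round(belief_effect, 2), round(interaction_index, 2))
```

Both main effects come out positive and the interaction index equals −.38, matching the values given in the text.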
Overall, these results suggest three things: First, that individuals can discriminate valid from invalid arguments, albeit imperfectly (i.e., individuals can engage in deductive reasoning). Second, that people are more likely to endorse syllogisms as valid if their conclusions are believable (i.e., consistent with real-world knowledge) than if they are not. Third, that people are more likely to discriminate between logically valid and invalid conclusions when those conclusions are unbelievable. In contrast with the main effects of logical validity and believability, which are not particularly surprising from a theoretical point of view (Evans & Stanovich, 2013), the Logic \(\times \) Belief interaction has been the focus of many research endeavors and is considered to be a basic datum that theories of the belief bias need to explain in order to be viable (Ball et al., 2006; Evans & Curtis-Holmes, 2005; Morley et al., 2004; Newstead et al., 1992; Quayle & Ball, 2000; Shynkaruk & Thompson, 2006; Stupple & Ball, 2008; Thompson et al., 2003; Roberts & Sykes, 2003).
Researchers’ reliance on the interaction index to gauge changes in reasoning abilities was the target of extensive criticisms by Klauer et al. (2000) and Dube et al. (2010). Both Klauer et al. and Dube et al. demonstrated that the linear-model-based approach used to derive the interaction index hinges on questionable assumptions regarding the way endorsement rates for valid and invalid syllogisms relate to each other. They argued that any analysis of the belief-bias effect rests upon some theoretical measurement model whose core assumptions need to be checked before any interpretation of the results can be safely made. Using extended experimental designs that go beyond the traditional Logic \(\times \) Belief design (e.g., introducing response-bias manipulations, payoff matrices, the use of confidence-rating scales) and including extensive model-validation tests, Klauer et al. and Dube et al. showed that the assumptions underlying the linear-model-based approach are incorrect, raising doubts about studies that take the interaction index as a direct measure of change in reasoning abilities. But whereas Klauer et al.’s results were still in line with the notion that conclusion believability affects the ability to discriminate between valid and invalid syllogisms, Dube et al. (2010) argued that conclusion believability does not affect individuals’ discrimination abilities at all. Instead, their account suggests that conclusion believability affects only the general tendency towards endorsing syllogisms as valid (irrespective of their logical status). Dube et al.’s results are therefore at odds with most theories of deductive reasoning (but see Klauer & Kellen, 2011 and the response by Dube et al., 2012).^{1}
The results of Dube et al. (2010) can be interpreted as calling for the establishment of a new standard for methodological and statistical practices in the domain of syllogistic reasoning and deductive reasoning more generally (Heit & Rotello, 2014). Simply put, the use of flawed reasoning indices should be abandoned in favor of extended experimental designs that allow for the testing of the assumptions underlying the data analysis method. Specifically, their simulation and experimental results suggest moving from requesting binary judgments of validity to the use of experimental designs that request participants to report their judgments using a confidence-rating scale (e.g., a six-point scale from 1: very sure invalid to 6: very sure valid). These data can then be used to obtain receiver operating characteristic (ROC) functions and fit signal detection theory (SDT), a prominent measurement model in the literature that has been successfully applied in many domains (e.g., memory, perception; for introductions, see Green & Swets, 1966; Kellen & Klauer, 2018; Macmillan & Creelman, 2005). The parameter estimates provided by the SDT model can inform us on the exact nature of the observed differences in endorsement rates. Although experimental data from previous studies could potentially be reanalyzed with a version of SDT that does not require confidence ratings (known as the equal variance SDT model), there is evidence from simulations suggesting that reliance on this simpler version of SDT would hardly represent an improvement over the interaction index (Heit & Rotello, 2014): a more extensive version of SDT, known as the unequal variance SDT model, appears to be necessary.^{2}

Can the equal variance SDT model provide a sensible account of the data, dismissing the need for extended experimental designs?

Does the believability of conclusions affect people’s ability to discriminate between valid and invalid syllogisms?
In addition to these main questions, we will also briefly revisit the evidence for the role of individual differences in belief bias for a subset of the data for which this information is available. Our results, discussed below, show that the confidence-rating data are very much in line with the predictions made by the equal variance SDT model, which can be applied without confidence ratings, suggesting that previously published belief bias studies can be reanalyzed using a probit or logit regression. The results also suggest that, despite the heterogeneity found among participants and stimuli, the believability of conclusions does not generally affect people’s ability to discriminate between valid and invalid syllogisms when considered across the entire corpus, partially confirming Dube et al.’s (2010) original account. However, a closer inspection using individual covariates suggests a relationship between people’s reasoning abilities and the way they are affected by beliefs, as suggested by Trippas et al. (2013, 2014, 2015). Altogether, these results suggest that syllogistic reasoning should be analyzed using hierarchical statistical methods together with additional individual covariates. In contrast, the routine collection of confidence ratings with the aim of modeling data, while certainly a possibility, is by no means necessary.
The remainder of this manuscript is organized as follows: First, we will review some of the problems associated with traditional analyses of the belief-bias effect based on a linear model, followed by an introduction to SDT and the analysis of ROC data. We then turn to the risks associated with the aggregation of heterogeneous data across participants and stimuli and how they can be sidestepped through the use of hierarchical Bayesian methods. In addition to the meta-analysis, we report a series of validation checks that corroborate our findings. Next, we present data from a new experiment using a K-alternative forced choice task which corroborates the main conclusion from our meta-analysis. Finally, we discuss potential future applications for the data-analytic methods used here and theoretical implications for belief bias.
Implicit linear-model assumptions and SDT-based criticisms
The assumption that ROCs are linear (with slope 1) is questionable, given that the ROCs obtained across a wide range of domains tend to show a curvilinear shape (Green & Swets, 1966; Dube & Rotello, 2012; but see Kellen et al., 2013). The possibility of ROCs being curvilinear is problematic for the linear model given that it can misinterpret differences in response bias as differences in discriminability. For example, in the right panel of Fig. 1 we illustrate a case in which the discriminability for believable syllogisms is found to be lower than for unbelievable syllogisms (negative interaction index \(\beta _{I}\)), despite the fact that according to SDT (dashed curve) the observed ROC points can be understood as differing in terms of response bias alone. Moreover, potentially curvilinear ROC shapes are theoretically relevant given that they are considered a signature prediction of signal detection theory (SDT).
Signal detection theory
According to the SDT model, the validity of syllogisms is represented on a continuous latent-strength axis, which in the present context we will simply refer to as argument strength (Dube et al., 2010). The argument strength of a given syllogism can be seen as the output of a participant’s reasoning processes (e.g., Chater & Oaksford, 1999; Oaksford & Chater, 2007). A syllogism is endorsed as valid whenever its argument strength is larger than a response criterion \(\tau \). When the syllogism’s argument strength is smaller than the response criterion, the syllogism is deemed invalid. This response criterion is assumed to reflect an individual’s general bias towards endorsement: more lenient individuals will place the response criterion at lower argument-strength values than individuals who tend to be quite conservative in their endorsements. Different criteria have consequences for the number of correct and incorrect judgments that are made: for example, conservative criteria lead to fewer false alarms than their liberal counterparts but also lead to fewer hits.
If one assumes the parametrization in which \(\mu _{I} = 0\) and \(\sigma _{I} = 1\), the similarity between Eqs. 2 and 3 of the linear model and Eqs. 8 and 9 of the SDT model becomes obvious. Response criterion \(\tau \) plays the same role as the intercept \(\beta _{0}\), in that both determine the endorsement rate for invalid syllogisms. Meanwhile, the mean \(\mu _{V}\) plays the role of \(\beta _{L}\) by capturing the effect of Logic (L), i.e., a reflection of reasoning aptitude, with a value of 0 suggesting an inability to discriminate between valid and invalid arguments. From this standpoint, the differences between the linear model and the SDT model essentially boil down to the latter assuming a parameter \(\sigma _{V}\) that modulates how the response criterion \(\tau \) affects the hit rate, and the use of the nonlinear function \({\Phi }(\cdot )\), which maps the latent argument-strength values (the real line) onto manifest response probabilities (DeCarlo, 1998). Although these differences may seem minor or even pedantic, they are highly consequential, as they ultimately lead both models to yield rather distinct interpretations of the same data (see the right panel of Fig. 1). Figure 3 shows ROCs generated by the SDT model under different parameter values: as the ability to discriminate between valid and invalid syllogisms increases (e.g., \(\mu _{V}\) increases), so does the area under the ROC. Moreover, parameter \(\sigma _{V}\) affects the symmetry/asymmetry of the ROC relative to the negative diagonal, with ROCs only being symmetrical when \(\sigma _{V} = \sigma _{I}\). Note that all these ROCs are curvilinear, in contrast with the unit-slope linear ROCs predicted by the ANOVA model (compare with the left panel of Fig. 1).
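The two properties of the SDT ROC noted above (area growing with \(\mu _{V}\), symmetry only under equal variances) can be checked numerically. The following is a minimal sketch that assumes the standard Gaussian SDT form, with false-alarm rate \({\Phi }((\mu _{I} - \tau )/\sigma _{I})\), hit rate \({\Phi }((\mu _{V} - \tau )/\sigma _{V})\), and the normalization \(\mu _{I} = 0\), \(\sigma _{I} = 1\); the function names and parameter values are hypothetical and chosen for illustration only.

```python
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution

def roc_point(tau, mu_v, sigma_v=1.0, mu_i=0.0, sigma_i=1.0):
    """Return (false-alarm rate, hit rate) for response criterion tau."""
    false_alarm = PHI.cdf((mu_i - tau) / sigma_i)
    hit = PHI.cdf((mu_v - tau) / sigma_v)
    return false_alarm, hit

def area_under_roc(mu_v, sigma_v=1.0, mu_i=0.0, sigma_i=1.0):
    """Closed-form area under a Gaussian ROC."""
    return PHI.cdf((mu_v - mu_i) / (sigma_v ** 2 + sigma_i ** 2) ** 0.5)

def mirror_gap(tau, mu_v, sigma_v):
    """Distance between the ROC and the mirror image of one of its points.

    Reflecting (f, h) across the negative diagonal gives (1 - h, 1 - f);
    the gap is zero when the ROC is symmetric (assumes mu_i = 0, sigma_i = 1).
    """
    f, h = roc_point(tau, mu_v, sigma_v)
    tau_mirror = -PHI.inv_cdf(1.0 - h)  # criterion whose false-alarm rate is 1 - h
    _, h_mirror = roc_point(tau_mirror, mu_v, sigma_v)
    return abs(h_mirror - (1.0 - f))

print(area_under_roc(1.0), area_under_roc(2.0))               # area grows with mu_v
print(mirror_gap(0.5, 1.0, 1.0), mirror_gap(0.5, 1.0, 1.5))   # ~0 only if sigma_v = 1
```

Sweeping `tau` over a grid with `roc_point` traces out a full ROC of the kind the text attributes to Fig. 3.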
Dube et al. (2010) showed that the linear model can produce an inaccurate account of the data simply due to the mismatch between the model’s predictions and the observed ROC data. Specifically, if the ROCs are indeed curved as predicted by SDT, then the linear model is likely to incorrectly interpret these data as evidence for a difference in discrimination. This difference in discrimination would be captured by a statistically significant interaction index. For example, consider the right panel of Fig. 1, which illustrates a case where the hit and false-alarm rates observed across believability conditions all fall on the same curved ROC, a pattern indicating that these conditions only differ in terms of the response bias being imposed (i.e., these rates reflect the same ability to discriminate between valid and invalid syllogisms): the linear model cannot capture both ROC points in the same unit-slope line, which yields the erroneous conclusion that there is a difference in the level of valid/invalid discrimination for believable and unbelievable syllogisms (a difference captured by the interaction index \(\beta _{I}\)). Note that this erroneous conclusion does not vanish by simply collecting more data; in fact, additional data will only reinforce the conclusion, an aspect that can lead researchers to a false sense of reassurance. Rotello et al. (2015) discussed how researchers tend to be less critical of the interpretation of their measurements when they are replicated on a regular basis. Given that negative interaction indices are regularly found in syllogistic-reasoning studies, very few researchers have considered evaluating the measurement model that underlies this index (the exceptions are Dube et al., 2010; Klauer et al., 2000).
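To make this argument concrete, the sketch below generates hit and false-alarm rates for two believability conditions that share the same discriminability and differ only in response criterion, and then computes the interaction index. It assumes an equal variance Gaussian SDT model with \(\mu _{I} = 0\) and unit variances; the criterion values are hypothetical and were not taken from any of the studies discussed here.

```python
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution

def rates(tau, mu_v=1.0):
    """(hit, false-alarm) rates under equal variance SDT with mu_I = 0."""
    return PHI.cdf(mu_v - tau), PHI.cdf(-tau)

# Same discriminability (mu_v = 1) in both conditions; only the criterion differs.
hit_b, fa_b = rates(tau=-0.5)  # believable: lenient criterion (hypothetical)
hit_u, fa_u = rates(tau=0.5)   # unbelievable: stricter criterion (hypothetical)

# Linear-model interaction index: difference of hit minus false-alarm differences.
interaction_index = (hit_b - fa_b) - (hit_u - fa_u)

# SDT discriminability recovered from each pair of rates.
d_b = PHI.inv_cdf(hit_b) - PHI.inv_cdf(fa_b)
d_u = PHI.inv_cdf(hit_u) - PHI.inv_cdf(fa_u)

print(round(interaction_index, 3), round(d_b, 3), round(d_u, 3))
```

The interaction index is negative even though both conditions were generated with identical discriminability, which is the kind of artifact described above.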
SDT’s point measure d’: A more efficient, equally valid approach?
Like SDT, the simpler EVSDT model also predicts curvilinear ROCs; however, they are all constrained to be symmetrical with respect to the negative diagonal. This additional constraint raises questions regarding the suitability of EVSDT: do the EVSDT’s predictions match the ROC data? And if not, to what extent does this mismatch affect the characterization of the belief-bias effect? In other domains such as recognition memory and perception, ROCs have been found to be asymmetrical, with \(\sigma _{V} > \sigma _{I}\) (see Dube & Rotello, 2012; Starns et al., 2012). When applied to these asymmetric ROCs, \(d^{\prime }\) provides distorted results, with discriminability being overestimated in the presence of stricter response criteria, and underestimated for more lenient criteria (for an overview, see Verde et al., 2006). Similar results have been found in the case of syllogistic reasoning, with asymmetrical ROCs speaking strongly against the EVSDT model. Dube et al. (2010) found the restriction \(\sigma _{V} = \sigma _{I}\) to yield predictions that systematically mismatch the ROC data.
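The dependence of \(d^{\prime }\) on the response criterion under unequal variances can be illustrated with a short computation. The sketch assumes the usual definition of \(d^{\prime }\) as the difference between the z-transformed hit and false-alarm rates, and a Gaussian SDT model with \(\mu _{I} = 0\) and \(\sigma _{I} = 1\); the parameter values are hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution

def d_prime(hit, false_alarm):
    """Equal variance point measure: z(hit) - z(false alarm)."""
    return PHI.inv_cdf(hit) - PHI.inv_cdf(false_alarm)

def sdt_rates(tau, mu_v, sigma_v):
    """(hit, false-alarm) rates with mu_I = 0 and sigma_I = 1."""
    return PHI.cdf((mu_v - tau) / sigma_v), PHI.cdf(-tau)

MU_V = 1.0  # hypothetical discriminability, held fixed throughout

# Unequal variances (sigma_v = 1.5): d' changes with the criterion.
d_lenient = d_prime(*sdt_rates(tau=0.0, mu_v=MU_V, sigma_v=1.5))
d_strict = d_prime(*sdt_rates(tau=1.5, mu_v=MU_V, sigma_v=1.5))

# Equal variances (sigma_v = 1): d' recovers mu_v regardless of the criterion.
d_equal = d_prime(*sdt_rates(tau=1.5, mu_v=MU_V, sigma_v=1.0))

print(round(d_lenient, 3), round(d_strict, 3), round(d_equal, 3))
```

With \(\sigma _{V} > \sigma _{I}\), the stricter criterion yields the larger \(d^{\prime }\) although the underlying distributions are unchanged, whereas under equal variances the measure is criterion-free.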
These shortcomings were corroborated in a more comprehensive evaluation by Heit and Rotello (2014). They reported a simulation showing that, if anything, the use of \(d^{\prime }\) only amounts to a small improvement over the interaction index. Specifically, data were generated via a bootstrap procedure, and discrimination for syllogisms with believable and unbelievable conclusions was assessed with \(d^{\prime }\) and the interaction index. Both measures were found to be strongly correlated and very often reached the same incorrect conclusion. The only difference was that \(d^{\prime }\) led to incorrect conclusions slightly less often than the interaction index. Overall, the use of the EVSDT model and its measure \(d^{\prime }\) does not seem to constitute a reasonable solution for the study of the belief-bias effect. These results suggest that researchers need to rely on extended designs (e.g., confidence ratings) whenever possible (Heit & Rotello, 2014, p. 90). But as will be shown below, the dismissal of the EVSDT model and \(d^{\prime }\) is far from definitive. In fact, it is entirely possible that this dismissal is the by-product of an unjustified reliance on ROC data that aggregate responses across heterogeneous participants and stimuli.
The problem of aggregating data from heterogeneous sources
One of the challenges experimental psychologists regularly face is the sparseness of data at the level of individuals as well as stimuli. Typically, one can only get a small number of responses from each participant, only have a small set of stimuli available, and can only obtain one response per participant-stimulus pairing. In the end, only very little data is left to work with. A typical solution to this sparseness problem consists of aggregating data across stimuli or participants. Although previous work has shown that data aggregation is not without merits (Cohen et al., 2008), its use implies the assumption that there are no differences among participants or among stimuli. In the presence of heterogeneous participants and stimuli, this assumption can lead to a host of undesirable effects. One classic demonstration of the risks of data aggregation in the social sciences is Condorcet’s Paradox (Condorcet, 1785), which demonstrates how preferences (e.g., between political candidates) aggregated across individuals might not reflect properties that hold for any individual. In this specific case, it is shown that aggregated preferences often violate a fundamental property of rational preferences known as transitivity (e.g., if option A is preferred to B, and option B is preferred to C, then option A is preferred to C), even though all of the aggregated individual preferences were actually transitive (for a discussion, see Regenwetter et al., 2011).
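Condorcet’s Paradox can be reproduced with three hypothetical voters. In the sketch below each voter holds a strict (and therefore transitive) ranking of options A, B, and C, and preferences are aggregated by pairwise majority; the rankings are the textbook illustration and the names are chosen for this example only.

```python
# Each tuple is one voter's ranking, from most to least preferred.
rankings = [
    ("A", "B", "C"),
    ("B", "C", "A"),
    ("C", "A", "B"),
]

def majority_prefers(x, y, voters):
    """True if more than half of the voters rank option x above option y."""
    votes = sum(r.index(x) < r.index(y) for r in voters)
    return votes > len(voters) / 2

# Aggregate preference relation over every ordered pair of distinct options.
aggregate = {
    (x, y): majority_prefers(x, y, rankings)
    for x in "ABC" for y in "ABC" if x != y
}

# A over B, B over C, and yet C over A: the aggregate relation is cyclic.
cycle = aggregate[("A", "B")] and aggregate[("B", "C")] and aggregate[("C", "A")]
print(cycle)
```

Every individual ranking is transitive, yet the majority relation is not, so the aggregate displays a property that holds for none of the individuals.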
In the case of traditional data-analytic methods such as linear models, the aggregation of data coming from heterogeneous participants and stimuli often leads to distorted results and severely inflated Type I errors. These distortions can also compromise the replication and generalization of findings (for an overview, see Judd et al., 2012). Approaches that do not rely on aggregation, for instance analyzing the data for each participant individually prior to summarizing them, are also not ideal, given that they may seriously inflate the probability of Type II errors due to data sparseness. The problems associated with data aggregation and pure individual-level analysis have led to a growing reliance on statistical methods that rely exclusively on neither but strike a compromise between both, effectively establishing a new standard in terms of data analysis (e.g., Baayen et al., 2008; Barr et al., 2013; Snijders & Bosker, 2012). Some of these methods have been adopted in recent work on probabilistic and causal reasoning (e.g., Haigh et al., 2013; Rottman & Hastie, 2016; Singmann et al., 2016, 2014; Skovgaard-Olsen et al., 2016), but they have not been applied to the study of the measurement assumptions underlying belief bias. For example, for a very long time it was established in the literature that the effects of practice in cognitive and motor skills were better characterized by a power function than by an exponential function (Newell et al., 1981). However, this finding was based on functions aggregated across participants. Later simulation work showed that when aggregated across participants, exponential practice functions were better accounted for by a power function (Anderson & Tweney, 1997; Heathcote et al., 2000). In an analysis involving data from almost 500 participants, Heathcote et al. showed that nonaggregated data were better described by an exponential function, a result that demonstrates how a reliance on aggregate data can lead researchers astray for several decades. Another example can be found in the domain of cognitive neuroscience, where it is common practice to aggregate across multiple participants’ fMRI data. In contrast to the prevailing assumption in the field, individual patterns of brain activity are not exclusively driven by external or measurement noise, but are potentially linked to systematic interindividual differences in strategy use (Miller et al., 2002).
Aggregating across heterogeneous stimuli
When the researcher must aggregate across easy and hard syllogisms because they cannot be differentiated a priori, one would hope to obtain parameter estimates that are in line with the average of the distributions’ parameters, namely \(\mu _{V}= 2\) and \(\sigma _{V}= 1\). Note that this average would respect the fact that all distributions have the same variances, yielding symmetric ROCs. Unfortunately, the parameter estimates one obtains from aggregating across stimuli do not produce such a result. Instead, the parameter estimates obtained underestimate \(\mu _{V}\) and inflate \(\sigma _{V}\). The problem here is that the mixture of both distributions will have a greater standard deviation than the average of \(\sigma _{V,\text {easy}}\) and \(\sigma _{V,\text {hard}}\). In this particular example, data aggregation led to an asymmetric ROC (see the center panel of Fig. 5) with estimates \(\mu _{V}= 1.88\) and \(\sigma _{V} = 1.32\).^{4} Based on these estimates, a researcher would erroneously conclude that ROCs are asymmetric and that one is required to estimate \(\sigma _{V}\) (perhaps using a confidence-rating task) in order to accurately characterize the data. To make matters worse, these distortions are asymptotic in the sense that they would not vanish by simply having more data. On the contrary, more data would only reinforce the distorted results. These results show that a scenario in which the rejection of EVSDT is driven by the use of heterogeneous stimuli is far from unlikely, given that there is substantial variability in the propensity to accept different syllogistic structures all classified as similarly complex (Evans et al., 1999). The presence of such asymptotic distortions is particularly troubling given that it can lead researchers to dismiss a large body of work in favor of new studies involving extended experimental designs.
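The inflation of \(\sigma _{V}\) under aggregation can be illustrated with a noise-free calculation. The sketch below assumes that easy and hard valid syllogisms have means of 3 and 1 (values chosen so that their average is the \(\mu _{V}= 2\) mentioned above), that \(\mu _{I} = 0\) with all standard deviations equal to 1, and that both stimulus types are equally frequent. It estimates the parameters with a simple least-squares line through the z-transformed ROC rather than the fitting procedure behind the reported estimates, and the set of criteria is hypothetical, so it is not expected to reproduce the values 1.88 and 1.32.

```python
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution

MU_EASY, MU_HARD = 3.0, 1.0           # assumed means for the two stimulus types
criteria = [0.5, 1.0, 1.5, 2.0, 2.5]  # hypothetical confidence criteria

def aggregated_point(tau):
    """(false-alarm, hit) rates after pooling easy and hard syllogisms."""
    false_alarm = PHI.cdf(-tau)
    hit = 0.5 * PHI.cdf(MU_EASY - tau) + 0.5 * PHI.cdf(MU_HARD - tau)
    return false_alarm, hit

z_fa = [PHI.inv_cdf(aggregated_point(t)[0]) for t in criteria]
z_hit = [PHI.inv_cdf(aggregated_point(t)[1]) for t in criteria]

# Least-squares line z_hit = intercept + slope * z_fa.
n = len(criteria)
mean_x, mean_y = sum(z_fa) / n, sum(z_hit) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(z_fa, z_hit))
         / sum((x - mean_x) ** 2 for x in z_fa))
intercept = mean_y - slope * mean_x

# Under Gaussian SDT with sigma_I = 1: slope = 1 / sigma_V, intercept = mu_V / sigma_V.
sigma_v_hat = 1.0 / slope
mu_v_hat = intercept / slope
print(round(mu_v_hat, 2), round(sigma_v_hat, 2))
```

Although every generating distribution has unit variance, the pooled data imply a z-ROC slope below 1 and hence an estimate of \(\sigma _{V}\) above 1. Because the computation uses exact probabilities rather than sampled responses, the distortion cannot be attributed to sampling noise.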
Aggregating across heterogeneous participants
We now turn to two examples involving the aggregation of judgments coming from two heterogeneous participants, A and B. The first example is formally equivalent to the one just described in the subsection above (i.e., the left and center panels of Fig. 5 serve to illustrate it as well). Assume that participant A shows worse discriminability (\(\mu _{V,A} = 1\)) than participant B (\(\mu _{V,B} = 3\)), with everything else being equal (again, \(\mu _{0} = 1\), and \(\sigma _{V,\text {easy}} = \sigma _{V,\text {hard}} = \sigma _{I} = 1\)). Note that both participants’ ROCs are symmetrical. As in the case of heterogeneous stimuli, the aggregation of the data from these two individuals would lead to an asymmetric ROC and an inflated estimate of \(\sigma _{V}\) (again, 1.32). In this scenario, the fact that one participant performs better than the other is enough to distort the overall shape of the ROC. Once again, this possibility is far from unexpected in light of the fact that individual differences in reasoning ability are commonly found (Stanovich, 1999; Trippas et al., 2015).
The second example concerns differences in response bias, which can also produce distortions. For example, let us imagine two participants who have the same ability to discriminate between valid and invalid syllogisms, \(\mu _{V}= 1\), but differ in terms of their response biases. Specifically, let us assume that participant A relies on a conservative criterion \(\tau = 1.5\) (i.e., is less likely to endorse syllogisms), whereas participant B relies on the more lenient criterion \(\tau = 0\) (i.e., is more likely to endorse syllogisms). The hit and false-alarm rate pairs for these two participants are (.31, .07) and (.84, .50), respectively. The pair obtained when aggregating both pairs, (.57, .28), is associated with \(\mu _{V} = .76\), a value that is smaller than either individual’s discriminability. As shown in the right panel of Fig. 5, the concavity of the ROC function implies that the average of any two distinct hit and false-alarm pairs coming from a single function (i.e., with the same discriminability) will always result in a pair that falls below that function. When evaluating a single experimental condition (e.g., syllogisms with believable conclusions), the distortions caused by aggregating heterogeneous participants can lead to an underestimation of discriminability.
Such underestimation of discriminability is especially pernicious when different experimental conditions are used (e.g., syllogisms with believable versus unbelievable conclusions, or reasoning under a fixed time limit versus self-paced conditions, etc.), as it can create spurious differences or mask real ones. For example, individuals might be better at discriminating syllogisms with believable conclusions than their unbelievable counterparts (cf. Guyote & Sternberg, 1981). But if the interindividual variability in terms of the adopted response criteria is larger in the former than in the latter, then the resulting underestimation can mask the differences in discriminability. Alternatively, if discriminability is the same across two conditions, differences in terms of the interindividual variability of response criteria can introduce spurious differences in the estimates obtained with the aggregate data. It is possible that some inconsistencies found in the literature (e.g., Dube et al., 2010; Klauer et al., 2000) are driven by such aggregation artifacts. For instance, Trippas et al. (2013), who also employed the SDT model, observed no effect of believability on discriminability only for participants of lower cognitive ability, with higher-ability reasoners showing a more typical effect of beliefs on accuracy. This suggests that treating all participants as equivalent is perhaps not the best assumption.
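The response-bias example given above (two participants with \(\mu _{V}= 1\) and criteria of 1.5 and 0) can be checked numerically. The following Python sketch assumes an invalid-syllogism distribution N(0, 1) and unit variance for valid syllogisms, which is consistent with the hit and false-alarm rates reported in that example; the function names are hypothetical.

```python
from scipy.stats import norm


def rates(mu_v, tau, mu_i=0.0):
    """Hit and false-alarm rates for an equal-variance SDT observer with criterion tau."""
    hit = norm.sf(tau, loc=mu_v, scale=1.0)          # P(strength > tau | valid)
    false_alarm = norm.sf(tau, loc=mu_i, scale=1.0)  # P(strength > tau | invalid)
    return hit, false_alarm


def discriminability(hit, false_alarm):
    """Equal-variance estimate of mu_V from a single (hit, false-alarm) pair."""
    return norm.ppf(hit) - norm.ppf(false_alarm)


hit_a, fa_a = rates(mu_v=1.0, tau=1.5)  # participant A, conservative criterion
hit_b, fa_b = rates(mu_v=1.0, tau=0.0)  # participant B, lenient criterion

# Aggregation: average the two participants' response rates.
hit_avg = (hit_a + hit_b) / 2
fa_avg = (fa_a + fa_b) / 2

print(f"A: ({hit_a:.2f}, {fa_a:.2f}), mu_V = {discriminability(hit_a, fa_a):.2f}")
print(f"B: ({hit_b:.2f}, {fa_b:.2f}), mu_V = {discriminability(hit_b, fa_b):.2f}")
print(f"Aggregate: ({hit_avg:.2f}, {fa_avg:.2f}), "
      f"mu_V = {discriminability(hit_avg, fa_avg):.2f}")
```

Each participant individually yields \(\mu _{V}= 1\), whereas the averaged pair yields a smaller value, which is the underestimation described in the text.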
A hierarchical Bayesian metaanalytic approach
Fortunately, the problems associated with aggregation can be avoided by relying on hierarchical methods that take the heterogeneity at the participant and stimulus levels (logical structures in our case) into account (e.g., Baayen et al., 2008; Barr et al., 2013; Snijders & Bosker, 2012). Specifically, both participants and stimuli considered in the analyses are assumed to be random samples from higher group-level distributions, whose parameters are also estimated from the data. Note that when facing multiple studies, one can conceptualize each study as a random sample from a distribution of studies. Usually, each of these higher group-level distributions is assumed to be Gaussian with some mean and variance. In the case of participant-level differences, the mean of this group-level distribution captures the average individual parameter value whereas the variance expresses the variability observed across participants. An analogous interpretation holds for the group-level distributions from which stimuli are assumed to originate.
Our hierarchical extension of SDT was implemented in a Bayesian framework (Gelman et al., 2013; Carpenter et al., 2017). In a Bayesian framework, the information one has regarding the parameters is represented by probability distributions. We begin by establishing prior distributions that capture our current state of ignorance. These prior distributions are then updated in light of the data using Bayes’ theorem, resulting in posterior distributions that reflect a new state of knowledge (for an overview of hierarchical Bayesian approaches, see Lee & Wagenmakers, 2013; Rouder & Jun, 2005). The estimation of posterior parameter distributions can be conducted using Markov chain Monte Carlo methods (for an introduction, see Robert & Casella, 2009). In the present work, we employed Hamiltonian Monte Carlo (e.g., Monnahan et al., 2016) and relied on weakly informative or noninformative priors that imposed minimal constraints on the values taken on by the parameters. These prior constraints are quickly overwhelmed by the information present in the data.
The information captured by the posterior parameter distributions can be conveniently summarized by their respective means and 95% (highest-density) credible intervals. Each interval corresponds to the (smallest) region of values that includes the true parameter value with probability .95. Moreover, the overall quality of a model can be checked by comparing the observed data with the predictions based on the model’s posterior parameter distributions (Gelman & Shalizi, 2013). If the observed data deviate substantially from the predictions, then one can conclude that the model is failing to provide an adequate characterization.
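As an illustration of how a highest-density interval can be obtained from posterior samples, the following Python sketch searches for the narrowest window that contains the requested share of the sorted draws. This is a generic approach for unimodal posteriors, not necessarily the routine used in the present analyses; the function name and the simulated draws are hypothetical.

```python
import numpy as np


def highest_density_interval(samples, mass=0.95):
    """Narrowest interval containing `mass` of the samples (assumes a unimodal posterior)."""
    sorted_samples = np.sort(np.asarray(samples, dtype=float))
    n = sorted_samples.size
    window = int(np.ceil(mass * n))  # number of draws the interval must cover
    if window < 1 or window > n:
        raise ValueError("mass must lie in (0, 1] and samples must be non-empty")
    # Width of every candidate interval spanning `window` consecutive sorted draws.
    widths = sorted_samples[window - 1:] - sorted_samples[:n - window + 1]
    start = int(np.argmin(widths))
    return sorted_samples[start], sorted_samples[start + window - 1]


# Hypothetical posterior draws, used only to exercise the function.
rng = np.random.default_rng(1)
draws = rng.normal(loc=1.0, scale=0.1, size=10_000)
lower, upper = highest_density_interval(draws)
print(f"posterior mean = {draws.mean():.2f}, 95% HDI = [{lower:.2f}, {upper:.2f}]")
```

For a symmetric posterior the highest-density interval is close to the equal-tailed interval; the two differ when the posterior is skewed, in which case the highest-density interval is the shorter one.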
Hierarchical extension of signaldetection model
As previously discussed, the SDT model characterizes individuals’ responses in terms of latent strength distributions defined with means \(\mu \), standard deviations \(\sigma \), and response criteria \(\tau \). We will therefore introduce our hierarchical extension of SDT at the level of these parameters (Klauer, 2010; Rouder & Jun, 2005; Morey et al., 2008; Pratte & Rouder, 2011; Pratte et al., 2010). Because of the identifiability issues associated with SDT (see Footnote 3), we modeled believable and unbelievable syllogisms separately.
Description of hierarchical linear model parameters and super/subscripts
Parameter  Meaning 

\(\bar {\mu }\)  Grand mean 
χ  Study effect 
ξ  Person effect 
η  Item effect 
Super/Subscript  Meaning 
V/I  Valid/invalid 
h  Study 
p _{ h}  Participant in Study h 
s  Syllogistic forms 
Metaanalytic model
For ease of presentation, the formulas in the previous paragraph present a slight simplification of our actual model. For all displacement parameters, \({\chi ^{x}_{h}}\) (study-specific), \(\xi ^{x}_{p_{h}}\) (participant-level), and \({\eta ^{x}_{s}}\) (stimulus/item-specific), we also estimated the correlation among the deviations across the different SDT parameters x. Thus, all displacements are actually assumed to come from zero-centered multivariate Gaussian distributions with covariance matrices \(\mathbf {{\Sigma }}_{S}\), \(\mathbf {{\Sigma }}_{P}\), and \(\mathbf {{\Sigma }}_{I}\), respectively (Klauer, 2010). For the covariance matrices \(\mathbf {{\Sigma }}_{S}\) and \(\mathbf {{\Sigma }}_{P}\) the standard deviations are as described in the previous paragraph and we additionally estimated one correlation matrix for each covariance matrix. For \({\eta ^{x}_{s}}\) we estimated one standard deviation for each x and one correlation matrix. The complete model is presented in the Appendix. The covariance matrices capture different dependencies that could potentially be found across parameter estimates. For instance, the participant-level covariance matrix \(\mathbf {{\Sigma }}_{P}\) indicates how individual parameters, say \(\mu _{V}\) and \(\sigma _{V}\), covary across participants. The estimation of all these covariance matrices, which amounts to a so-called “maximal random-effects structure”, is strongly advised as it is known to improve the generalizability and accuracy of the hierarchical model’s account of the data (Barr et al., 2013). Specifically, the hierarchical structure of the model’s parameters allows us to more safely make generalizations from our parameters of interest. For example, the group-level means (e.g., \({\bar \sigma }_{V}\)) summarize the information that we have about the individuals, after factoring out their differences. These parameters then allow us to make general inferences regarding the population, such as whether \(\sigma _{V}\) is systematically greater than \(\sigma _{I}\), as currently claimed in the literature (Dube et al., 2010; Heit & Rotello, 2014).
The extension of this model to the case of a K-point confidence-rating paradigm follows exactly what is already described in Eqs. 11 and 12, with the specification of \(K-1\) ordered response criteria \(\tau _{h,p_{h},k}\) per participant. The use of a different set of criteria per participant allows the model to capture different response styles that people often manifest (Tourangeau et al., 2000). As previously mentioned, it is customary to fix the parameters of the invalid-syllogism distributions, but in the present case we decided to instead fix \(\tau _{h,p_{h},1}\) and \(\tau _{h,p_{h},{K-1}}\) to 0 and 1, respectively. This restriction, which affects neither the ability of the model to account for ROC data nor the interpretation of the parameters, implies that the mean and standard deviation parameters from all argument-strength distributions are freely estimated (for a similar approach, see Morey et al., 2008). The motivation behind the use of this particular set of parameter restrictions was that it provided a more convenient specification of the different sets of participant, stimulus, and group-level parameters and at the same time allowed for identical prior distributions (see below) for the two standard deviations \(\sigma _{V}\) and \(\sigma _{I}\), which are of interest here. Furthermore, we assumed that the remaining three response criteria per individual participant, \(\tau _{h,p_{h},2}\) to \(\tau _{h,p_{h},{K-2}}\), were each drawn from a separate group-level Gaussian distribution and then transformed onto the unit scale using the cumulative distribution function of the standard Gaussian distribution. The sampling was performed such that the three to-be-estimated criteria per individual participant were ordered.^{8}
In line with the literature (e.g., Dube et al., 2010; Trippas et al., 2013), we modeled the data for believable and unbelievable syllogisms separately using the same model. The reason for modeling these data separately is that SDT does not yield identifiable parameters (i.e., infinitely many sets of parameter values produce the exact same predictions; see Bamber & van Santen, 2000; Moran, 2016) when parameter restrictions are only applied to the parameters concerning one stimulus type (e.g., believable syllogisms) and everything else is left to be freely estimated (e.g., different response criteria for believable and unbelievable syllogisms). However, applying restrictions to each stimulus type while allowing criteria to vary freely between them is equivalent to fitting them separately (for detailed discussions, see Singmann, 2014; Wickens & Hirshman, 2000).
Metaanalysis of extant ROC data
Our analysis differs from regular meta-analyses (e.g., Borenstein et al., 2010) in two important ways. First, we obtained the raw (i.e., participant- and trial-level) data and performed our meta-analysis on these non-aggregated data. This has the benefit that all variability estimates are obtained directly from the data and not inferred from other statistical indices. Second, our meta-analysis is performed using a fully generative model; it allows us to use the obtained parameter estimates to generate new synthetic data for any part of the data corpus (e.g., for individual participants or studies). The data corpus and modeling scripts are available at: https://osf.io/8dfyv/.
Description of the data corpus
Study ID  N participants  N trials  Study 

1  44  16  Trippas et al., (2013), Exp. 1, complexsyllogism condition 
2  47  16  Trippas et al., (2013), Exp. 1, simplesyllogism condition 
3  44  16  Trippas (2013), Exp. 6, no time limit 
4  42  16  Trippas (2013), Exp. 6, 10s time limit 
5  32  16  Trippas (2013), Exp. 7, deductive instructions 
6  34  16  Trippas (2013), Exp. 7, weak instructions 
7  36  16  Trippas et al., (2013), Exp. 2, 10s time limit, IQ 
8  49  16  Trippas et al., (2013), Exp. 2, no time limit, IQ 
9  45  8  Trippas (unpublished), complexsyllogisms, internal replication 
10  38  16  Trippas et al., (2014), fluentfont condition 
11  38  16  Trippas et al., (2014), disfluentfont condition 
12  42  8  Nuobaraite (2013 dissertation), egodepletion 
13  24  8  Trippas (unpublished), complexsyllogisms, debias instructions 
14  191  16  Trippas et al., (2015), individual differences 
15  38  8  Dube et al., (2010), Exp. 2, complexsyllogisms 
16  21  16  Dube et al., (2010), Exp. 3, conservative condition 
17  24  16  Dube et al., (2010), Exp. 3, neutral condition 
18  27  16  Dube et al., (2010), Exp. 3, liberal condition 
19  45  8  Heit and Rotello (2014), Exp. 1, augmented instructions 
20  44  8  Heit and Rotello (2014), Exp. 1, standard instructions 
21  44  8  Heit and Rotello (2014), Exp. 2, conservative instructions 
22  44  8  Heit and Rotello (2014), Exp. 2, standard instructions 
In terms of stimulus differences, we considered the different forms that syllogisms can take on. A categorical syllogism is an argument which consists of three terms, denoted here by A, B, and C, which are combined in two premises to produce a conclusion. The two terms which are present in the conclusion, A and C, are referred to as the end terms. The term which is present in each premise is referred to as the middle term and is denoted B. For example, in the “rose syllogism” given earlier, A = roses, B = petals, C = flowers. The two premises and conclusion each include one of four quantifiers: universal affirmative (A; e.g., All A are B), universal negative (E; e.g., No A are B), particular affirmative (I; e.g., Some A are B), and particular negative (O; e.g., Some A are not B). The logical validity of a syllogistic structure is defined by its mood, its figure, and the direction of the terms in the conclusion. The mood is a description of which quantifiers occur in the syllogism. For instance, if the premises and the conclusion are preceded by the quantifiers “All”, “Some”, and “No”, respectively, then the syllogism’s mood is AIE. Given that a syllogism consists of three statements and that there are four possible quantifiers for each statement, there are 64 possible moods. The figure denotes how the terms in the premises are ordered. There are four possible figures: 1: (AB; BC), 2: (BA; CB), 3: (AB; CB), 4: (BA; BC).^{9} Finally, there are two possible conclusion directions: 1: (AC) and 2: (CA). Combining the 64 moods with the four figures and the two conclusion directions yields a total of 512 possible syllogisms, of which only 27 are logically valid (Evans et al., 1999). The combinations of form and figure in syllogisms can be conveniently coded by concatenating the two letters associated with the quantifiers of the premises, the number associated with the figure, the letter associated with the quantifier of the conclusion, and the direction of the conclusion. 
The “rose syllogism” used earlier as an example would be coded as AA3_A2: both premises and the conclusion start with the “All” quantifier, the syllogistic figure is 3, and the conclusion direction is 2 (from C to A). A complete list of all the syllogistic figures used in the reanalyzed studies and their respective codes is included in our supplemental material, which is hosted on the Open Science Framework (OSF) at https://osf.io/8dfyv/.
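The coding scheme just described can be made concrete with a short Python sketch that parses a code such as AA3_A2 into its components and counts the possible forms. The class and function names are hypothetical and are not taken from the modeling scripts on the OSF.

```python
import re
from typing import NamedTuple, Tuple

# Quantifier letters, figures, and conclusion directions as described in the text.
QUANTIFIERS = {"A": "All", "E": "No", "I": "Some", "O": "Some ... not"}
FIGURES = {1: ("AB", "BC"), 2: ("BA", "CB"), 3: ("AB", "CB"), 4: ("BA", "BC")}
DIRECTIONS = {1: "AC", 2: "CA"}


class SyllogismForm(NamedTuple):
    premise_quantifiers: Tuple[str, str]
    figure: int
    conclusion_quantifier: str
    direction: int


def parse_code(code: str) -> SyllogismForm:
    """Parse a form code: two premise letters, figure, underscore, conclusion letter, direction."""
    match = re.fullmatch(r"([AEIO])([AEIO])([1-4])_([AEIO])([12])", code)
    if match is None:
        raise ValueError(f"not a valid syllogism code: {code!r}")
    first, second, figure, conclusion, direction = match.groups()
    return SyllogismForm((first, second), int(figure), conclusion, int(direction))


# 64 moods x 4 figures x 2 conclusion directions.
N_FORMS = len(QUANTIFIERS) ** 3 * len(FIGURES) * len(DIRECTIONS)

rose = parse_code("AA3_A2")
print(rose)
print("premise term orders:", FIGURES[rose.figure],
      "| conclusion terms:", DIRECTIONS[rose.direction])
print("number of possible forms:", N_FORMS)
```

Parsing the codes in this way makes it straightforward to group trials by syllogistic form when stimulus-level effects are estimated.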
Results
With regards to the necessity of a hierarchical account, we inspected the posterior estimates of the variability parameters of the participant (\({\bar \sigma _{\xi ^{x}}}\)), stimulus (\(\sigma _{{\eta ^{x}_{s}}}\)), and studyeffects (υ) of the different SDT parameters. All of these variability parameters clearly deviated from zero (i.e., their 95% credible intervals do not include 0), indicating the presence of heterogeneity among participants, believable and unbelievable syllogisms, and studies. As discussed in detail by Smith and Batchelder (2008), the presence of such heterogeneity indicates the need for a hierarchical framework that does not rely on data aggregation.
In order to quantify the general degree of support for the EVSDT obtained from the posterior \(\sigma _{V}\) and \(\sigma _{I}\) estimates, we computed Bayes factors (BF; Kass & Raftery, 1995) that quantified the evidence in favor of EVSDT versus an unconstrained SDT model. In this specific case, the constrained EVSDT model was represented by the null hypothesis \(\mathcal {H}_{0}\) stating that the group-level \(\frac {\sigma _{V}}{\sigma _{I}}\) can take a small range of values, between .99 and 1.01, and the unconstrained model by an encompassing alternative hypothesis \(\mathcal {H}_{A}\) that imposed no such constraint.^{10} In typical settings, the use of Bayes factors requires the computation of marginal likelihoods for (at least) two models, which can be quite challenging (but see Gronau et al., 2017). But in this specific case, in which the hypotheses considered consist of nested ranges of admissible parameter values (specifically, the range of \(\frac {\sigma _{V}}{\sigma _{I}}\)), Bayes factors can be easily computed. As shown by Klugkist and Hoijtink (2007), the Bayes factor for the two nested hypotheses corresponds to a ratio of probabilities: the posterior probability that \(.99 < \frac {\sigma _{V}}{\sigma _{I}} < 1.01\), divided by its prior counterpart. The obtained Bayes factors were 17.28 and 11.84 for believable and unbelievable syllogisms, respectively, which indicates that the probability of \(\frac {\sigma _{V}}{\sigma _{I}}\) values very close to 1 was roughly 17 and 12 times greater after observing the data than before. According to the classification suggested by Vandekerckhove et al. (2015), this indicates strong support for \(\mathcal {H}_{0}\).
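The ratio-of-probabilities computation can be sketched in a few lines of Python. The sketch assumes that draws of the group-level ratio \(\frac {\sigma _{V}}{\sigma _{I}}\) are available under both the prior and the posterior of the encompassing model; the function name is hypothetical and the draws generated below are synthetic placeholders, not the distributions or results of the present meta-analysis.

```python
import numpy as np


def interval_bayes_factor(prior_draws, posterior_draws, lower=0.99, upper=1.01):
    """Bayes factor for H0: lower < ratio < upper against the encompassing model.

    Computed as posterior mass inside the interval divided by prior mass inside it.
    """
    prior_draws = np.asarray(prior_draws, dtype=float)
    posterior_draws = np.asarray(posterior_draws, dtype=float)
    prior_mass = np.mean((prior_draws > lower) & (prior_draws < upper))
    posterior_mass = np.mean((posterior_draws > lower) & (posterior_draws < upper))
    if prior_mass == 0:
        raise ValueError("no prior draws fall inside the interval; draw more samples")
    return posterior_mass / prior_mass


# Synthetic placeholder draws (assumptions for illustration only).
rng = np.random.default_rng(3)
prior_ratio = rng.lognormal(mean=0.0, sigma=1.0, size=200_000)
posterior_ratio = rng.normal(loc=1.05, scale=0.06, size=200_000)
print(f"BF in favor of H0: {interval_bayes_factor(prior_ratio, posterior_ratio):.1f}")
```

Because the interval is narrow, both probabilities are small, so a large number of draws is needed for the ratio to be estimated stably.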
Figure 10 also allows us to compare our results to the meta-analysis of Khemlani and Johnson-Laird (2012). In contrast to the data considered here, in which participants are presented with both premises and conclusion, they focused on data from the conclusion generation task. In this task participants are only provided with the premises and requested to create a possible conclusion or indicate that no conclusion follows. For the valid forms, our data are somewhat in line with their findings. The valid syllogisms that showed a clearly reduced discriminability with \(\eta ^{\mu }_{s_{v}} < 0\), EI4_O2 and EI1_O2, were also among the most difficult according to Khemlani and Johnson-Laird (2012). Out of the 64 syllogistic forms, their difficulty ranks (where 1 = easiest and 64 = most difficult) were 55 and 61, respectively. Interestingly, no such consistency can be found in the case of the invalid syllogisms: The two forms that clearly showed a reduced discriminability, OE4_O2 and EO4_O1, were relatively easy with ranks of 22 and 15, respectively. However, OE3_O1, which showed an increased discriminability in our study, \(\eta ^{\mu }_{s_{i}} > 0\), was even slightly more difficult in the generation task with a rank of 26. These results reinforce the notion that the conclusion evaluation task and the conclusion generation task do not appear to involve the exact same cognitive processes. This would appear to carry additional implications for the mental models approach, beyond its seemingly faulty prediction of an effect of belief on reasoning, since much of the data used to develop the mental models theory of conclusion evaluation tasks was obtained using the production task. Furthermore, the model’s assumption that evaluation is implicit production is also questioned by these results.
Validity checks
In this section, we will discuss different ways in which we attempted to corroborate our results. We relied on different approaches such as prior sensitivity analysis, assessing the impact of aggregation biases, and parameter recovery simulations. As discussed in detail below, all of the results support the conclusions from our metaanalysis.
Prior sensitivity analysis
Validity Checks
Model  Parameter / Derived Measure  

\(\bar {\mu }_{V}\) Believable  \(\bar {\mu }_{I}\) Believable  \(\bar {\mu }_{V}\) Unbelievable  \(\bar {\mu }_{I}\) Unbelievable  
Original  1.05 [.97, 1.14]  .62 [.54, .71]  .79 [.72, .86]  .32 [.25, .38] 
Alternative  1.05 [.97, 1.13]  .62 [.54, .70]  .78 [.72, .85]  .32 [.25, .39] 
No \({\eta ^{x}_{s}}\)  1.05 [.98, 1.14]  .63 [.56, .71]  .78 [.72, .85]  .33 [.27, .39] 
No \(\xi ^{x}\), no \({\eta ^{x}_{s}}\)  .98 [.92, 1.05]  .61 [.53, .68]  .76 [.69, .82]  .34 [.29, .40] 
\(\frac {\bar {\sigma }_{V}}{\bar {\sigma }_{I}} = 1.50\)  1.05 [.98, 1.12]  .61 [.52, .70]  .77 [.70, .84]  .32 [.27, .38] 
\(\bar {\sigma }_{V}\) Believable  \(\bar {\sigma }_{I}\) Believable  \(\bar {\sigma }_{V}\) Unbelievable  \(\bar {\sigma }_{I}\) Unbelievable  
Original  .50 [.46, .55]  .47 [.43, .53]  .51 [.45, .58]  .47 [.41, .54] 
Alternative  .50 [.46, .54]  .47 [.43, .52]  .51 [.45, .57]  .47 [.41, .53] 
No \({\eta ^{x}_{s}}\)  .50 [.46, .55]  .48 [.44, .53]  .52 [.46, .58]  .49 [.44, .55] 
No \(\xi ^{x}\), no \({\eta ^{x}_{s}}\)  .58 [.53, .62]  .56 [.52, .60]  .68 [.61, .76]  .56 [.52, .61] 
\(\frac {\bar {\sigma }_{V}}{\bar {\sigma }_{I}} = 1.50\)  .70 [.65, .76]  .48 [.43, .52]  .69 [.62, .76]  .47 [.41, .53] 
\(\frac {\bar {\sigma }_{V}}{\bar {\sigma }_{I}}\) Believable  \(\frac {\bar {\sigma }_{V}}{\bar {\sigma }_{I}}\) Unbelievable  \(d_{a}\) Believable  \(d_{a}\) Unbelievable  
Original  1.06 [.95, 1.17]  1.09 [.96, 1.23]  .62 [.46, .78]  .67 [.52, .83] 
Alternative  1.06 [.95, 1.17]  1.09 [.95, 1.23]  .63 [.47, .78]  .67 [.52, .83] 
No \({\eta ^{x}_{s}}\)  1.05 [.97, 1.12]  1.05 [.98, 1.13]  .60 [.47, .74]  .64 [.50, .78] 
No \(\xi ^{x}\), no \({\eta ^{x}_{s}}\)  1.03 [.99, 1.08]  1.21 [1.14, 1.28]  .46 [.35, .58]  .47 [.35, .59] 
\(\frac {\bar {\sigma }_{V}}{\bar {\sigma }_{I}} = 1.50\)  1.48 [1.35, 1.61]  1.45 [1.30, 1.62]  .51 [.39, .64]  .54 [.42, .67] 
Effects of data aggregation
In the first part of this report, we provided several theoretical and simulation-based arguments illustrating why data aggregation can lead to biased conclusions. We now address this question empirically by reanalyzing our data corpus with models in which we purposefully omitted potential sources of variability, such as stimuli or participants. Given these concerns, it is interesting to see the extent to which aggregation actually affects results. For example, Pratte et al. (2010) found, in the context of recognition memory, that aggregation biases did not ultimately affect the observation of asymmetric ROCs. This outcome suggests that data aggregation may not be as problematic as typically portrayed. Does a similar situation hold here? To find out, we checked whether we found evidence against the EVSDT when aggregating across the different sources of variability. In the first of these reanalyses, we did not include stimulus-specific differences and aggregated the data within participants (model “no \({\eta ^{x}_{s}}\)”). This model resulted in parameter estimates that were nearly identical to those of the original model (see Table 4), in line with the earlier observation that the stimulus-specific effects were rather modest. However, the credible intervals for \(\frac {\sigma _{V}}{\sigma _{I}}\) were markedly narrower when compared with the original model. This result indicates that data aggregation can affect parameter estimates by attributing to them an unwarranted degree of certainty.
In the second reanalysis, we only analyzed the data aggregated at the study level, ignoring both stimulus-specific and participant-specific effects (model “no \(\xi ^{x}\), no \({\eta ^{x}_{s}}\)”).^{12} This is the analysis most often performed in previous work on reasoning ROCs (e.g., Dube et al., 2010, although we employed a Bayesian approach here as well). For this model (see Table 4) we now find rather strong differences relative to the other variants. Furthermore, for the unbelievable syllogisms we now find clear evidence against EVSDT, with \(\frac {\sigma _{V}}{\sigma _{I}} = 1.21\) [1.14, 1.28].
Taken together, these reanalyses reinforce two important points. First, ignoring random variability that is part of the data can lead to aggregation artifacts such as evidence for the unconstrained SDT model although the simpler EVSDT model is in fact more likely to be the datagenerating model. This also explains why earlier studies found such evidence. Second, even in cases in which the random variability does not distort the parameter estimates in dramatic ways it can still lead to estimates purporting a precision that is not actually warranted by the data. Both of these results reinforce the dictum of Barr et al., (2013): always employ the maximal randomeffects structure justified by the design (see also Schielzeth & Forstmeier, 2009).
Parameterrecovery simulation
In the second step we evaluated our ability to recover model parameters. The idea here is that we should be confident about our results only if we can demonstrate that our hierarchical Bayesian SDT model can recover the data-generating parameters. Specifically, we evaluated our ability to recover parameters when the generated data are not in line with the EVSDT, with \(\frac {\sigma _{V}}{\sigma _{I}} = 1.50\) (a value that is in line with the estimates obtained in other domains; e.g., Starns et al., 2012). In this simulation, we relied on the parameter estimates obtained from the present meta-analysis in order to have realistic individual parameter values. Specifically, we generated one data set identical in size to the original data from the parameter estimates obtained from the original model, with the sole difference that \(\sigma _{V} = 1.5 \times \sigma _{I}\), and then used the original model to fit the data. We were able to recover parameter estimates that were at odds with the EVSDT. Table 4 reports the results obtained from the group-level estimates, which are close to the data-generating parameters (compared with the parameter estimates obtained in the meta-analysis, also reported in Table 4), reinforcing our trust in the present results. These results also dismiss the concern that the ROC datasets have limited diagnostic value, as some of them appear to cover only part of the possible range of hit and false-alarm values. If the data were not diagnostic for detecting asymmetries, then the present recovery of the \(\frac {\sigma _{V}}{\sigma _{I}}\) ratio would not have been expected.
Having established that our meta-analytic results are trustworthy and the data diagnostic, we now present data from an experiment featuring a critical test of our main novel finding: ROC symmetry.
A critical test of ROC symmetry
So far, we have estimated the shape of ROC data on the sole basis of participants’ confidence-rating judgments. An exclusive reliance on such data may be problematic: it is possible that researchers relying on a single type of data can fall victim to mono-operation biases (Shadish et al., 2002, Chap. 3). Indeed, there is the question of whether ROCs obtained with confidence ratings match ROCs obtained with other methods (e.g., response-bias or payoff manipulations; see Klauer & Kellen, 2010; Klauer, 2011; Kellen et al., 2013). Furthermore, it has been suggested that the mere act of collecting confidence ratings may critically alter the decision process (Malmberg, 2002). Ideally, one would seek converging evidence for the meta-analytic results supporting ROC symmetry using novel data from alternative experimental paradigms.
One approach would consist of collecting ROC data without relying on confidence-rating judgments, instead using response-bias or payoff manipulations. This approach is problematic in many ways: on a practical level, participants tend to be quite conservative when it comes to shifting their response criteria across response-bias conditions, leading to ROC points that are too close together to evaluate the overall shape of the ROC (e.g., Dube & Rotello, 2012). On a theoretical level, there is a risk that individuals do not maintain the same level of discriminability across response-bias conditions, compromising ROC analysis (which assumes that discriminability remains constant; see Balakrishnan, 1999; Bröder & Malejka, 2016; Van Zandt, 2000).
In order to sidestep these issues, we conducted a critical test of ROC symmetry that capitalizes on an overlooked property of SDT that was originally established by Iverson and Bamber (1997). In a result known as the Generalized Area Theorem, Iverson and Bamber showed that the ROC function of a decision maker can be characterized by his/her performance across different M-alternative forced-choice trials in which one tries to identify the target stimulus (e.g., the valid syllogism) among \(M-1\) lure stimuli (e.g., invalid syllogisms). Specifically, the proportion of correct responses in an M-alternative forced-choice (M-AFC) task corresponds to the \((M-1)\)th moment of the ROC function (for a detailed discussion, see Kellen, 2018). This result is completely nonparametric as it does not hinge on the latent distributions taking on a specific parametric form (i.e., the distributions do not have to be Gaussian). The Area Theorem popularized by Green (see Green & Moses, 1966), which states that the proportion of correct responses in a 2AFC task corresponds to the area under the ROC function (i.e., the function’s expected value or first moment), is an instance of the Generalized Area Theorem.
Iverson and Bamber (1997) showed that the Generalized Area Theorem also enables ROC symmetry to be tested on the basis of M-alternative forced-choice judgments. Consider a complementary forced-choice task, designated here as MC-AFC, in which the decision maker is requested to identify the lure stimulus among \(M-1\) target stimuli. For example, in a 4-AFC task the decision maker is presented with three invalid syllogisms and one valid syllogism and has to pick the valid one, whereas in the 4C-AFC task the decision maker is presented with one invalid syllogism and three valid ones and has to pick the invalid one. It can be shown that an ROC function is symmetric (Killeen & Taylor, 2004) if and only if, for all M, the proportions of correct judgments in the M-AFC and MC-AFC tasks are the same (for details, see Iverson & Bamber, 1997).
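The logic of this symmetry test can be illustrated with a small Monte Carlo sketch in Python. It assumes Gaussian strength distributions, an invalid-syllogism distribution N(0, 1), and a decision rule in which the strongest alternative is chosen in the M-AFC task and the weakest in the complementary task. The parameter values, the decision rule, and the function name are assumptions made for illustration and do not reproduce the analysis of the experiment reported below.

```python
import numpy as np


def forced_choice_accuracy(m, mu_v=1.0, sigma_v=1.0, n_trials=200_000, seed=0):
    """Simulated proportion correct in the M-AFC task and in its complementary task."""
    rng = np.random.default_rng(seed)

    # M-AFC: one valid target among m - 1 invalid lures; choose the strongest item.
    target = rng.normal(mu_v, sigma_v, size=n_trials)
    lures = rng.normal(0.0, 1.0, size=(n_trials, m - 1))
    p_mafc = np.mean(target > lures.max(axis=1))

    # Complementary task: one invalid lure among m - 1 valid targets; choose the weakest item.
    lure = rng.normal(0.0, 1.0, size=n_trials)
    targets = rng.normal(mu_v, sigma_v, size=(n_trials, m - 1))
    p_mcafc = np.mean(lure < targets.min(axis=1))

    return p_mafc, p_mcafc


for sigma_v in (1.0, 1.5):  # equal-variance (symmetric ROC) vs. unequal-variance case
    for m in (2, 3, 4):
        p_target, p_lure = forced_choice_accuracy(m, sigma_v=sigma_v)
        print(f"sigma_V = {sigma_v}, M = {m}: "
              f"choose valid = {p_target:.3f}, choose invalid = {p_lure:.3f}")
```

With equal variances the two proportions agree up to simulation noise for every M, whereas with \(\sigma _{V} > \sigma _{I}\) they diverge for M greater than 2. For M = 2 the two tasks present the same pair of alternatives, so they cannot be distinguished by this test.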
Method
Participants
We collected data in an online webbased study advertised on Amazon Mechanical Turk with a predetermined stopping rule of 125 participants. Participants were paid 1.25 USD for their participation, which took approximately 20 min. Ethical approval for the study was granted by the Office of Research Ethics at the University of Waterloo, Canada.
Procedure
Given the possibility that online data are noisier than equivalent lab data, we built in a number of checks to ensure that data quality was sufficiently high. After participants agreed to take part in the experiment, an informed consent page was presented. After providing informed consent by clicking a button saying “I Agree”, participants were presented with the following instructions:
Participants who did not correctly answer the control question within five attempts were not allowed to participate in the study (they were still paid). Participants who correctly answered the control question were presented with the next set of instructions, which read:
We tested the symmetry assumption in syllogistic reasoning using M-AFC and MC-AFC tasks for M = 2, 3, and 4. The participants were given 24 forced-choice trials containing two, three, or four abstract syllogisms side by side (M was manipulated within participants), either under instructions to choose the valid argument (M-AFC task) or under instructions to choose the invalid argument (MC-AFC task), in a blocked and counterbalanced design (four trials per cell of the design). In contrast with the data used in the meta-analysis, we did not manipulate the believability of the conclusions (for an application of 2-AFC to the study of belief bias, see Trippas et al., 2014).
Results
The individual choice data were analyzed with a hierarchical Bayesian probit-regression model that included the main effects of “number of alternatives” (two, three, or four) and “choice focus” (choose target or lure item), as well as their interaction. Weakly informative priors were set for all effects, with normal distributions with mean 0 and standard deviations of 4 and 16 being assigned to the intercept and slope coefficients, respectively. Here, our interest lies in whether there is a robust effect of “choice focus” (if there is, then the ROC is asymmetrical). When attempting to choose the invalid syllogism, the group-level estimates of correct-choice probabilities were .60 [.55, .65], .43 [.38, .49], and .34 [.29, .39] for M = 2, 3, and 4, respectively. When attempting to choose the valid syllogism, the analogous estimates were .64 [.58, .69], .43 [.38, .49], and .38 [.32, .43]. Both sets of estimates appear to be similar, in line with the notion that ROCs are symmetrical. Indeed, the main effect of “choice focus” was merely −.03 [−.08, .02]. We computed a Bayes factor that quantified the relative evidence in favor of the null hypothesis that the latter effect is zero (in contrast with the alternative hypothesis that it is not zero). The obtained value was 69.72, which indicates very strong evidence in favor of the null hypothesis. Overall, the results show that our argument for ROC symmetry does not exclusively hinge on data from confidence-rating paradigms, dismissing the notion of a mono-operation bias in our meta-analytic results. More importantly, they provide converging evidence using a novel paradigm, suggesting that the equal-variance SDT model is an appropriate model for belief bias in syllogistic reasoning. We discuss the implications of this experiment and the meta-analysis in the next section.
Discussion
We can extract two take-home messages from the meta-analysis and critical experimental test: (1) Judgments in syllogistic reasoning seem to be well accounted for by the EVSDT model, which in turn is equivalent to a probit-regression model. (2) Individuals show the same discriminability between valid and invalid syllogisms for believable and unbelievable syllogisms. These two results have serious implications on an empirical, methodological, and theoretical level. On an empirical level, the fact that the EVSDT model can be applied to binary judgments means that one can safely revisit a large body of work, as long as participant- and stimulus-level differences are taken into account. EVSDT appears to fail when performance is at ceiling (e.g., Study 2), but such performance levels are very far from what is typically observed in syllogistic reasoning studies, in which many errors are made and the focus is placed on the nature of such errors (e.g., Khemlani & Johnson-Laird, 2012). Altogether, the routine collection of confidence ratings does not seem necessary for the appropriate measurement of belief bias, though we hasten to add that doing so could certainly be of interest from a metacognitive perspective (Ackerman & Thompson, 2015; Thompson et al., 2011). Finally, on a theoretical level, the results seem to corroborate Dube et al. (2010) in the sense that the lack of an effect of believability on discriminability is at odds with nearly all extant theories of syllogistic reasoning, at least as long as one does not take further individual characteristics into account, as done below.
Meta-analyses are typically conducted with the goal of obtaining a “final word” on a given subject. In the present case, we reject such a view. Instead, we believe that our results should be framed as establishing a new starting point for research on syllogistic reasoning. This starting point involves the incorporation of some important facts: the exact way in which we relate data and theoretical constructs matters; differences across studies, participants, and stimuli matter; and ignoring any of the latter should be seen as dangerous and misinformative. Based on this standpoint, we will dedicate the remainder of this paper to the discussion of how one can build upon the present work and develop better and more comprehensive characterizations of deductive reasoning.
Relating individual reasoning abilities and theories of belief bias
The hierarchical Bayesian SDT approach used here incorporates many state-of-the-art methods that deal with different confounds such as the heterogeneity found at the level of participants and stimuli. At this point, we do not see how one could significantly improve upon the present approach based on the available data alone. But despite the merits of such an approach, we believe that some important limitations still need to be addressed. Chief among them is the fact that although the model can capture individual differences, it is completely silent regarding any of the factors that underlie them. Given the considerable body of work showing that different groups of individuals attempt to reason in qualitatively distinct ways (e.g., Stupple et al., 2011), it is extremely likely that the inclusion of additional individual-level information will reveal new patterns and insights that have so far only been investigated using the SDT model applied to aggregate data (Trippas et al., 2013, 2014, 2015). In particular, these studies suggest that the addition of idiographic information might lead to a reframing of current theories of syllogistic reasoning rather than the strong dismissal suggested by the lack of an effect of believability on reasoning accuracy reported here.
Let us entertain the hypothesis that a sample of participants is composed of members of two groups, M and T: Group M consists of people who reason in accordance with the mental-model theory (Oakhill et al., 1989) given their stronger tendency to manifest an analytic cognitive style (e.g., Pennycook et al., 2015). By reasoning in accordance with the principles of the mental-model theory, they will typically reason better for unbelievable syllogisms, as these conclusions will trigger a search for counterexamples. Group T is made up of participants who, by having a lower tendency to manifest an analytic cognitive style, tend to reason in accordance with the transitive-chain theory (Guyote & Sternberg, 1981). These people are then expected to reason worse for unbelievable syllogisms than for believable ones, as the unbelievable contents are more challenging to manipulate mentally. Analyzing data from such an experiment under the assumption that everybody uses some variation of the same reasoning strategy is likely to yield the incorrect conclusion that beliefs do not affect discriminability (as the differences in discriminability found in both groups can cancel each other out), in line with Dube et al.’s (2010) account.
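The cancellation argument can be made tangible with a toy calculation. The sketch below is purely illustrative and rests on several assumptions that are not taken from any dataset: the two groups are equally sized, the discriminability values are hypothetical, each group follows an equal-variance SDT model, and the response criterion sits midway between the two distributions.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def endorsement_rates(d_prime, criterion):
    """Hit and false-alarm rates under an equal-variance SDT model."""
    hit = 1.0 - STD.cdf(criterion - d_prime)  # P("valid" | valid)
    fa = 1.0 - STD.cdf(criterion)             # P("valid" | invalid)
    return hit, fa


def d_prime_from_rates(hit, fa):
    """Discriminability implied by a pair of endorsement rates."""
    return STD.inv_cdf(hit) - STD.inv_cdf(fa)


# Hypothetical discriminability values (d') for two equally sized groups.
GROUPS = {
    "M": {"believable": 1.0, "unbelievable": 1.6},  # better when unbelievable
    "T": {"believable": 1.6, "unbelievable": 1.0},  # worse when unbelievable
}


def aggregate_d_prime(believability):
    """d' computed from endorsement rates averaged over both groups."""
    rates = [
        endorsement_rates(g[believability], g[believability] / 2.0)
        for g in GROUPS.values()
    ]
    mean_hit = sum(h for h, _ in rates) / len(rates)
    mean_fa = sum(f for _, f in rates) / len(rates)
    return d_prime_from_rates(mean_hit, mean_fa)


print(aggregate_d_prime("believable"), aggregate_d_prime("unbelievable"))
```

Although believability changes discriminability within each hypothetical group, the aggregate estimates for believable and unbelievable syllogisms are identical in this mirrored setup, so an analysis that ignores group membership would report no effect.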
This example can be made more concrete by reanalyzing Study 14 (Trippas et al., 2015), a large-sample study (N = 191) in which additional individual information was available for 182 participants in the form of the Cognitive Reflection Test (CRT; Frederick, 2005). The CRT consists of three simple but surprisingly tricky problems that have been shown to capture individual differences in analytic cognitive style, that is, the degree to which a participant tends to engage in analytical thought (Pennycook et al., 2016; Toplak et al., 2011). As an example, consider the following question from the CRT (the widgets problem): “If a factory with 100 workers produces 100 widgets in 100 days, how many days would it take for 5 workers to produce 5 widgets?” The intuitive response (based on a matching heuristic) is “5 days”. However, the correct response is in fact “100 days”: after all, the problem premise entails that it takes 1 worker 100 days to produce 1 widget. We classified people who responded correctly to at least one problem as part of the “analytic” group (N = 111). People who responded incorrectly to all three problems were classified as part of the “intuitive” group (N = 71).
In our view, the statistical model used in this reanalysis should be considered as the new standard in analyzing endorsement rates in syllogistic reasoning: (1) It respects the nature of the data (categorical responses), (2) it is based on a validated EVSDT model, (3) it takes into account the heterogeneity found across participants and stimuli, and (4) it can be easily extended to include additional covariates. This model can also be conveniently implemented by researchers. Here, we relied on the R package rstanarm (Gabry & Goodrich, 2016). Appendix B provides details on how the model is specified (a complete script along with the data can be found in our supplemental material, which is hosted on the Open Science Framework (OSF) at https://osf.io/8dfyv/).
Beyond pure (SDT) model and single-task approaches
Throughout this manuscript, we exclusively relied on the SDT model framework. However, this is not the only approach that could be successfully adopted. For instance, many researchers rely on discrete-state models based on multinomial processing trees (for an overview, see Batchelder & Riefer, 1999; Erdfelder et al., 2009). Instead of describing responses in terms of continuous latent representations (e.g., distributions on an argument-strength scale), these models assume that responses are produced by a finite mixture of discrete cognitive states that are entered probabilistically. For example, Klauer et al. (2000) considered a discrete-state model in which the true logical status of a valid syllogism is detected with a certain probability (e.g., probability \(D_{v}\)), a state in which a correct judgment is invariably made. When the logical status of a valid syllogism is not detected (e.g., with probability \(1-D_{v}\)), the model assumes that individuals simply guess whether the syllogism is valid or invalid (with probabilities g and \(1-g\), respectively). By testing detection probabilities and guessing biases across different types of syllogisms and experimental conditions, Klauer et al. were able to establish a testbed for the predictions of many different models of syllogistic reasoning.
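The response probabilities implied by this description for a valid syllogism can be written out directly. The following is a minimal sketch covering only the detect-or-guess branch described above; the function names and the example parameter values are hypothetical, and the full model of Klauer et al. (2000) contains further parameters that are not represented here.

```python
def p_valid_response_given_valid(d_v, g):
    """P("valid" | valid syllogism): detect, or fail to detect and guess "valid"."""
    if not (0.0 <= d_v <= 1.0 and 0.0 <= g <= 1.0):
        raise ValueError("d_v and g must be probabilities in [0, 1]")
    return d_v + (1.0 - d_v) * g


def p_invalid_response_given_valid(d_v, g):
    """P("invalid" | valid syllogism): fail to detect and guess "invalid"."""
    if not (0.0 <= d_v <= 1.0 and 0.0 <= g <= 1.0):
        raise ValueError("d_v and g must be probabilities in [0, 1]")
    return (1.0 - d_v) * (1.0 - g)


# Hypothetical parameter values, chosen only for illustration.
d_v, g = 0.4, 0.5
print(p_valid_response_given_valid(d_v, g))
print(p_invalid_response_given_valid(d_v, g))
```

The two branch probabilities sum to one for any admissible pair of parameters, which is what makes the tree a proper multinomial model for this item type.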
Several successful discrete-state approaches can be found in the reasoning literature, outside of the context of the belief-bias effect discussed here (Böckenholt, 2012b; Campitelli & Gerrans, 2014; Oberauer, 2006; Oberauer et al., 2006; Klauer et al., 2007; Krauth, 1982). For example, Klauer et al. (2007) developed a discrete-state model for the classic Wason selection task (Wason, 1966), which requires participants to decide which of four cards needs to be flipped in order to test a given rule (“If there is an A on the letter side, then there is a 3 on the number side”). This discrete-state model establishes how the observed responses (among the 16 possible combinations of card turns) can result from different interpretations of the rule (e.g., conditional versus biconditional interpretation), the types of inferences considered (forward versus backward), and their perceived sufficiency or necessity (see also Oberauer, 2006). Another example worth mentioning is the cognitive-miser model originally proposed by Böckenholt (2012b) and further developed by Campitelli and Gerrans (2014). This model, which is used to characterize responses from an extended version of the CRT, allows for the estimation of thinking dispositions and mathematical abilities by establishing parameters reflecting the probability of successful response inhibition and deliberative processing being engaged.
There is a decade-long debate among SDT and discrete-state modelers on the relative merits of the two approaches in several psychological domains (Batchelder et al., 1994; Dube & Rotello, 2012; Dube et al., 2010, 2012; Kellen & Klauer, 2011; Kellen, 2014; Kellen et al., 2013, 2015; Kinchla, 1994; for reviews, see Pazzaglia et al., 2013; Batchelder & Alexander, 2013; Dube et al., 2013).^{13} From this heated debate, two constructive points are often overlooked: First, there is some consensus that the two modeling approaches seem to be particularly successful in certain types of domains and paradigms. For instance, discrete-state approaches allow for a clearer separation between mental states and their mapping onto observed responses, which has enabled researchers to develop a wide range of methods to account for individual differences in response styles (see Böckenholt, 2012a; Klauer & Kellen, 2010). Second, the two modeling approaches can be conveniently integrated in order to create hybrid models that simultaneously account for different kinds of data. As pointed out by Klauer and Kellen (2011), the parameters expressing the probability of different discrete states being entered can be easily specified as a function of continuous distributions like the ones postulated by SDT (see also Klauer, 2010).
A combination of these modeling approaches, particularly when done in a hierarchical Bayesian fashion, opens very promising avenues of research. For instance, one can integrate the cognitive-miser and SDT models in order to further explore the relationships between different reasoning theories and the belief-bias effect. Moreover, one can develop hybrid models that bridge the gap between different types of data that are relevant for theories of syllogistic reasoning. For example, Khemlani and Johnson-Laird (2012) tested a large set of models of syllogistic reasoning using data from a conclusion-generation task in which participants attempted to produce a conclusion from a given pair of premises. The categorical data coming from this task (note that participants can produce many types of conclusions) could be conveniently modeled by means of discrete states. It would be interesting to try to link the parameters describing the probabilities of such states being entered with the argument-strength distributions that underlie the SDT modeling of endorsement rates. The joint modeling of both tasks simultaneously could help researchers to better understand the general and task-specific aspects of the data (e.g., the previously discussed fact that the difficulty of invalid syllogisms appears to differ between tasks). These joint-modeling efforts seem particularly important when considering the recent efforts to integrate different reasoning abilities within a single framework (e.g., Stanovich et al., 2016; Thompson, 2000).
Playing a more ambitious game
Allen Newell famously stated that one cannot hope to play “20 Questions” with Nature and win. Khemlani and Johnson-Laird (2012) faced such a humbling situation when failing to find a theory that successfully accounted for 64 different syllogistic forms. The difficulties associated with describing the wide range of syllogisms available have led many researchers to focus their efforts on a few cases only. Despite its practical appeal, this strategy has led to the present case in which the 22 reanalyzed datasets largely focused on 17 syllogistic forms. Another advantage of the hierarchical Bayesian SDT approach advocated here is that it allows for a characterization of the different syllogistic forms without any form of aggregation (note that Khemlani and Johnson-Laird, 2012, relied on aggregate data) that can later guide us towards more comprehensive theories of syllogistic reasoning. In fact, one could in principle connect the SDT model with more fine-grained computational theories by constraining the parameters of the former to be a function of the mechanisms of the latter (for examples in the context of recognition memory, see Brandt, 2007; Osth & Dennis, 2015).
Last but not least, future work should attempt to go beyond acceptance rates and incorporate the time taken for making these judgments. For instance, one can rely on the drift-diffusion model (e.g., Ratcliff & Rouder, 1998), which can be seen as a dynamic extension of the SDT model used here. However, note that other options are available, including the use of a dynamic discrete-state model approach (e.g., Klauer, 2018). Although response times have not played a significant role in this literature, they nevertheless introduce important theoretical constraints (e.g., Trippas et al., 2017). This state of affairs is partly due to the difficulties associated with fitting such models when individual data are sparse. But fortunately, some of these difficulties have been relaxed due to the development of hierarchical Bayesian extensions (e.g., Vandekerckhove et al., 2011).
Stop worrying about data sparseness and embrace partial pooling
As discussed earlier, one of the challenges experimental psychologists regularly face is the sparseness of data. One obvious way to ameliorate this sparseness is to maximize the number of responses per individual. However, the notion that more data is necessarily better is a dangerous one, especially when dealing with higher cognitive faculties. For instance, there is the risk that the way individuals engage with syllogisms depends on their expected workload (e.g., number of syllogisms to be evaluated) throughout the experiment. For example, the studies by Klauer et al. (2000) relied on a large number of participants evaluating a small number of syllogisms each (as few as eight syllogisms). In contrast, studies with the goal of obtaining ROC data, such as virtually all of the studies we considered, involved larger numbers ranging from 16 to 64 syllogisms. It is possible that this difference can explain to some degree the discrepancies found in these studies regarding the effect of conclusion believability on participants’ discriminability. When relying on a hierarchical Bayesian approach, one can avoid a maximization strategy by capitalizing on the principle of partial pooling, namely that the similarities among participants will inform the estimation of individual-level parameters. The sparseness found at the individual level can be compensated for by a reliance on large participant samples that can be conveniently collected online, for example. The advantages of hierarchical Bayesian modeling would also hold in the case of incomplete experimental designs that attempt to sidestep time constraints, fatigue, learning, or carry-over effects (Little & Rubin, 1997; Schafer, 1997). For example, partial pooling would improve parameter estimation in an experiment in which participants engage in different tasks and encounter different stimuli (e.g., Thompson, 2000), but not all participants engage in the same set of tasks and/or encounter the same set of stimuli.
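The principle of partial pooling can be illustrated with the simplest possible case. The sketch below assumes a normal-normal model with known variances, which is a deliberate simplification and not the hierarchical probit model used in this paper; the function name and all numerical values are hypothetical. In that setting, the posterior mean of an individual's parameter is a precision-weighted average of the individual's own mean and the group mean.

```python
def partially_pooled_mean(individual_mean, n_obs, group_mean,
                          within_var, between_var):
    """Posterior mean of an individual-level parameter in a normal-normal
    model with known variances (a simplified stand-in for partial pooling)."""
    if n_obs <= 0 or within_var <= 0 or between_var <= 0:
        raise ValueError("n_obs and both variances must be positive")
    sampling_var = within_var / n_obs                 # variance of the individual mean
    weight = between_var / (between_var + sampling_var)  # weight on own data
    return weight * individual_mean + (1.0 - weight) * group_mean


# Hypothetical participant whose raw mean (2.0) lies above the group mean (1.0).
for n_obs in (8, 64):
    print(n_obs, partially_pooled_mean(2.0, n_obs, 1.0,
                                       within_var=4.0, between_var=0.25))
```

With few observations the estimate is pulled strongly toward the group mean, and with more observations it stays close to the participant's own mean. This is the sense in which information from a large sample of participants can compensate for sparse data at the individual level.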
Footnotes
 1.
Dube et al. (2010) explain their results in terms of a criterion-shift account (“it’s a response bias effect”). However, as shown in detail below and elsewhere (Wickens & Hirshman, 2000; Singmann & Kellen, 2013), this interpretation is not entirely justified due to an identifiability problem in their model. In the current article, we will therefore refrain from adopting this interpretation and only consider whether or not we find differences in discriminability between believable and unbelievable syllogisms.
 2.
It is worth noting that Klauer et al. (2000) also advocate the use of extended experimental designs that yield data that are to be fitted with an unconstrained model.
 3.
Recasting the EVSDT model as a probit regression model highlights an important identifiability issue in SDT. In \(2\times 2\) designs involving two pairs of distributions, SDT cannot distinguish a shift in response bias from a shift of a pair of distributions. Specifically, note that \(\beta _{B}\) can be understood as a shift in argument strength imposed on the distributions for believable valid and invalid syllogisms (for unbelievable syllogisms, \(\mu _{I} = 0\) and \(\mu _{V} = \beta _{L}\); for believable, \(\mu _{I} = \beta _{B}\) and \(\mu _{V} = \beta _{L} + \beta _{B}\)), or alternatively, interpreted as a shift of the response criterion (for unbelievable syllogisms, \(\tau = \beta _{0} \); for believable, \(\tau = \beta _{0} + \beta _{B}\)). For a detailed discussion of this issue, see Singmann (2014) and Wickens and Hirshman (2000). This identifiability constraint in SDT also implies that the interaction parameter \(\beta _{LB}\) captures changes in discriminability, as discussed above for the linear model.
 4.
We generated SDT predictions for a six-point ROC by establishing five equally spaced \(\tau \) criteria between −1.64 and 1.64 (these criteria lead to cumulative false-alarm rates ranging from .05 to .95). These predictions were then fitted with an SDT model using maximum-likelihood estimation (using the methods implemented in Singmann & Kellen, 2013).
 5.
Note that this linear model does not include the possibility of participant \(\times \) stimulus interactions. Such interactions cannot be estimated in the present context because we only have one observation per participant-stimulus pairing (Christensen, 2011). However, the absence of such interactions is not particularly troubling given that they are expected to have a reduced impact on parameter estimates (see Rouder et al., 2008).
 6.
Because we had no prior knowledge about the distribution of the study-specific variances, we assumed that the square roots of the variances (i.e., the standard deviations) follow a half-Cauchy distribution with location \({\bar \sigma _{\xi ^{x}}}\) and scale \({\gamma _{\xi ^{x}}}\) (we preferred the Cauchy over the normal distribution here because of the fatter tails of the former):
$$ \sigma_{\xi^{x},h} \sim \text{Cauchy}^{+}({\bar\sigma_{\xi^{x}}}, {\gamma_{\xi^{x}}}). \tag{24} $$
 7.
In the meta-analytic literature the between-study error variance is commonly referred to as \(\tau ^{2}\). As we use \(\tau \) to refer to a parameter of the signal-detection model, we use \(\upsilon ^{2}\) to refer to the between-study error variance.
 8.
To achieve model convergence (i.e., \(\hat {R}\) values below 1.05 and no so-called “divergent transitions” that can appear in Hamiltonian Monte Carlo), we had to fix the standard deviation of the Gaussian distribution for the middle criterion to a small value (i.e., .1), as the estimate otherwise tried to converge on 0 (which is an impossible value for a standard deviation). This suggests that there was little variability in the position of the central response criterion delineating “valid” and “invalid” decisions across participants.
 9.
 10.
For reasons of numerical stability, we opted for testing a small range of values rather than a point estimate \(\left (\frac {\sigma _{V}}{\sigma _{I}} = 1\right )\).
 11.
In Fig. 10, we did not include the syllogisms presented in Study 2. Because these syllogisms were only included in this study, it is difficult to completely disentangle their effects (captured by \({\eta ^{x}_{s}}\)) from the observed study-specific difference (parameter \(\chi ^{x}\)) associated with Study 2. These parameter estimates are reported in our supplemental material, which is hosted on the Open Science Framework (OSF) at https://osf.io/8dfyv/.
 12.
Due to the absence of participant-specific effects (and thereby estimates of the within-study variability), this model does not implement a ‘random-effects meta-analysis’. Instead, this is a simple hierarchical model with one multivariate-normal group-level distribution for the study effects.
 13.
One criticism is that the interaction index corresponds to a special case of a specific discrete-state model, namely a restricted two-high-threshold model (Dube et al., 2010). However, it is important to keep in mind that the shortcomings of a specific discrete-state model do not necessarily generalize to the class of discrete-state models as a whole. In fact, when a more appropriate two-high-threshold model that can account for ROC curvature is used (Klauer & Kellen, 2010), one obtains a characterization of the data that is similar to the SDT model’s (Klauer & Kellen, 2011).
Notes
Acknowledgments
We thank Evan Heit and Caren Rotello for providing us with raw data. Part of this work was presented at the International Conference on Thinking 2016. David Kellen and Henrik Singmann received support from the Swiss National Science Foundation Grant 100014_165591. The data and modeling scripts are available at: https://osf.io/8dfyv/. Open access funding provided by Max Planck Society.
References
 Ackerman, R., & Thompson, V.A. (2015). Meta-reasoning: What can we learn from meta-memory? In A. Feeney & V. A. Thompson (Eds.), Reasoning as memory (pp. 164–182).
 Ahn, W.-Y., Krawitz, A., Kim, W., Busemeyer, J. R., & Brown, J. W. (2011). A model-based fMRI analysis with hierarchical Bayesian parameter estimation. Journal of Neuroscience, Psychology, and Economics. Methods in Decision Neuroscience, 4(2), 95–110. https://doi.org/10.1037/a0020684
 Anderson, R.B., & Tweney, R.D. (1997). Artifactual power curves in forgetting. Memory & Cognition, 25(5), 724–730. https://doi.org/10.3758/BF03211315
 Baayen, H., Davidson, D.J., & Bates, D. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390–412. https://doi.org/10.1016/j.jml.2007.12.005
 Balakrishnan, J.D. (1999). Decision processes in discrimination: Fundamental misrepresentations of signal detection theory. Journal of Experimental Psychology: Human Perception and Performance, 25(5), 1189–1206. https://doi.org/10.1037/0096-1523.25.5.1189
 Ball, L.J., Phillips, P., Wade, C. N., & Quayle, J. D. (2006). Effects of belief and logic on syllogistic reasoning: Eye-movement evidence for selective processing models. Experimental Psychology, 53(1), 77–86.
 Bamber, D., & van Santen, J.P.H. (2000). How to assess a model’s testability and identifiability. Journal of Mathematical Psychology, 44(1), 20–40. https://doi.org/10.1006/jmps.1999.1275
 Barr, D.J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278. https://doi.org/10.1016/j.jml.2012.11.001
 Batchelder, W.H., & Alexander, G.E. (2013). Discrete-state models: Comment on Pazzaglia, Dube, and Rotello (2013). Psychological Bulletin, 139, 1204–1212. https://doi.org/10.1037/a0033894
 Batchelder, W.H., Riefer, D. M., & Hu, X. (1994). Measuring memory factors in source monitoring: Reply to Kinchla. Psychological Review, 101, 172–176. https://doi.org/10.1037//0033-295X.101.1.172
 Batchelder, W.H., & Riefer, D.M. (1999). Theoretical and empirical review of multinomial process tree modeling. Psychonomic Bulletin & Review, 6(1), 57–86. https://doi.org/10.3758/BF03210812
 Batson, D.C. (1975). Rational processing or rationalization? The effect of disconfirming information on a stated religious belief. Journal of Personality and Social Psychology, 32(1), 176–184. https://doi.org/10.1037/h0076771
 Böckenholt, U. (2012a). Measuring response styles in Likert items. Psychological Methods. https://doi.org/10.1037/met0000106
 Böckenholt, U. (2012b). The cognitive-miser response model: Testing for intuitive and deliberate reasoning. Psychometrika, 77(2), 388–399. https://doi.org/10.1007/s11336-012-9251-y
 Borenstein, M., Hedges, L. V., Higgins, J. P., & Rothstein, H. R. (2010). A basic introduction to fixed-effect and random-effects models for meta-analysis. Research Synthesis Methods, 1(2), 97–111. https://doi.org/10.1002/jrsm.12
 Brandt, M. (2007). Bridging the gap between measurement models and theories of human memory. Zeitschrift für Psychologie/Journal of Psychology, 215(1), 72–85. https://doi.org/10.1027/0044-3409.215.1.72
 Bransford, J.D., & Johnson, M.K. (1972). Contextual prerequisites for understanding: Some investigations of comprehension and recall. Journal of Verbal Learning and Verbal Behavior, 11(6), 717–726. https://doi.org/10.1016/S0022-5371(72)80006-9
 Bröder, A., & Malejka, S. (2016). On a problematic procedure to manipulate response biases in recognition experiments: The case of implied base rates. Memory, 1–8. https://doi.org/10.1080/09658211.2016.1214735
 Campitelli, G., & Gerrans, P. (2014). Does the cognitive reflection test measure cognitive reflection? A mathematical modeling approach. Memory & Cognition, 42(3), 434–447. https://doi.org/10.3758/s13421-013-0367-9
 Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., ..., Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 1–32. https://doi.org/10.18637/jss.v076.i01
 Chater, N., & Oaksford, M. (1999). The probability heuristics model of syllogistic reasoning. Cognitive Psychology, 38(2), 191–258. https://doi.org/10.1006/cogp.1998.0696
 Cherubini, P., Garnham, A., Oakhill, J., & Morley, E. (1998). Can any ostrich fly?: Some new data on belief bias in syllogistic reasoning. Cognition, 69(2), 179–218. https://doi.org/10.1016/S0010-0277(98)00064-X
 Christensen, R. (2011). Plane answers to complex questions: The theory of linear models. Springer Science & Business Media.
 Cohen, A.L., Sanborn, A.N., & Shiffrin, R.M. (2008). Model evaluation using grouped or individual data. Psychonomic Bulletin & Review, 15(4), 692–712. https://doi.org/10.3758/PBR.15.4.692
 Condorcet, M.D.E. (1785). Essay on the application of analysis to the probability of majority decisions. Paris: Imprimerie Royale.
 Dawson, E., Gilovich, T., & Regan, D.T. (2002). Motivated reasoning and performance on the Wason selection task. Personality and Social Psychology Bulletin, 28(10), 1379–1387. https://doi.org/10.1177/014616702236869
 DeCarlo, L.T. (1998). Signal detection theory and generalized linear models. Psychological Methods, 3(2), 186–205. https://doi.org/10.1037/1082989X.3.2.186
 DeCarlo, L.T. (2011). Signal detection theory with item effects. Journal of Mathematical Psychology, 55(3), 229–239. https://doi.org/10.1016/j.jmp.2011.01.002
 Dube, C., Rotello, C., & Pazzaglia, A. (2013). The statistical accuracy and theoretical status of discrete-state MPT models: Reply to Batchelder and Alexander (2013). Psychological Bulletin, 139, 1213–1220. https://doi.org/10.1037/a0034453
 Dube, C., & Rotello, C.M. (2012). Binary ROCs in perception and recognition memory are curved. Journal of Experimental Psychology: Learning, Memory, and Cognition, 38(1), 130–151. https://doi.org/10.1037/a0024957
 Dube, C., Rotello, C.M., & Heit, E. (2010). Assessing the belief bias effect with ROCs: It’s a response bias effect. Psychological Review, 117(3), 831–863. https://doi.org/10.1037/a0019634
 Dube, C., Rotello, C. M., & Heit, E. (2011). The belief bias effect is aptly named: A reply to Klauer and Kellen (2011). Psychological Review, 118(1), 155–163. https://doi.org/10.1037/a0021774
 Dube, C., Starns, J. J., Rotello, C. M., & Ratcliff, R. (2012). Beyond ROC curvature: Strength effects and response time data support continuous-evidence models of recognition memory. Journal of Memory and Language, 67, 389–406. https://doi.org/10.1016/j.jml.2012.06.002
 Erdfelder, E., Auer, T.S., Hilbig, B. E., Aßfalg, A., Moshagen, M., & Nadarevic, L. (2009). Multinomial processing tree models. Zeitschrift für Psychologie/Journal of Psychology, 217(3), 108–124. https://doi.org/10.1027/00443409.217.3.108
 Estes, W.K. (1956). The problem of inference from curves based on group data. Psychological Bulletin, 53(2), 134–140. https://doi.org/10.1037/h0045156
 Estes, W. K., & Maddox, W. T. (2005). Risks of drawing inferences about cognitive processes from model fits to individual versus average performance. Psychonomic Bulletin & Review, 12(3), 403–408. https://doi.org/10.3758/BF03193784
 Evans, J.S.B.T. (2002). Logic and human reasoning: An assessment of the deduction paradigm. Psychological Bulletin, 128(6), 978–996. https://doi.org/10.1037//00332909.128.6.978
 Evans, J.S.B.T., Barston, J.L., & Pollard, P. (1983). On the conflict between logic and belief in syllogistic reasoning. Memory & Cognition, 11(3), 295–306. https://doi.org/10.3758/BF03196976
 Evans, J.S.B.T., & Curtis-Holmes, J. (2005). Rapid responding increases belief bias: Evidence for the dual-process theory of reasoning. Thinking & Reasoning, 11(4), 382–389. https://doi.org/10.1080/13546780542000005
 Evans, J.S.B.T., Handley, S.J., & Harper, C.N.J. (2001). Necessity, possibility and belief: A study of syllogistic reasoning. The Quarterly Journal of Experimental Psychology Section A, 54(3), 935–958. https://doi.org/10.1080/713755983
 Evans, J.S.B.T., & Stanovich, K.E. (2013). Dual-process theories of higher cognition: Advancing the debate. Perspectives on Psychological Science, 8(3), 223–241. https://doi.org/10.1177/1745691612460685
 Evans, J.S.B.T., Handley, S. J., Harper, C. N. J., & Johnson-Laird, P. N. (1999). Reasoning about necessity and possibility: A test of the mental model theory of deduction. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25(6), 1495–1513. https://doi.org/10.1037/02787393.25.6.1495
 Feather, N.T. (1964). Acceptance and rejection of arguments in relation to attitude strength, critical ability, and intolerance of inconsistency. The Journal of Abnormal and Social Psychology, 69(2), 127–136. https://doi.org/10.1037/h0046290
 Frederick, S. (2005). Cognitive reflection and decision making. The Journal of Economic Perspectives, 19(4), 25–42. https://doi.org/10.1257/089533005775196732
 Gabry, J., & Goodrich, B. (2016). Rstanarm: Bayesian applied regression modeling via Stan. R package version 2.13.1.
 Gelman, A., & Shalizi, C.R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66(1), 8–38. https://doi.org/10.1111/j.20448317.2011.02037.x
 Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis, 3rd Edn. Hoboken: CRC Press. ISBN: 9781439898208.
 Green, D.M., & Moses, F.L. (1966). On the equivalence of two recognition measures of short-term memory. Psychological Bulletin, 66(3), 228–234. https://doi.org/10.1037/h0023645
 Green, D.M., & Swets, J.A. (1966). Signal detection theory and psychophysics. New York: Wiley.
 Gronau, Q.F., Singmann, H., & Wagenmakers, E.J. (2017). Bridgesampling: An R package for estimating normalizing constants. arXiv:1710.08162 [stat]
 Guyote, M.J., & Sternberg, R.J. (1981). A transitive-chain theory of syllogistic reasoning. Cognitive Psychology, 13(4), 461–525. https://doi.org/10.1016/00100285(81)900189
 Haigh, M., Stewart, A.J., & Connell, L. (2013). Reasoning as we read: Establishing the probability of causal conditionals. Memory & Cognition, 41(1), 152–158. https://doi.org/10.3758/s1342101202500
 Heathcote, A., Brown, S., & Mewhort, D.J.K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7(2), 185–207. https://doi.org/10.3758/BF03212979
 Heit, E., & Rotello, C.M. (2014). Traditional difference-score analyses of reasoning are flawed. Cognition, 131(1), 75–91. https://doi.org/10.1016/j.cognition.2013.12.003
 Iverson, G., & Bamber, D. (1997). The generalized area theorem in signal detection theory. In Choice, decision, and measurement: Essays in honor of R. Duncan Luce (pp. 301–318). Hillsdale, NJ: Lawrence Erlbaum & Associates.
 Johnson-Laird, P.N. (1983). Mental models: Towards a cognitive science of language, inference, and consciousness. Cambridge: Harvard University Press.
 Johnson-Laird, P.N., & Byrne, R.M.J. (1991). Deduction. Lawrence Erlbaum Associates, Inc.
 Judd, C.M., Westfall, J., & Kenny, D.A. (2012). Treating stimuli as a random factor in social psychology: A new and comprehensive solution to a pervasive but largely ignored problem. Journal of Personality and Social Psychology, 103(1), 54–69. https://doi.org/10.1037/a0028347
 Kass, R.E., & Raftery, A.E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795. https://doi.org/10.1080/01621459.1995.10476572
 Katahira, K. (2016). How hierarchical models improve point estimates of model parameters at the individual level. Journal of Mathematical Psychology, 73, 37–58. https://doi.org/10.1016/j.jmp.2016.03.007
 Kaufmann, H., & Goldstein, S. (1967). The effects of emotional value of conclusions upon distortion in syllogistic reasoning. Psychonomic Science, 7(10), 367–368. https://doi.org/10.3758/BF03331127
 Kellen, D., & Klauer, K.C. (2011). Evaluating models of recognition memory using first- and second-choice responses. Journal of Mathematical Psychology, 55, 251–266. https://doi.org/10.1016/j.jmp.2010.11.004
 Kellen, D., & Klauer, K.C. (2014). Discrete-state and continuous models of recognition memory: Testing core properties under minimal assumptions. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40, 1795–1804. https://doi.org/10.1037/xlm0000016
 Kellen, D., & Klauer, K. C. (2018). Elementary signal detection and threshold theory. In J. T. Wixted (Ed.) Stevens’ handbook of experimental psychology and cognitive neuroscience (pp. 1–39). Wiley. https://doi.org/10.1002/9781119170174.epcn505
 Kellen, D., Klauer, K.C., & Bröder, A. (2013). Recognition memory models and binary-response ROCs: A comparison by minimum description length. Psychonomic Bulletin & Review, 20(4), 693–719. https://doi.org/10.3758/s1342301304072
 Kellen, D., Singmann, H., Vogt, J., & Klauer, K. C. (2015). Further evidence for discrete-state mediation in recognition memory. Experimental Psychology, 62, 40–53.
 Khemlani, S., & Johnson-Laird, P.N. (2012). Theories of the syllogism: A meta-analysis. Psychological Bulletin, 138(3), 427–457. https://doi.org/10.1037/a0026841
 Killeen, P.R., & Taylor, T.J. (2004). Symmetric receiver operating characteristics. Journal of Mathematical Psychology, 48(6), 432–434. https://doi.org/10.1016/j.jmp.2004.08.005
 Kinchla, R. A. (1994). Comments on Batchelder and Riefer’s multinomial model for source monitoring. Psychological Review, 101, 166–171. https://doi.org/10.1037//0033295x.101.1.166
 Klauer, K.C. (2010). Hierarchical multinomial processing tree models: A latent-trait approach. Psychometrika, 75(1), 70–98. https://doi.org/10.1007/s1133600991410
 Klauer, K.C., & Kellen, D. (2010). Toward a complete decision model of item and source recognition: A discrete-state approach. Psychonomic Bulletin & Review, 17(4), 465–478. https://doi.org/10.3758/PBR.17.4.465
 Klauer, K.C., & Kellen, D. (2011). The flexibility of models of recognition memory: An analysis by the minimum-description length principle. Journal of Mathematical Psychology, 55(6), 430–450. https://doi.org/10.1016/j.jmp.2011.09.002
 Klauer, K.C., & Kellen, D. (2011). Assessing the belief bias effect with ROCs: Reply to Dube, Rotello, and Heit (2010). Psychological Review, 118(1), 164–173. https://doi.org/10.1037/a0020698
 Klauer, K.C., & Kellen, D. (2015). The flexibility of models of recognition memory: The case of confidence ratings. Journal of Mathematical Psychology, 67, 8–25. https://doi.org/10.1016/j.jmp.2015.05.002
 Klauer, K.C., & Kellen, D. (2018). RT-MPTs: Process models for response-time distributions based on multinomial processing trees with applications to recognition memory. Journal of Mathematical Psychology, 82, 111–130. https://doi.org/10.1016/j.jmp.2017.12.003
 Klauer, K.C., Musch, J., & Naumer, B. (2000). On belief bias in syllogistic reasoning. Psychological Review, 107(4), 852–884. https://doi.org/10.1037//0033295X.107.4.852
 Klauer, K.C., Stahl, C., & Erdfelder, E. (2007). The abstract selection task: New data and an almost comprehensive model. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33(4), 680–703. https://doi.org/10.1037/02787393.33.4.680
 Klugkist, I., & Hoijtink, H. (2007). The Bayes factor for inequality and about equality constrained models. Computational Statistics & Data Analysis, 51(12), 6367–6379. https://doi.org/10.1016/j.csda.2007.01.024
 Krauth, J. (1982). Formulation and experimental verification of models in propositional reasoning. The Quarterly Journal of Experimental Psychology, 34(2), 285–298. https://doi.org/10.1080/14640748208400842
 Kruschke, J.K. (2015). Doing Bayesian data analysis: A tutorial introduction with R, JAGS and Stan. London: Academic Press.
 Kunda, Z. (1990). The case for motivated reasoning. Psychological Bulletin, 108(3), 480–498. https://doi.org/10.1037/00332909.108.3.480
 Lee, M.D., & Wagenmakers, E.J. (2013). Bayesian cognitive modeling: A practical course. Cambridge: Cambridge University Press.
 Lewandowski, D., Kurowicka, D., & Joe, H. (2009). Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis, 100(9), 1989–2001. https://doi.org/10.1016/j.jmva.2009.04.008
 Little, R. J. A., & Rubin, D. B. (1997). Statistical analysis with missing data, 2nd Edn. New York: Wiley.
 Lord, C.G., Ross, L., & Lepper, M.R. (1979). Biased assimilation and attitude polarization: The effects of prior theories on subsequently considered evidence. Journal of Personality and Social Psychology, 37(11), 2098–2109. https://doi.org/10.1037/00223514.37.11.2098
 Macmillan, N.A., & Creelman, C.D. (2005). Detection theory: A user’s guide. New York: Lawrence Erlbaum Associates.
 Malmberg, K.J. (2002). On the form of ROCs constructed from confidence ratings. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28(2), 380–387. https://doi.org/10.1037/02787393.28.2.380
 Malmberg, K.J., & Xu, J. (2006). The influence of averaging and noisy decision strategies on the recognition memory ROC. Psychonomic Bulletin & Review, 13(1), 99–105. https://doi.org/10.3758/BF03193819
 Markovits, H., & Nantel, G. (1989). The belief-bias effect in the production and evaluation of logical conclusions. Memory & Cognition, 17(1), 11–17. https://doi.org/10.3758/BF03199552
 Miller, M.B., Van Horn, J. D., Wolford, G. L., Handy, T. C., Valsangkar-Smyth, M., Inati, S., ..., Gazzaniga, M. S. (2002). Extensive individual differences in brain activations associated with episodic retrieval are reliable over time. Journal of Cognitive Neuroscience, 14(8), 1200–1214. https://doi.org/10.1162/089892902760807203
 Monnahan, C.C., Thorson, J.T., & Branch, T.A. (2016). Faster estimation of Bayesian models in ecology using Hamiltonian Monte Carlo. Methods in Ecology and Evolution. https://doi.org/10.1111/2041210X.12681
 Moran, R. (2016). Thou shalt identify! The identifiability of two high-threshold models in confidence-rating recognition (and super-recognition) paradigms. Journal of Mathematical Psychology, 73, 1–11. https://doi.org/10.1016/j.jmp.2016.03.002
 Morey, R.D., Pratte, M.S., & Rouder, J.N. (2008). Problematic effects of aggregation in zROC analysis and a hierarchical modeling solution. Journal of Mathematical Psychology, 52(6), 376–388. https://doi.org/10.1016/j.jmp.2008.02.001
 Morley, N.J., Evans, J.S.B.T., & Handley, S.J. (2004). Belief bias and figural bias in syllogistic reasoning. The Quarterly Journal of Experimental Psychology Section A, 57(4), 666–692. https://doi.org/10.1080/02724980343000440
 Newell, A., Rosenbloom, P.S., & Anderson, J.R. (1981). Mechanisms of skill acquisition and the law of practice. In Cognitive skills and their acquisition (pp. 1–55). Hillsdale, NJ: Erlbaum.
 Newstead, S.E., Pollard, P., Evans, J. S. B. T., & Allen, J. L. (1992). The source of belief bias effects in syllogistic reasoning. Cognition, 45(3), 257–284. https://doi.org/10.1016/00100277(92)90019E
 Nickerson, R.S. (1998). Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2(2), 175–220. https://doi.org/10.1037/10892680.2.2.175
 Nuobaraite, S. (2013). The role of ego-depletion on motivated reasoning. Bachelor’s thesis, Plymouth University, UK.
 Oakhill, J., Johnson-Laird, P.N., & Garnham, A. (1989). Believability and syllogistic reasoning. Cognition, 31(2), 117–140. https://doi.org/10.1016/00100277(89)900206
 Oakhill, J., & Johnson-Laird, P.N. (1985). The effects of belief on the spontaneous production of syllogistic conclusions. The Quarterly Journal of Experimental Psychology Section A, 37(4), 553–569. https://doi.org/10.1080/14640748508400919
 Oaksford, M., & Chater, N. (2007). Bayesian rationality: The probabilistic approach to human reasoning. Oxford: Oxford University Press.
 Oberauer, K. (2006). Reasoning with conditionals: A test of formal models of four theories. Cognitive Psychology, 53(3), 238–283. https://doi.org/10.1016/j.cogpsych.2006.04.001
 Oberauer, K., Weidenfeld, A., & Hörnig, R. (2006). Working memory capacity and the construction of spatial mental models in comprehension and deductive reasoning. The Quarterly Journal of Experimental Psychology, 59(2), 426–447. https://doi.org/10.1080/17470210500151717
 Osth, A.F., & Dennis, S. (2015). Sources of interference in item and associative recognition memory. Psychological Review. https://doi.org/10.1037/a0038692
 Pazzaglia, A., Dube, C., & Rotello, C. (2013). A critical comparison of discrete-state and continuous models of recognition memory: Implications for recognition and beyond. Psychological Bulletin, 139, 1173–1203. https://doi.org/10.1037/a0033044
 Pennycook, G., Fugelsang, J.A., & Koehler, D.J. (2015). Everyday consequences of analytic thinking. Current Directions in Psychological Science. https://doi.org/10.1177/0963721415604610
 Pennycook, G., Cheyne, J. A., Koehler, D. J., & Fugelsang, J. A. (2016). Is the cognitive reflection test a measure of both reflection and intuition? Behavior Research Methods, 48(1), 341–348. https://doi.org/10.3758/s1342801505761
 Polk, T.A., & Newell, A. (1995). Deduction as verbal reasoning. Psychological Review, 102(3), 533–566. https://doi.org/10.1037/0033295X.102.3.533
 Pratte, M.S., & Rouder, J.N. (2011). Hierarchical single- and dual-process models of recognition memory. Journal of Mathematical Psychology, 55(1), 36–46. Special Issue on Hierarchical Bayesian Models. https://doi.org/10.1016/j.jmp.2010.08.007
 Pratte, M.S., Rouder, J.N., & Morey, R.D. (2010). Separating mnemonic process from participant and item effects in the assessment of ROC asymmetries. Journal of Experimental Psychology: Learning, Memory, and Cognition, 36(1), 224–232. https://doi.org/10.1037/a0017682
 Quayle, J.D., & Ball, L.J. (2000). Working memory, metacognitive uncertainty, and belief bias in syllogistic reasoning. The Quarterly Journal of Experimental Psychology Section A, 53(4), 1202–1223. https://doi.org/10.1080/713755945
 Ratcliff, R., & Rouder, J.N. (1998). Modeling response times for two-choice decisions. Psychological Science, 9(5), 347–356. https://doi.org/10.1111/14679280.00067
 Regenwetter, M., Dana, J., & Davis-Stober, C.P. (2011). Transitivity of preferences. Psychological Review, 118(1), 42–56. https://doi.org/10.1037/a0021150
 Rijmen, F., Tuerlinckx, F., De Boeck, P., & Kuppens, P. (2003). A nonlinear mixed model framework for item response theory. Psychological Methods, 8(2), 185–205. https://doi.org/10.1037/1082989X.8.2.185
 Robert, C., & Casella, G. (2009). Introducing Monte Carlo methods with R. Springer Science & Business Media.
 Roberts, M.J., & Sykes, E.D.A. (2003). Belief bias and relational reasoning. The Quarterly Journal of Experimental Psychology Section A, 56(1), 131–154. https://doi.org/10.1080/02724980244000233
 Roser, M.E., Evans, J. S. B. T., McNair, N. A., Fuggetta, G., Handley, S. J., Carroll, L. S., & Trippas, D. (2015). Investigating reasoning with multiple integrated neuroscientific methods. Frontiers in Human Neuroscience, 9. https://doi.org/10.3389/fnhum.2015.00041
 Rotello, C.M., Heit, E., & Dubé, C. (2015). When more data steer us wrong: Replications with the wrong dependent measure perpetuate erroneous conclusions. Psychonomic Bulletin & Review, 22(4), 944–954. https://doi.org/10.3758/s1342301407592
 Rottman, B.M., & Hastie, R. (2016). Do people reason rationally about causally related events? Markov violations, weak inferences, and failures of explaining away. Cognitive Psychology, 87, 88–134. https://doi.org/10.1016/j.cogpsych.2016.05.002
 Rouder, J.N., & Lu, J. (2005). An introduction to Bayesian hierarchical models with an application in the theory of signal detection. Psychonomic Bulletin & Review, 12(4), 573–604. https://doi.org/10.3758/BF03196750
 Rouder, J.N., Lu, J., Morey, R. D., Sun, D., & Speckman, P. L. (2008). A hierarchical process-dissociation model. Journal of Experimental Psychology: General, 137(2), 370–389. https://doi.org/10.1037/00963445.137.2.370
 Schafer, J. L. (1997). Analysis of incomplete multivariate data. New York: Chapman and Hall.
 Scheibehenne, B., & Pachur, T. (2015). Using Bayesian hierarchical parameter estimation to assess the generalizability of cognitive models of choice. Psychonomic Bulletin & Review, 22(2), 391–407. https://doi.org/10.3758/s1342301406844
 Schielzeth, H., & Forstmeier, W. (2009). Conclusions beyond support: Overconfident estimates in mixed models. Behavioral Ecology, 20(2), 416–420. https://doi.org/10.1093/beheco/arn145
 Schyns, P.G., & Oliva, A. (1999). Dr. Angry and Mr. Smile: When categorization flexibly modifies the perception of faces in rapid visual presentations. Cognition, 69(3), 243–265. https://doi.org/10.1016/S00100277(98)000699
 Shadish, W.R., Cook, T.D., & Campbell, D.T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
 Shynkaruk, J.M., & Thompson, V.A. (2006). Confidence and accuracy in deductive reasoning. Memory & Cognition, 34(3), 619–632. https://doi.org/10.3758/BF03193584
 Simpson, A.J., & Fitter, M.J. (1973). What is the best index of detectability? Psychological Bulletin, 80(6), 481–488. https://doi.org/10.1037/h0035203
 Singmann, H., & Kellen, D. (2013). MPTinR: Analysis of multinomial processing tree models in R. Behavior Research Methods, 45(2), 560–575. https://doi.org/10.3758/s1342801202590
 Singmann, H., & Kellen, D. (2014). Concerns with the SDT approach to causal conditional reasoning: A comment on Trippas, Handley, Verde, Roser, McNair, and Evans (2014). Frontiers in Psychology, 5, 402. https://doi.org/10.3389/fpsyg.2014.00402
 Singmann, H., Klauer, K.C., & Beller, S. (2016). Probabilistic conditional reasoning: Disentangling form and content with the dual-source model. Cognitive Psychology, 88, 61–87. https://doi.org/10.1016/j.cogpsych.2016.06.005
 Singmann, H., Klauer, K.C., & Over, D.E. (2014). New normative standards of conditional reasoning and the dual-source model. Frontiers in Psychology, 5, 316. https://doi.org/10.3389/fpsyg.2014.00316
 Skovgaard-Olsen, N., Singmann, H., & Klauer, K.C. (2016). The relevance effect and conditionals. Cognition, 150, 2–36. https://doi.org/10.1016/j.cognition.2015.12.017
 Skyrms, B. (2000). Choice and chance: An introduction to inductive logic. OCLC: 898995532. Belmont, CA: Wadsworth.
 Smith, J.B., & Batchelder, W.H. (2008). Assessing individual differences in categorical data. Psychonomic Bulletin & Review, 15(4), 713–731. https://doi.org/10.3758/PBR.15.4.713
 Snijders, T. A. B., & Bosker, R. J. (2012). Multilevel analysis: An introduction to basic and advanced multilevel modeling. Los Angeles: SAGE.
 Stan Development Team (2016). Stan modeling language: User’s guide and reference manual. Version 2.14.0.
 Stanovich, K.E. (1999). Who is rational? Studies of individual differences in reasoning. Mahwah: Lawrence Erlbaum Associates.
 Stanovich, K.E., West, R. F., & Toplak, M. E. (2016). The rationality quotient: Toward a test of rational thinking. Cambridge: MIT Press.
 Starns, J.J., Ratcliff, R., & McKoon, G. (2012). Evaluating the unequal-variance and dual-process explanations of zROC slopes with response time data and the diffusion model. Cognitive Psychology, 64(1–2), 1–34. https://doi.org/10.1016/j.cogpsych.2011.10.002
 Störring, G. (1908). Experimentelle Untersuchungen über einfache Schlussprozesse. Archiv für die gesamte Psychologie, 11, 1–27.
 Stupple, E.J.N., & Ball, L.J. (2008). Belief-logic conflict resolution in syllogistic reasoning: Inspection-time evidence for a parallel-process model. Thinking & Reasoning, 14(2), 168–181. https://doi.org/10.1080/13546780701739782
 Stupple, E.J.N., Ball, L. J., Evans, J. S. B. T., & Kamal-Smith, E. (2011). When logic and belief collide: Individual differences in reasoning times support a selective processing model. Journal of Cognitive Psychology, 23(8), 931–941. https://doi.org/10.1080/20445911.2011.589381
 Thompson, V.A. (2000). The task-specific nature of domain-general reasoning. Cognition, 76, 209–268. https://doi.org/10.1016/S00100277(00)000822
 Thompson, V.A., Turner, J.A.P., & Pennycook, G. (2011). Intuition, reason, and metacognition. Cognitive Psychology, 63(3), 107–140. https://doi.org/10.1016/j.cogpsych.2011.06.001
 Thompson, V.A., Striemer, C. L., Reikoff, R., Gunter, R. W., & Campbell, J. I. D. (2003). Syllogistic reasoning time: Disconfirmation disconfirmed. Psychonomic Bulletin & Review, 10(1), 184–189. https://doi.org/10.3758/BF03196483
 Toplak, M.E., West, R.F., & Stanovich, K.E. (2011). The Cognitive Reflection Test as a predictor of performance on heuristics-and-biases tasks. Memory & Cognition, 39(7), 1275. https://doi.org/10.3758/s1342101101041
 Tourangeau, R., Rips, L. J., & Rasinski, K. (2000). The psychology of survey response. Cambridge: Cambridge University Press.
 Trippas, D. (2013). Motivated reasoning and response bias: A signal detection approach. Doctoral dissertation. https://pearl.plymouth.ac.uk//handle/10026.1/2853 (visited on 12/21/2016).
 Trippas, D., Handley, S.J., & Verde, M.F. (2013). The SDT model of belief bias: Complexity, time, and cognitive ability mediate the effects of believability. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39(5), 1393–1402. https://doi.org/10.1037/a0032398
 Trippas, D., Thompson, V.A., & Handley, S.J. (2017). When fast logic meets slow belief: Evidence for a parallel-processing model of belief bias. Memory & Cognition, 45, 539–552.
 Trippas, D., Verde, M.F., & Handley, S.J. (2014). Using forced choice to test belief bias in syllogistic reasoning. Cognition, 133(3), 586–600. https://doi.org/10.1016/j.cognition.2014.08.009
 Trippas, D., Pennycook, G., Verde, M. F., & Handley, S. J. (2015). Better but still biased: Analytic cognitive style and belief bias. Thinking & Reasoning, 1–15. https://doi.org/10.1080/13546783.2015.1016450
 Van Zandt, T. (2000). ROC curves and confidence judgments in recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26, 582–600. https://doi.org/10.1037/02787393.26.3.582
 Vandekerckhove, J., Matzke, D., & Wagenmakers, E.J. (2015). Model comparison and the principle of parsimony. In J.R. Busemeyer (Ed.) Oxford handbook of computational and mathematical psychology (pp. 300–319). Oxford: Oxford University Press.
 Vandekerckhove, J., Tuerlinckx, F., & Lee, M.D. (2011). Hierarchical diffusion models for two-choice response times. Psychological Methods, 16(1), 44–62. https://doi.org/10.1037/a0021765
 Verde, M.F., Macmillan, N.A., & Rotello, C.M. (2006). Measures of sensitivity based on a single hit rate and false alarm rate: The accuracy, precision, and robustness of d’, Az, and A’. Perception & Psychophysics, 68(4), 643–654. https://doi.org/10.3758/BF03208765
 Wagenmakers, E.J., Krypotos, A.M., Criss, A. H., & Iverson, G. (2012). On the interpretation of removable interactions: A survey of the field 33 years after Loftus. Memory & Cognition, 40(2), 145–160. https://doi.org/10.3758/s1342101101580
 Wason, P.C. (1960). On the failure to eliminate hypotheses in a conceptual task. Quarterly Journal of Experimental Psychology, 12(3), 129–140. https://doi.org/10.1080/17470216008416717
 Wason, P.C. (1966). Reasoning. In B. M. Foss (Ed.) New horizons in psychology (Vol. 1, pp. 135–151). Harmondsworth, England: Penguin.
 Wason, P.C. (1968). Reasoning about a rule. Quarterly Journal of Experimental Psychology, 20(3), 273–281. https://doi.org/10.1080/14640746808400161
 Wason, P.C., & Evans, J.S.B.T. (1974). Dual processes in reasoning? Cognition, 3(2), 141–154. https://doi.org/10.1016/00100277(74)900171, http://www.sciencedirect.com/science/article/pii/0010027774900171 (visited on 01/06/2017).
 Whitehead, A. (2003). Meta-analysis of controlled clinical trials. OCLC: 255233509. Chichester: Wiley.
 Wickens, T.D., & Hirshman, E. (2000). False memories and statistical design theory: Comment on Miller and Wolford (1999) and Roediger and McDermott (1999). Psychological Review, 107(2), 377–383. https://doi.org/10.1037/0033295X.107.2.377
 Wilkins, M.C. (1929). The effect of changed material on ability to do formal syllogistic reasoning. Archives of Psychology, 102, 83.
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.