Transitive reasoning distorts induction in causal chains
Abstract
A probabilistic causal chain A→B→C may intuitively appear to be transitive: If A probabilistically causes B, and B probabilistically causes C, then A probabilistically causes C. However, probabilistic causal relations can only be guaranteed to be transitive if the so-called Markov condition holds. In two experiments, we examined how people make probabilistic judgments about indirect relationships A→C in causal chains A→B→C that violate the Markov condition. We hypothesized that participants would make transitive inferences in accordance with the Markov condition although they were presented with counterevidence showing intransitive data. For instance, participants were successively presented with data entailing positive dependencies A→B and B→C. At the same time, the data entailed that A and C were statistically independent. The results of two experiments show that transitive reasoning via a mediating event B influenced and distorted the induction of the indirect relation between A and C. Participants’ judgments were affected by an interaction of transitive, causal-model-based inferences and the observed data. Our findings support the idea that people tend to chain individual causal relations into mental causal chains that obey the Markov condition and thus allow for transitive reasoning, even if the observed data entail that such inferences are not warranted.
Keywords
Transitivity; Causality; Markov condition; Mixing of causal relationships; Probabilistic reasoning; Causal induction; Knowledge-based induction; Causal learning; Causal coherence; Transitive distortion effects; Categorization
Transitive reasoning enables judgments about unobserved relationships based on indirect evidence. If one observes that object A is heavier than object B, and that B is heavier than C, one can infer that A is heavier than C. Not all relations, however, are transitive. If A is the mother of B, and B is the mother of C, this does not mean that A is the mother of C.
We investigate whether and to what extent people reason transitively about causal relations, even when the conditions for transitive inferences do not hold true. We focus on probabilistic causal chains of the type A→B→C, where individual relations A→B and B→C can be combined to form a chain A→B→C to make probabilistic inferences from the chain’s initial event A to the terminal event C. First, we specify the conditions under which transitive reasoning in causal chains is valid. We then report the findings of two experiments investigating whether people make transitive inferences even when the available data entail that such inferences are not warranted.
Our research builds on the idea that people represent the world in terms of mental causal models (Sloman, 2005; Waldmann, 1996; Waldmann, Cheng, Hagmayer, & Blaisdell, 2008; Waldmann & Hagmayer, 2001) that share key characteristics with causal Bayes nets, which originated in philosophy and machine learning (Pearl, 2000; Spirtes, Glymour, & Scheines, 1993). The key question of the present research is whether judgments about indirect relations in causal chains are influenced by transitive reasoning even when the data show that the chain is intransitive.
Transitive reasoning in causal chains
P(elevated transmitter | knockout mice) > P(elevated transmitter | normal mice)
P(anxiety | elevated transmitter) > P(anxiety | no elevated transmitter)
P(anxiety | knockout mice) > P(anxiety | normal mice)
Note, however, that neither study has directly assessed this indirect relation. Rather, the two direct relations are integrated into a causal chain that guides the inference about the indirect relation.
We refer to such inferences about indirectly related events in causal chains as transitive inferences. Formally, probabilistic transitive inferences in chains are valid if the Markov condition holds (Bonnefon, Da Silva Neves, Dubois, & Prade, 2012). If the Markov condition does not hold, probabilities inferred through transitive inferences via Eq. 2 may deviate from the actual relations in the data.
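The transitive inference described above can be illustrated with the law of total probability. The sketch below assumes the standard form for a chain that satisfies the Markov condition, P(C|A) = P(C|B)·P(B|A) + P(C|¬B)·P(¬B|A). Eq. 2 is not reproduced in this section, so equating it with this formula is an assumption; the function name and the example parameters are likewise hypothetical.

```python
def transitive_p_c_given_a(p_b_given_a, p_c_given_b, p_c_given_not_b):
    """Infer P(C|A) through the mediator B.

    Valid only if C is independent of A given B (Markov condition);
    otherwise the result may deviate from the observed P(C|A).
    """
    return p_c_given_b * p_b_given_a + p_c_given_not_b * (1.0 - p_b_given_a)


# Hypothetical parameters for illustration only (not taken from the studies).
p_c_given_a = transitive_p_c_given_a(0.8, 0.9, 0.2)      # B is likely given A
p_c_given_not_a = transitive_p_c_given_a(0.3, 0.9, 0.2)  # B is unlikely given not-A
```

With these illustrative values the inferred P(C|A) exceeds the inferred P(C|¬A), which is the pattern a transitive reasoner would expect whenever both local relations are positive.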
Markov violations and categorybased transitive inference
The Markov condition enables transitive causal inferences. Its normative and descriptive status, however, is highly disputed. Some advocates of Bayes nets have defended the Markov condition as being a universal characteristic of causal relations in the world or of their representations (Hausman & Woodward, 1999, 2004; Spohn, 2001). Others have criticized the condition’s ontological or epistemological necessity (Arntzenius, 2005; Cartwright, 2001, 2002, 2006, 2007; Sober, 1987, 2001; Sober & Steel, 2012; Steel, 2006). However, even advocates of a universal Markov condition concede that the latter can be violated psychologically when (i) the event categories used are inadequate, (ii) there is a mismatch between causal representations and the true causal structure, or (iii) hidden external variables are correlated (Hausman & Woodward, 1999, 2004; Spohn, 2001). In short, both advocates and critics agree that the Markov condition can be violated in practice when descriptions of the world are incomplete or inadequate.
Whereas the data indicate a negative relation between gene and behavior, a different conclusion results from transitive reasoning. Inferring the probability P(C|A) from the parameters of the chain’s direct relations via Eq. 2 yields erroneous estimates of P(C|A) = .56 and P(C|¬A) = .44, indicating that knockout mice are more likely to show high anxiety levels than normal mice. This discrepancy results from the causal chain being intransitive due to a violation of the Markov condition. Inferring P(C|A) via Eq. 2 is normally only valid if A and C are independent conditional on B, which is not the case, as P(C|B∧A) = .5, but P(C|B∧¬A) = 1.
In this example the Markov condition is violated at the category level due to mixing heterogeneous items (or subclasses) with varying deterministic relationships (Cartwright, 2001; Hausman & Woodward, 1999; Spirtes et al., 1993). For instance, for a mouse (here symbolized by a circle) in the top left corner of Fig. 1, the relations A→B→C and A→C hold. Conversely, for the mouse in the bottom left corner, A→¬B→¬C and A→¬C hold. Thus, on the item level there is no discrepancy between the direct relations A→B and B→C and the indirect relation A→C. Causal relations, however, are typically defined at the level of categories, for example, mice that have gene A versus mice that do not. On this level, however, the causal chain A→B→C is intransitive. The key question of our studies is whether people make transitive inferences on the category level even when the observed causal chain is intransitive.
Previous research on transitive inferences in causal chains
Previous research on causal chains has shown that people make transitive inferences from A to C when observing only relations A→B and B→C (Ahn & Dennis, 2000; Baetu & Baker, 2009). Ahn and Dennis (2000) used a sequential learning paradigm, providing participants with evidence on the covariation of events A and B (fertilizer→level of chemicals in soil) intermixed with evidence about events B and C (chemicals→blooming of flower). They additionally studied trial-order effects by varying whether positive or negative evidence for local contingencies (between A and B, and between B and C) was presented first. Participants received no data on the relation between fertilization (A) and blooming (C) but were asked to judge this indirect relation. Average causal judgments were positive, with higher judgments in the positive-evidence-first condition. A primacy effect when constructing the local relations in conjunction with transitive reasoning may explain this difference.
Baetu and Baker (2009) investigated the influence of transitive reasoning with chains in more detail. In their studies, positive, negative, and zero contingencies for A→B and B→C were combined. Participants learned about the contingency between two lights A and B (while light C was covered), and between lights B and C (while A was covered); trials occurred intermixed. Subsequently, participants first rated the global A→C relationship and then the local relationships A→B and B→C. Baetu and Baker used ∆P = P(effect | cause) − P(effect | ¬cause) as a measure of causal strength. If the causal Markov condition holds, ∆P_{A→C} = ∆P_{A→B} × ∆P_{B→C} (Baetu & Baker, 2009, Appendix). Although participants never observed the contingency A→C, their judgments were consistent with a multiplication of the individual contingencies, indicating transitive reasoning.
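The multiplication rule for ∆P can be checked numerically. The following sketch computes ∆P for the indirect relation by chaining the local conditional probabilities under the Markov condition and compares it with the product of the local ∆P values. All function names and parameter values are hypothetical and serve illustration only.

```python
def delta_p(p_effect_given_cause, p_effect_given_not_cause):
    """Contingency measure: P(effect | cause) - P(effect | not cause)."""
    return p_effect_given_cause - p_effect_given_not_cause


def chained_delta_p(p_b_a, p_b_not_a, p_c_b, p_c_not_b):
    """Delta-P for A->C when C depends on A only through B (Markov condition)."""
    p_c_a = p_c_b * p_b_a + p_c_not_b * (1.0 - p_b_a)
    p_c_not_a = p_c_b * p_b_not_a + p_c_not_b * (1.0 - p_b_not_a)
    return delta_p(p_c_a, p_c_not_a)


# Hypothetical local contingencies.
p_b_a, p_b_not_a = 0.9, 0.3   # Delta-P(A->B) = 0.6
p_c_b, p_c_not_b = 0.7, 0.2   # Delta-P(B->C) = 0.5

indirect = chained_delta_p(p_b_a, p_b_not_a, p_c_b, p_c_not_b)
product = delta_p(p_b_a, p_b_not_a) * delta_p(p_c_b, p_c_not_b)
```

Both quantities agree, because the difference P(C|A) − P(C|¬A) factors into [P(C|B) − P(C|¬B)] × [P(B|A) − P(B|¬A)] whenever the chain rule applies.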
Goals and scope
The studies of Ahn and Dennis (2000) and Baetu and Baker (2009) indicate that in the absence of direct evidence on the relation A→C, people make transitive inferences as if they are inducing causal chains that obey the Markov condition. In these studies, no contradictory data were available, so assuming the Markov condition seems a reasonable default for learners.
Our research goes beyond these studies by investigating reasoning with intransitive chains in which the Markov condition is violated. Intransitive chains provide a stronger test for the hypothesis that people integrate the links into causal models and tend to reason in a causally coherent way, deviating from the data, as if the Markov condition held (causal coherence hypothesis). Also, we did not use a sequential learning task (Ahn & Dennis, 2000; Baetu & Baker, 2009; cf. Hebbelmann & von Sydow, 2014; von Sydow, Hagmayer, Meder, & Waldmann, 2010) but presented all items in an overview format. This type of format allows participants to detect that the Markov condition does not hold on the category level, because subclasses of items with different contingencies are mixed. In addition to eliciting probability judgments on the category level, we obtained judgments about individual items to investigate the relationship between category-based and data-driven inferences.
The participants’ task was to judge the conditional probability of C given A, P(C|A), after being presented with the data.^{2} Our key question was whether people would recognize the independence of A and C based on the observed data, or whether they would induce a Markov-coherent causal chain and transitively infer a positive relation. If they based their inferences solely on the available data, participants should judge that A and C are independent, P(C|A) = P(C|¬A) = .5. In contrast, if people induce a causal chain A→B→C and assume the Markov condition, they should infer that A and C are positively related, with P(C|A) = .625 (see Eq. 2). In Experiment 1, we elicited judgments on the category and item level. In Experiment 2 we compared several intransitive and transitive chains and investigated additional boundary conditions (e.g., judgment order).
Experiments 1a and 1b
In both experiments, participants were sequentially provided with data regarding the relations A→B and B→C, which also allowed them to observe A→C. The data entailed a violation of the Markov condition, rendering A and C statistically independent. Participants first judged P(B|A) and P(C|B) for the individual relations and finally estimated P(C|A). In Experiment 1a, data were removed before participants judged P(C|A). In Experiment 1b, the data remained visible. The question was whether participants would recognize that A and C were independent (e.g., by realizing that the category boundaries were orthogonal), or whether they would derive estimates for P(C|A) from their causal model representation and reason transitively from A to C, concluding a positive relation. In a control condition participants were presented with only data on A→C. Since the mediating event B was omitted, they were expected to realize the independence of A and C. We also requested judgments for individual items; the goal was to investigate to what extent people’s judgments were sensitive to the varying contingencies on the item level.
Method
Participants and design.
One hundred twenty-eight students from the University of Göttingen participated, in exchange for candy and participation in a lottery where they could win €50. There were sixty-four participants in each experiment (Experiment 1a: 68% female, M_{age} = 24 years; Experiment 1b: 45% female, M_{age} = 23 years). Participants were randomly assigned to either the chain condition (A→B→C) or the control condition (A→C). In Experiment 1a, one participant who had previously participated in a related study was replaced.
Materials and procedure
Both experiments used the same materials and counterbalancing conditions. The procedure was also identical except for varying whether the data were removed before judging P(C|A) (Experiment 1a) or remained visible (Experiment 1b). Participants were asked to take the role of a developmental biologist investigating three developmental stages of microbes. Specifically, the relations between the kinds of carotene developed by the microbes in three consecutive stages (α-carotene, β-carotene, and γ-carotene) were of interest.
Stimuli consisted of 40 individual “microbes” represented as circles (Fig. 3), varying on the dimensions “grayscale” and “size” (Fig. 2). A pretest showed that participants could accurately distinguish the individual items. Categories A, B, and C were created by rotating the category boundary, resulting in orthogonal categories A and C. To permit three linearly separable categories, some feature combinations were eliminated (Fig. 2).
Eight counterbalancing conditions with identical contingencies were created by rotating the three category boundaries in steps of 45° (Fig. 2). Accordingly, categories A and C had a one-dimensional boundary in four counterbalancing conditions, whereas category B involved a two-dimensional boundary. In the other four conditions, A and C had a two-dimensional and B a one-dimensional category boundary. In all conditions, there was a positive contingency between A and B, and B and C, respectively, whereas A and C were independent: P(C|A) = P(C|¬A) = .5.
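A contingency structure of this kind can be made concrete with a small frequency table. The sketch below uses hypothetical counts for 40 items, chosen so that the local probabilities are P(B|A) = P(C|B) = .75 and P(B|¬A) = P(C|¬B) = .25. These local values are an assumption; they are consistent with the transitive prediction of .625 mentioned earlier, but the actual item distribution of Fig. 2 is not reproduced here. The helper `prob` and the variable names are illustrative.

```python
from fractions import Fraction

# Hypothetical item counts, keyed by membership in (A, B, C); 1 = member.
counts = {
    (1, 1, 1): 10, (1, 1, 0): 5, (1, 0, 1): 0, (1, 0, 0): 5,
    (0, 1, 1): 5,  (0, 1, 0): 0, (0, 0, 1): 5, (0, 0, 0): 10,
}
A, B, C = 0, 1, 2  # positions of the variables in each key


def prob(event, given):
    """Relative frequency P(event | given); both are {position: value} dicts."""
    def matches(item, cond):
        return all(item[i] == v for i, v in cond.items())
    denom = sum(n for item, n in counts.items() if matches(item, given))
    numer = sum(n for item, n in counts.items()
                if matches(item, given) and matches(item, event))
    return Fraction(numer, denom)


p_b_a = prob({B: 1}, {A: 1})
p_c_b = prob({C: 1}, {B: 1})
p_c_not_b = prob({C: 1}, {B: 0})

observed = prob({C: 1}, {A: 1})            # what the data show
observed_not_a = prob({C: 1}, {A: 0})
transitive = p_c_b * p_b_a + p_c_not_b * (1 - p_b_a)  # chain-rule estimate

# Markov condition for the chain: C independent of A given B.
markov_holds = prob({C: 1}, {A: 1, B: 1}) == prob({C: 1}, {A: 0, B: 1})
```

In this illustrative table the observed P(C|A) and P(C|¬A) are both 1/2, whereas the chain-rule estimate is 5/8, and `markov_holds` is false because P(C|A∧B) = 2/3 differs from P(C|¬A∧B) = 1.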
Participants were first presented with data regarding the first and second developmental stages, printed on large panels (Fig. 3). Data for the first stage arranged the 40 microbes according to whether they produced α-carotene (A) or not (¬A). The same microbes in the second stage were arranged according to whether they generated β-carotene (B) or not (¬B). Panels for the first and second stages were displayed simultaneously.
After participants judged P(B|A), the C panel was added, showing the microbes that had produced γ-carotene (C) and those that had not (¬C; Fig. 3). Using the same rating scale, participants judged the conditional probability P(C|B), that is, whether microbes producing β-carotene (B) would or would not produce γ-carotene.
In Experiment 1a, after participants judged P(C|B), all data panels were removed and participants had to judge P(C|A) on the same 11-step rating scale. In Experiment 1b, after participants judged P(C|B), all data panels remained visible when participants judged P(C|A), so that the data showing the independence of A and C were available during the judgment.
The procedure and materials in the control conditions of both experiments were identical, except that participants were shown data only for categories A and C. Accordingly, they provided an estimate of only P(C|A), with panels for A and C being visible during judgment.
After participants completed all probability judgments, the data panels were removed and participants were presented with four individual microbes (presented in one of two random orders). Figure 2 indicates the location of these test items with a white spot. We selected these items since they were at least one step away from the category boundary in all counterbalancing conditions, and in all counterbalancing conditions each item belonged to one of four combinations of categories (A∧C, ¬A∧C, A∧¬C, and ¬A∧¬C). For instance, in Fig. 2 the item with size = 1 and grayscale = 3 is of type A∧C (i.e., the microbe produced α-carotene and γ-carotene). The goal was to investigate whether judgments were influenced by observed contingencies on the item level and/or transitive inferences based on the category level.
For each test item, participants judged the probability that it would generate γ-carotene. To eliminate uncertainty regarding the category membership of A, information was provided for each item on whether it had or had not produced α-carotene. Again, a rating scale of −100 to +100 was used, labeled “This [non] alpha microbe tends not to develop gamma-carotene later” on the left side and “tends to develop gamma-carotene later” on the right side. Different ratings for A and ¬A items yielding the same effect C would indicate an influence of category-based transitive reasoning. Different ratings for items belonging to categories C or ¬C would indicate an influence of the actually observed contingency on the item level.
Results
For the analyses, the eight counterbalancing conditions were recoded to match the data structure depicted in Fig. 2. Our predictions for transitive inferences for the global A→C relation rely on qualitatively correct judgments of the local A→B and B→C relationships. Therefore, following previous research (Baetu & Baker, 2009), we included only participants who correctly judged both local relationships to be positive. In Experiment 1a, 25% of participants failed to meet this criterion in each of the two relations; in Experiment 1b an average of 19% failed in each of the two relations.^{3}
These findings indicate that participants relied on transitive reasoning to judge the relation between A and C. Despite being able to detect that these categories of items were independent when B was omitted, they inferred a positive relation when B was the intermediate event. The lower judgments for P(C|A) in Experiment 1b indicate that making the data available during judgment increased people’s sensitivity to the independence of A and C.
We next analyzed participants’ probability judgments regarding C for the individual test items, which consisted of four items combining membership of categories A and C (i.e., items of type A and C, A and ¬C, ¬A and C, and ¬A and ¬C; see Fig. 2). Theoretically, if judgments are based solely on transitive reasoning on the category level, there should be identical positive judgments for A items around +25, regardless of whether an item belonged to category C or ¬C. The two ¬A items (¬A and C and ¬A and ¬C) should receive negative ratings of around −25. Conversely, if participants’ judgments are driven solely by the observed data, C items should receive a rating of +100, and ¬C items one of −100, independent of whether they are type A or ¬A. Note that the maximal “bottom-up” effect (C vs. ¬C items) is larger than the maximal “top-down” effect (A vs. ¬A items).
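The two sets of predictions can be derived with a few lines of arithmetic. The sketch below assumes a linear mapping from probabilities to the −100 to +100 rating scale, rating = 200·p − 100; this mapping is an assumption, chosen because it reproduces the predicted values of ±25 from the transitive estimates P(C|A) = .625 and P(C|¬A) = .375 (the latter value is implied by the −25 prediction rather than stated explicitly). All names are illustrative.

```python
def to_rating(p):
    """Map a probability in [0, 1] linearly onto the -100..+100 scale (assumed)."""
    return 200.0 * p - 100.0


# Top-down predictions: category-level transitive estimates of P(C | A status).
top_down = {"A": to_rating(0.625), "not A": to_rating(0.375)}

# Bottom-up predictions: the item's observed outcome is known with certainty.
bottom_up = {"C": to_rating(1.0), "not C": to_rating(0.0)}

max_top_down_effect = top_down["A"] - top_down["not A"]
max_bottom_up_effect = bottom_up["C"] - bottom_up["not C"]
```

Under these assumptions the largest possible difference between A and ¬A items is 50 scale points, compared with 200 scale points between C and ¬C items, which is why the two factors are not expected to produce effects of equal size.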
A different pattern of judgments was obtained in the chain conditions. For instance, the two items belonging to category C were judged differently depending on the status of A, with higher judgments for A than for ¬A items. Analogously, items that belonged to category ¬C received higher judgments when belonging to A than when belonging to ¬A. The pattern also reflects an influence of the data, as the judgments for the two A items and the two ¬A items varied as a function of true membership regarding C. Accordingly, in Experiment 1a, a main effect of type C versus ¬C, F(1, 15) = 11.67, p < .01, η_{p}^{2} = .43, a main effect of type A versus ¬A, F(1, 15) = 6.04, p < .05, η_{p}^{2} = .28, and an interaction, F(1, 15) = 9.72, p < .05, η_{p}^{2} = .39, resulted. Thus, in the chain condition participants’ judgments were influenced by both the category-level relations and observations on the item level.
In Experiment 1b, the pattern was qualitatively similar, although less pronounced. The ANOVA for the chain condition yielded a significant effect of items of type C versus type ¬C, F(1, 20) = 14.69, p < .01, η_{p}^{2} = .46, as well as (at least for a one-tailed test of our prediction) an influence of type A versus ¬A, F(1, 20) = 3.80, p_{one-tailed} < .05, η_{p}^{2} = .16, but no interaction, F(1, 20) < 1. The two main effects indicate that judgments on the item level were still influenced by the inferred positive relation between A and C on the category level and the observed data concerning category C versus ¬C. As in Experiment 1a, the effect of the category membership of C was larger than that of category A, consistent with the theoretically maximal effect of these factors.
Percentage (and frequency) of participants in Experiments 1a and 1b judging the relation between A and C to be negative, zero, or positive on a scale of −100 to +100

| Experiment | Condition | Negative | Zero | Positive |
|---|---|---|---|---|
| 1a | Intransitive chain | 0% (0) | 6% (1) | 94% (15) |
| 1a | Control | 28% (9) | 56% (18) | 16% (5) |
| 1b | Intransitive chain | 19% (4) | 33% (7) | 48% (10) |
| 1b | Control | 28% (9) | 56% (18) | 16% (5) |
Discussion
The studies show that participants who made correct judgments regarding the two local relations A→B and B→C judged P(C|A) to be larger than zero, indicating a transitive inference from the chain’s initial event A to the final event C. These judgments strongly differed from those in the control conditions without intermediate variable B, in which judgments around zero were obtained. The size of the effect was modulated by the specific testing conditions, with a stronger influence of the category-based transitive inference when the data were not visible during the judgment. Probability judgments for individual test items were influenced by transitive inferences on the category level and item-specific knowledge. The results support the idea that participants induced a causal chain A→B→C and reasoned transitively from A to C, although the Markov condition was violated and transitivity did not hold in the data.
Experiment 2
The goal of Experiment 2 was to investigate reasoning with intransitive chains in a wider array of circumstances. In addition to the intransitive chain of Experiment 1 (A→B, B→C, A independent of C, henceforth denoted + + 0) we included a new intransitive chain involving two preventive causal relations (A→¬B, ¬B→C, A independent of C, denoted − − 0). The goal was to rule out that positive judgments for the local relations created a response bias toward a positive rating for the A→C relation. Additionally, we used a new control condition that matched the complexity of the intransitive chains. This involved a positive A→B contingency, followed by independent B and C variables, and likewise independent variables A and C (denoted + 0 0).
We also included two transitive chains that obeyed the Markov assumption (A→B, B→C, A→C, denoted + + +; and A→¬B, ¬B→C, A→C, denoted − − +) as a comparison for the respective intransitive chains (+ + 0 and − − 0). This allowed us to investigate whether participants would use the observable evidence in addition to transitive reasoning to judge the indirect relation between A and C. Finally, we included a Markov-coherent chain with a negative overall relation (− + −) to examine whether people correctly learn a negative overall relationship.
Based on the findings of Experiment 1 we expected distortion effects due to transitive inferences in conditions + + 0 and − − 0. However, we expected to find higher ratings in the respective transitive conditions + + + and − − +, due to an additional influence of the observable positive relation between A and C. In the control condition + 0 0, both transitive inferences and the learning data entailed a zero contingency. Therefore we expected participants to correctly detect A’s and C’s independence.
Finally we controlled for question order. In the local–global conditions, similar to Experiment 1, participants rated the individual causal links before judging the conditional probability of C given A. In the global–local conditions, this order was reversed.
Method
Participants
One hundred twenty-four participants (56% female; M_{age} = 23 years), mostly students from the University of Heidelberg, took part in exchange for chocolate and participation in a lottery where they could win €50. Three participants were excluded from the analyses (two clear outliers in the time used and one who gave a rating of −100 in all local judgments).
Design
Materials and procedure
We used the same scenario and materials as in Experiment 1 (see Fig. 3), but in a computer-based experiment. As in Experiment 1b, judgments about P(C|A) were elicited in the presence of the learning data. Since participants were presented with several contingency conditions, we omitted the single-item test.
Participants were randomly assigned to a question order. In the local–global condition, participants were first shown the A/¬A panel and the B/¬B panel and asked to judge P(B|A). Subsequently, the C/¬C panel was added and participants judged P(C|B). Finally, they estimated P(C|A), with the previous judgments and data remaining visible. In the global–local condition, all three panels were presented from the outset and participants first estimated P(C|A). Subsequently, we asked for judgments for P(B|A) and P(C|B). Finally, we included an item-sensitivity test to make sure that participants were able to distinguish between neighboring feature values of size or brightness (see Appendix 1).
Results
All answers were recoded to match the counterbalancing conditions shown in Fig. 7. We used the same selection criterion as before; that is, participants had to judge the two local links qualitatively correctly (cf. Baetu & Baker, 2009), because this is a prerequisite for testing the impact of transitive reasoning in intransitive chains. Judgments for positive local relations had to be in the interval +20 ≤ x ≤ +100 (i.e., one of the five farthest points to the right on the 11-point rating scale); for a negative relation in the interval −100 ≤ x ≤ −20 (i.e., one of the five farthest points to the left); and for the relation with a predicted zero mean in the interval −40 ≤ x ≤ +40 (i.e., one of the five points mid-scale). These intervals are centered on the predicted values for the respective relations: positive = 50, negative = −50, and null = 0.^{6} Selections were made for each condition separately.
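The selection criterion can be written as a simple rule. The sketch below encodes the three intervals stated above; the function names and the string encoding of conditions (e.g., "++" for two positive local links) are hypothetical conveniences and not part of the original procedure.

```python
# Admissible rating intervals per predicted local relation (from the text above).
INTERVALS = {"+": (20, 100), "-": (-100, -20), "0": (-40, 40)}


def judged_correctly(rating, predicted_sign):
    """True if a local rating falls in the interval for its predicted relation."""
    low, high = INTERVALS[predicted_sign]
    return low <= rating <= high


def include_participant(local_ratings, predicted_signs):
    """A participant is retained only if both local links are judged correctly."""
    return all(judged_correctly(r, s)
               for r, s in zip(local_ratings, predicted_signs))


# Illustrative cases: ratings for (A->B, B->C) and the predicted local signs.
kept = include_participant([60, 40], "++")        # both clearly positive
dropped = include_participant([60, 0], "++")      # second link rated as null
kept_null = include_participant([40, -20], "+0")  # control: positive, then null
```

Note that the intervals for positive and null relations overlap between +20 and +40, so a rating in that range satisfies either prediction.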
t tests (one-tailed) for judgments of P(C|A) against zero in Experiment 2

| Condition | P(C|A) M (SE) | df | t | p |
|---|---|---|---|---|
| + + 0 | 16.71 (5.30) | 72 | 3.15 | .002 |
| − − 0 | 11.81 (5.43) | 65 | 2.16 | .033 |
| + 0 0 | −1.97 (4.62) | 60 | −0.43 | .672 |
| + + + | 35.29 (4.27) | 84 | 8.27 | .001 |
| − − + | 31.85 (5.48) | 83 | 5.81 | .001 |
| − + − | −32.81 (4.85) | 63 | −6.77 | .001 |
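As a plausibility check on the table above, a one-sample t statistic against zero equals the mean divided by its standard error. The sketch below recomputes t from the tabulated M and SE values and compares the result with the reported t; it is offered only as an arithmetic illustration, and the variable names are hypothetical.

```python
# (condition, mean rating, standard error, reported t), copied from the table.
rows = [
    ("+ + 0", 16.71, 5.30, 3.15),
    ("- - 0", 11.81, 5.43, 2.16),
    ("+ 0 0", -1.97, 4.62, -0.43),
    ("+ + +", 35.29, 4.27, 8.27),
    ("- - +", 31.85, 5.48, 5.81),
    ("- + -", -32.81, 4.85, -6.77),
]


def t_against_zero(mean, standard_error):
    """One-sample t statistic for the null hypothesis that the mean is zero."""
    return mean / standard_error


recomputed = {name: round(t_against_zero(m, se), 2) for name, m, se, _ in rows}

# Largest absolute gap between recomputed and reported t values.
max_gap = max(abs(t_against_zero(m, se) - t) for _, m, se, t in rows)
```

Small gaps are to be expected, because the tabulated means and standard errors are themselves rounded to two decimals.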
Table 3. Analyses of variance (2 × 2) comparing judgments of P(C|A) in the within-subject contingency conditions while controlling for between-subjects question order

| Comparison | Contingency: F | η_{p}^{2} | p | Question order: F | η_{p}^{2} | p | Contingency × Order: F | η_{p}^{2} | p | N |
|---|---|---|---|---|---|---|---|---|---|---|
| + + 0 vs. + 0 0 | 6.16 | .12 | .017 | 0.12 | .00 | .725 | 0.19 | .00 | .668 | 44 |
| − − 0 vs. + 0 0 | 2.57 | .05 | .116 | 2.42 | .06 | .127 | 0.06 | .00 | .805 | 42 |
| + + 0 vs. + + + | 9.61 | .13 | .003 | 0.04 | .00 | .848 | 0.18 | .01 | .674 | 63 |
| − − 0 vs. − − + | 7.66 | .16 | .008 | 0.14 | .00 | .641 | 0.10 | .00 | .748 | 42 |
| + + 0 vs. − − 0 | 0.05 | .00 | .828 | 0.40 | .00 | .529 | 0.05 | .00 | .768 | 45 |
| + + + vs. − − + | 0.46 | .01 | .500 | 0.22 | .00 | .995 | 1.99 | .04 | .165 | 47 |
To investigate to what extent participants were sensitive to the observed data, ratings in the intransitive conditions (+ + 0 and − − 0) were compared to the corresponding transitive conditions (+ + + and − − +; see Table 3, two middle rows). Judgments were higher in the latter cases when P(C|A) > .5 than when P(C|A) = .5. These results show that judgments in the intransitive conditions were influenced not only by transitive reasoning on the category level, but also by the observed contingencies.
Additionally, we examined potential response biases by comparing the + + 0 condition with the − − 0 condition, and the + + + condition with the − − + condition (see Table 3, bottom two lines). If observing two positive relationships for the direct links or giving two positive judgments creates a tendency to judge the indirect relation positively, too, judgments of P(C|A) should differ between conditions. The analyses show that this was not the case: There was no response bias and no effect of question order (see also Fig. 9).
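Percentage (and frequency) of participants in Experiment 2 judging the relationship between A and C to be negative, zero, or positive, on a scale of −100 to +100

| Contingency condition | Order | Negative | Zero | Positive |
|---|---|---|---|---|
| Generative intransitive chain (+ + 0) | Local–global | 30% (11) | 6% (2) | 64% (23) |
| Generative intransitive chain (+ + 0) | Global–local | 19% (7) | 22% (8) | 59% (22) |
| Generative intransitive chain (+ + 0) | All | 25% (18) | 14% (10) | 62% (45) |
| Preventive intransitive chain (− − 0) | Local–global | 24% (9) | 24% (9) | 53% (20) |
| Preventive intransitive chain (− − 0) | Global–local | 36% (10) | 21% (6) | 42% (12) |
| Preventive intransitive chain (− − 0) | All | 29% (19) | 23% (14) | 48% (32) |
| Neutral control (+ 0 0) | Local–global | 25% (8) | 41% (13) | 34% (11) |
| Neutral control (+ 0 0) | Global–local | 44% (13) | 28% (8) | 28% (8) |
| Neutral control (+ 0 0) | All | 34% (21) | 34% (21) | 31% (19) |
| Generative transitive chain (+ + +) | Local–global | 14% (6) | 7% (3) | 79% (35) |
| Generative transitive chain (+ + +) | Global–local | 10% (4) | 10% (4) | 80% (33) |
| Generative transitive chain (+ + +) | All | 12% (10) | 8% (7) | 80% (68) |
| Preventive transitive chain (− − +) | Local–global | 11% (3) | 7% (2) | 81% (22) |
| Preventive transitive chain (− − +) | Global–local | 22% (6) | 0% (0) | 78% (21) |
| Preventive transitive chain (− − +) | All | 17% (9) | 4% (2) | 80% (43) |
| Mixed transitive chain (− + −) | Local–global | 71% (25) | 11% (4) | 17% (6) |
| Mixed transitive chain (− + −) | Global–local | 86% (25) | 10% (3) | 3% (1) |
| Mixed transitive chain (− + −) | All | 78% (50) | 11% (7) | 11% (7) |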
Finally, the item-sensitivity test confirmed that participants were reasonably able to distinguish between neighboring feature values of the items shown (microbes) and suggested that differences in this ability did not drive the distortion effect (see Appendix 1).
Discussion
In Experiment 2 we replicated and extended the results of Experiment 1, while controlling for alternative explanations. In both intransitive chain conditions (+ + 0 and − − 0) participants’ judgments deviated from the observed data, consistent with the idea that people tend to induce Markov-coherent causal chains and use them to reason transitively from A to C. The fact that participants gave higher ratings when reasoning with transitive chains suggests that judgments were also influenced by the learning data. In the transitive condition the ratings even seem a bit too high, but this may relate to the numberless rating scale in this experiment (see also Rehder & Burnett, 2005).
Our analyses also show that the distortion effect in the intransitive conditions cannot be explained by answer tendencies arising from previous judgments or by prior beliefs about the global relation of A and C. First, the + 0 0 control condition yielded judgments close to zero; second, the intransitive − − 0 and + + 0 conditions yielded similar judgments. Third, judgments in the positive (+ + +, − − +) and negative (− + −) transitive conditions had similar absolute positive or negative values, and fourth, the order in which judgments were elicited was irrelevant.
General discussion
Our goal was to investigate whether transitive inferences in probabilistic causal chains of the type A→B→C distort the induction of the relationship between A and C. We examined cases in which the transitive inference, based on combining the observed local relationships A→B and B→C, entails a positive indirect relationship, whereas the data directly show that A and C are independent. We studied the influence of transitive inference in intransitive chains that violated the Markov condition on the category level because heterogeneous subclasses of items were mixed. Our results show that people made judgments about P(C|A) that systematically deviated from the observed data but were consistent with a transitive inference from A to C based on a mental causal model (illicitly) obeying the Markov assumption.
Experiment 1a demonstrated the influence of inappropriate transitive reasoning when participants learned consecutively about the two individual links and made judgments after the learning data were removed. Experiment 1b showed that this finding was also obtained when all data were available while judging P(C|A), although in this case the judgments were influenced more strongly by the learning data. When the intermediate event B was omitted from the data, participants had no difficulty recognizing that A and C were unrelated. Further analyses showed that judgments on the level of individual items were influenced by transitive inferences on the category level and the item-level relations.
Experiment 2 investigated the robustness of these findings while controlling for alternative explanations, such as task complexity and possible answer tendencies. The findings were similar to those of Experiments 1a and 1b. A direct comparison of intransitive versus transitive chains showed that participants’ judgments on the category level were influenced not only by illicit transitive reasoning but also by the observed data. Results in the different control conditions refute the idea that these distortion effects are due to answer tendencies resulting from previous judgments (cf. atmosphere effects in syllogistic reasoning; Sells, 1936) or to prior beliefs concerning the global relation. Judgments of the indirect relation were correct and independent of question order in Markov-coherent, transitive conditions: positive for generative as well as preventive relations, negative for mixed relations, and zero in the neutral control condition.
The present research goes beyond previous studies (Ahn & Dennis, 2000; Baetu & Baker, 2009) that investigated the influence of transitive reasoning in the absence of data about A and C, so that learners could not assess whether the Markov condition held true. While these studies demonstrated that people made transitive inferences in the absence of data on the relation between A and C, our results suggest they do so even in the presence of counterevidence. Although participants could observe the indirect relation between A and C, judgments were substantially influenced by transitive reasoning.
Should one assume transitivity and the Markov condition when inducing causal structures?
Cartwright (2001, 2002, 2007) criticized the concept of assuming the Markov condition as a universal property of causal relations in the world. Even proponents of a universal assumption of the Markov condition concede that the condition need not hold for inadequate category schemes or incomplete causal structures actually in use (Hausman & Woodward, 1999, 2004; Spohn, 2001). Inspired by these ideas, we investigated the influence of transitive reasoning in intransitive chains that violate the Markov condition.
In our scenarios the Markov condition is violated due to mixing subclasses of items with different contingencies. Aggregating these subclasses into the same category results in a violation of the Markov condition and of transitivity, given the provided categories. However, categories play an indispensable role in causal induction and causal reasoning, as causal relations are typically defined on the category level (Lien & Cheng, 2000; Waldmann & Hagmayer, 2006; Waldmann, Meder, von Sydow, & Hagmayer, 2010; also Hagmayer, Meder, von Sydow, & Waldmann, 2011). Even if one assumes that causal relationships at a more fine-grained level adhere to the Markov condition, there is no guarantee that this is the case for a given category scheme. One rarely knows whether categories are homogeneous, and causal relationships may often involve mixtures of different causal relationships at some lower level or involve hidden variables. Thus it seems plausible for transitive distortion effects to play a substantial role in everyday as well as scientific reasoning.
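The following sketch illustrates how mixing can produce such a pattern. The frequencies are hypothetical and are not the stimuli used in the experiments; the subclass labels and function names are likewise illustrative assumptions. One subclass contains items in which A, B, and C co-occur perfectly; the other contains items in which A and C are perfectly negatively related while B is unrelated to both. In the aggregated category, both local contingencies are positive and the observed contingency between A and C is zero, yet a chain model obeying the Markov condition would imply a positive indirect relation.

```python
from collections import Counter

# Hypothetical frequencies (NOT the experimental stimuli).
# Keys are (A, B, C) with 1 = present and 0 = absent.
subclass_1 = Counter({(1, 1, 1): 10, (0, 0, 0): 10})
subclass_2 = Counter({(1, 1, 0): 5, (1, 0, 0): 5, (0, 1, 1): 5, (0, 0, 1): 5})
mixed = subclass_1 + subclass_2  # aggregated category with 40 items

A, B, C = 0, 1, 2  # positions of the events in each key


def cond_prob(counts, effect, cause, cause_value):
    """Return P(effect = 1 | cause = cause_value) from joint frequencies."""
    given = sum(n for key, n in counts.items() if key[cause] == cause_value)
    both = sum(
        n for key, n in counts.items()
        if key[cause] == cause_value and key[effect] == 1
    )
    return both / given


def delta_p(counts, effect, cause):
    """Contingency: P(effect | cause) - P(effect | not cause)."""
    return cond_prob(counts, effect, cause, 1) - cond_prob(counts, effect, cause, 0)


def markov_p_c_given_a(counts, a_value):
    """P(C | A = a_value) implied by a chain that screens off A from C via B."""
    p_b = cond_prob(counts, B, A, a_value)
    return (cond_prob(counts, C, B, 1) * p_b
            + cond_prob(counts, C, B, 0) * (1 - p_b))


dp_ab = delta_p(mixed, B, A)            # local relation A -> B
dp_bc = delta_p(mixed, C, B)            # local relation B -> C
dp_ac_observed = delta_p(mixed, C, A)   # indirect relation in the data
dp_ac_markov = markov_p_c_given_a(mixed, 1) - markov_p_c_given_a(mixed, 0)

print(dp_ab, dp_bc, dp_ac_observed, dp_ac_markov)
```

In this toy data set, the Markov-based estimate equals the product of the two local contingencies, whereas the observed contingency between A and C is zero.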
Do our findings show that people’s probabilistic inferences are generally flawed and error prone? The results do show that transitive reasoning that assumes an independent integration of causal links can systematically deviate from objective data. Yet every cognitive system needs to make inductive inferences about unobserved relations, and the virtue of the Markov condition is that it enables such inferences (Pearl, 2000; Spirtes et al., 1993). Moreover, the independence assumptions formalized in the Markov condition facilitate a parsimonious representation of relationships between variables (Domingos & Pazzani, 1997). Thus the Markov condition may provide a reasonable default assumption that guides human learning at least initially, even if the assumption does not hold (von Sydow et al., 2010; Jarecki, Meder, & Nelson, 2013). Although the Markov condition does not need to hold for the categories we used, it may provide reasonable guidelines for an ideal construction of causal relationships and categories (Hausman & Woodward, 1999, 2004; but cf. Cartwright, 2007). We focused here on chains where we find this idea convincing. Even though transitive distortion effects show that the independent integration of single links may lead learners astray, they can be taken as support for the idea that people tend to assume that causal chains are transitive. Apart from resolving such situations by differentiating categories into different subclasses (for which we found only weak evidence here), a further way to prevent intransitive chains is that people may already induce categories in a way that allows for transitive reasoning (Hagmayer et al., 2011).
Transitive reasoning in causal chains: boundary conditions and future directions
Our findings suggest several avenues for future research. A key question concerns the boundary conditions for illicit transitive reasoning.
One way to eliminate transitive distortion effects for a chain with two nonzero local relations and a zero global contingency may be to highlight a possible direct relation between A and C. Although the temporal order of events in our experiments constrained the set of plausible causal models (Lagnado & Sloman, 2006), such constraints do not exclude a chain with an additional direct link between A and C, as opposed to a chain in which these variables are connected only indirectly via the mediating event B. Our results suggest that people tend to induce a parsimonious chain model without an additional link. Future research should investigate whether better calibrated judgments are obtained if alternative causal structures are explicitly pointed out.
The obtained distortion effects might have been caused by a focus on causal relations, and might have been attenuated by a focus on the involved categories. In fact, research on causal-based category induction (Lien & Cheng, 2000; Waldmann & Hagmayer, 2006; Waldmann et al., 2010) suggests that people can use causal information to induce categories. The results of Hagmayer et al. (2011) suggest that people tend to continue to use categories from earlier nodes in a chain. They investigated the transfer of category schemes when learning causal chains A→B→C, where the dichotomous events A and C were precategorized but the intermediate event B consisted of uncategorized exemplars. They showed that the categorization of B based on A was subsequently used for the second causal relation, even if not optimal. In our task all three events were precategorized, and no transfer of categories occurred; otherwise, participants would have realized that A and C were orthogonal. Nonetheless, tasks that focus more strongly on categories than on relationships between categories might reduce transitive distortion effects.
Similarly, an emphasis on different subclasses of items with different contingencies (mixing) might reduce transitive distortion effects. Although we used only a two-dimensional item space and (for the intransitive chains) deterministic relationships on the subclass level, a stronger emphasis on the existence of subclasses, communication of several different causes for categories (Bonnefon et al., 2012), or an even simpler item space (see Fig. 1) might reduce distortion effects.
The semantics and pragmatics of scenarios and judgments appear to provide an additional important dimension relevant to issues of causal intransitivity. For example, being hungry (A) causes one to eat (B), which in turn causes one to feel full (C). Here, a transitive inference would suggest, counterintuitively, that being hungry first causes one to feel full. In fact, research on verbally communicated causal relations suggests that people do regard some causal chains as intransitive (Bonnefon et al., 2012; Mayrhofer, Hildenbrand, & Waldmann, 2013). Future research should aim to investigate the conditions under which chains are considered to be transitive.
A further important direction for future research concerns the relationships to different models of (causal) learning. While our investigation of intransitive causal chains was motivated by the postulated central role of the Markov condition in causal Bayes nets (Cartwright, 2002, 2006; Hausman & Woodward, 1999, 2004; Spohn, 2001; cf. Mayrhofer & Waldmann, 2015; Rehder & Burnett, 2005), an important question is to what extent associative models of learning could account for our findings. Although associative and causal learning differ with respect to important normative and descriptive issues (Goedert & Spellman, 2005; Waldmann, 1996), there is also some overlap and convergence between associative and probabilistic models of contingency judgment (Chater, 2009; De Houwer & Beckers, 2002; Mitchell, De Houwer, & Lovibond, 2009; Pineño & Miller, 2007). For instance, in line with Marr’s (1982) distinction between computational and algorithmic models, the Rescorla–Wagner model of associative learning (Rescorla & Wagner, 1972) converges under specific circumstances on the probabilistic contrast ΔP (Jenkins & Ward, 1965), a prominent measure of statistical contingency or causal strength (Chapman & Robbins, 1990; Chater, 2009; Cheng, 1997; Danks, 2003; Griffiths & Tenenbaum, 2005). Regarding our transitive distortion effects, associative approaches that model updating of associative strength in a pure bottom-up fashion based on directly observable contingencies between events cannot explain our results. However, associative approaches that additionally model “inferred” associations may be able to account for our results (e.g., Baetu & Baker, 2009). Future research on transitive reasoning should aim to investigate the different models and to characterize the relationships among them.
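The convergence of the Rescorla–Wagner model on ΔP can be illustrated with a small simulation. This is a minimal sketch under stated assumptions rather than a model of our data: a single cue is trained together with an always-present context, one learning rate is used for outcome and no-outcome trials, and the trial frequencies, parameter values, and function name are hypothetical.

```python
import random


def rescorla_wagner(trials, learning_rate=0.002, epochs=1000, seed=0):
    """Train a context and a single cue with the Rescorla-Wagner rule.

    Each trial is (cue_present, outcome) with values 0 or 1. The context is
    present on every trial. Returns the final (v_context, v_cue).
    """
    rng = random.Random(seed)
    trials = list(trials)
    v_context, v_cue = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(trials)  # random trial order within each pass
        for cue_present, outcome in trials:
            prediction = v_context + (v_cue if cue_present else 0.0)
            error = outcome - prediction  # lambda is 1 with outcome, else 0
            v_context += learning_rate * error
            if cue_present:
                v_cue += learning_rate * error
    return v_context, v_cue


# Hypothetical contingency: P(outcome | cue) = .75, P(outcome | no cue) = .25
trials = [(1, 1)] * 15 + [(1, 0)] * 5 + [(0, 1)] * 5 + [(0, 0)] * 15
delta_p = 15 / 20 - 5 / 20

v_context, v_cue = rescorla_wagner(trials)
print(f"cue strength = {v_cue:.2f}, delta P = {delta_p:.2f}")
```

Under these assumptions, the associative strength of the cue settles close to ΔP and the strength of the context settles close to the outcome rate in the absence of the cue. Such a bottom-up learner tracks only the directly observed contingency, which is why it would not by itself produce a transitive distortion.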
Finally, an important question is whether inferential distortion effects are restricted to causal chains. There is some preliminary evidence that they do not generalize to common-effect structures (A→B←C) with similar positive local A→B and B←C contingencies and zero contingencies between A and C (von Sydow et al., 2010). This would be expected from a causal Bayes net perspective. Another question concerns common-cause structures, which play a central role in the philosophical criticism of the Markov condition (Cartwright, 2007; Salmon, 1978; Sober, 1987; cf. Hausman & Woodward, 1999, 2004). According to Bayes nets, causal chains and common-cause structures are “Markov equivalent.” This suggests identical inferential distortion effects for both structures. Empirically, however, the evidence on the direct psychological validity of the Markov condition for common-cause structures is mixed (Rehder & Burnett, 2005; see also Jarecki et al., 2013; Mayrhofer, Goodman, Waldmann, & Tenenbaum, 2008; Mayrhofer & Waldmann, 2015; Rottman & Hastie, 2013; von Sydow, 2011, 2013). Future research should compare reasoning with different causal structures when the data violate the Markov assumption (von Sydow et al., 2010).
Relations to and differences from other research
Although our results are novel in the causal domain, related findings in other fields point in a similar direction. For example, Simpson’s paradox (Simpson, 1951) describes how statistical dependencies can vanish or even be reversed when moving from populations to subpopulations. Some studies (Fiedler, Walther, Freytag, & Nickel, 2003; Waldmann & Hagmayer, 2001) have demonstrated participants’ problems in adequately controlling for a third, confounding variable that reverses the relation between two events. Whereas in our experiments participants integrated individual causal links and thereby misjudged the distal relation, participants in those experiments integrated subpopulations, violating the relation among variables in the overall population.
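A small numerical sketch may help to make the reversal concrete. The counts below are hypothetical and chosen only for illustration; the function and variable names are assumptions as well. Within each subpopulation the contingency between cause and effect is negative, but pooling the subpopulations yields a positive contingency.

```python
def delta_p(effect_with_cause, n_with_cause, effect_without_cause, n_without_cause):
    """Contingency P(effect | cause) - P(effect | no cause) from frequencies."""
    return (effect_with_cause / n_with_cause
            - effect_without_cause / n_without_cause)


# Hypothetical counts per subpopulation:
# (effect and cause, all with cause, effect and no cause, all without cause)
subpopulation_x = (2, 10, 12, 40)
subpopulation_y = (28, 40, 8, 10)

# Pooling adds the frequencies of the two subpopulations cell by cell.
pooled = tuple(x + y for x, y in zip(subpopulation_x, subpopulation_y))

dp_x = delta_p(*subpopulation_x)   # negative, about -0.1
dp_y = delta_p(*subpopulation_y)   # negative, about -0.1
dp_pooled = delta_p(*pooled)       # positive, about 0.2

print(dp_x, dp_y, dp_pooled)
```

The reversal arises because the cause is rare in the subpopulation with a low base rate of the effect and frequent in the subpopulation with a high base rate.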
Other research has shown distortion of zero cue–outcome contingencies, based on high or low base rates of an outcome (Baker, Berbrier, & Vallée-Tourangeau, 1989, Experiment 3; Dickinson, Shanks, & Evenden, 1994; also Buehner, Cheng, & Clifford, 2003). Our results are neutral with regard to such an “outcome density bias,” because we used no skewed outcomes, that is, P(A) = P(B) = P(C) = .5, and, empirically, we did not find distortion effects in the zero-contingency control conditions.
Furthermore, so-called pseudocontingencies have been discussed (Fiedler & Freytag, 2004; Fiedler, Freytag, & Meiser, 2009; Fiedler, Kutzner, & Vogel, 2013; Meiser & Hewstone, 2004; cf. Kutzner, Vogel, Freytag, & Fiedler, 2011), normally referring to illicit inferences about relations between events based on skewed marginal distributions. For instance, when many students in one class watch a lot of television, and many students in the same class show aggressive behavior, one might infer that students who watch a lot of television tend to be aggressive, even if the events are not correlated. Such pseudocontingencies, however, are unlikely to apply in our scenarios, as our distributions (including the ones with zero contingency) were not skewed.
Concluding remarks
Our results contribute to a view that emphasizes the role of top-down or knowledge-based inference processes in induction. It has been argued in different fields, such as perception (Gregory, 1980), memory (Loftus & Hoffman, 1989), and language comprehension (Graesser, Singer, & Trabasso, 1994), that top-down processes favoring broadly coherent representations have a substantial and occasionally distorting impact on induction. Overall, our results corroborate the idea that people derive probability estimates by combining single causal links into complex causal models in a modular way (Waldmann et al., 2008). The present results, however, suggest that people base their probability judgments in causal structures not only on bottom-up data (even if observations are directly available during judgments) but also on transitive inferences based on mental causal models that obey the Markov condition, even if transitivity does not hold in the data.
Likewise, causal strength measures such as ΔP (Allan, 1980; Jenkins & Ward, 1965; cf. White, 2003) or causal power (Cheng, 1997; Griffiths & Tenenbaum, 2005; Meder, Mayrhofer, & Waldmann, 2014) indicate that the individual links are positive, whereas the causal strength for the indirect relation of A and C is negative when marginalizing over B.
Conditional probabilities are simple, uncontroversial measures (Evans & Over, 2004; Oberauer, Weidenfeld, & Fisher, 2007). Measures of causality (ΔP, causal power) yield qualitatively similar predictions for the investigated contingencies.
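As a brief illustration of how these measures relate, the following sketch computes ΔP and causal power (in the sense of Cheng, 1997) from two conditional probabilities. The input values are hypothetical, and the function names are illustrative assumptions.

```python
def delta_p(p_e_given_c, p_e_given_not_c):
    """Probabilistic contrast: P(e | c) - P(e | not c)."""
    return p_e_given_c - p_e_given_not_c


def causal_power(p_e_given_c, p_e_given_not_c):
    """Causal power following Cheng (1997).

    Returns generative power for a non-negative contrast and preventive
    power (as a positive number) for a negative contrast. Returns None when
    generative power is undefined because the effect always occurs without
    the cause.
    """
    contrast = delta_p(p_e_given_c, p_e_given_not_c)
    if contrast >= 0:
        if p_e_given_not_c == 1:
            return None
        return contrast / (1 - p_e_given_not_c)
    return -contrast / p_e_given_not_c


# Hypothetical positive local link: P(B | A) = .75, P(B | not A) = .25
local_contrast = delta_p(0.75, 0.25)
local_power = causal_power(0.75, 0.25)

# Hypothetical zero contingency: P(C | A) = P(C | not A) = .5
zero_contrast = delta_p(0.5, 0.5)
zero_power = causal_power(0.5, 0.5)

print(local_contrast, local_power, zero_contrast, zero_power)
```

For a positive contingency both measures are positive, and for a zero contingency both are zero, so the qualitative predictions coincide in these cases.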
The predictions for people who failed to meet the selection criteria are not clear, since they may have failed for various reasons, such as a lack of concentration or because they recognized the intransitivity. To explore whether participants used the local relations as predictors, we correlated the empirically found estimates of P(C|A) with the product P(C|B) × P(B|A) and with P(C|B) × P(B|A) + (1 − P(C|B)) × (1 − P(B|A)). The latter estimate relies on a symmetry assumption in line with the observed data: P(C|B) = P(¬C|¬B). In Experiment 1a, both correlations were positive and large when we included all participants (r = .53 and r = .55) or only those meeting the selection criterion (r = .63 and r = .63). In Experiment 1b, in contrast, we obtained no correlation when considering all participants (r = .12 and r = .08) but strong positive correlations for participants meeting the criterion (r = .55 and r = .53). This suggests that people in the group meeting the selection criterion were often guided by transitivity, but that at least in Experiment 1b many of the participants failing to meet the criterion cannot be modeled by transitivity.
Additionally, we controlled for the two patterns of one-dimensional (1D) and two-dimensional (2D) category boundaries used in the eight control conditions (1D–2D–1D vs. 2D–1D–2D categorization type). Descriptively, the distortion effect was larger in the 2D–1D–2D condition. However, a two-way ANOVA concerning the P(C|A) judgments showed a significant main effect only of experimental condition vs. control, Experiment 1a, F(1, 44) = 19.50, p < .001; Experiment 1b, F(1, 49) = 5.15, p < .05. There was no effect of categorization type, Experiment 1a, F(1, 44) = 1.7, p = .20; Experiment 1b, F(1, 49) < 1, p = .99, and no interaction, Experiment 1a, F(1, 44) < 1, p = .94; Experiment 1b, F(1, 49) = 1.43, p = .24.
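The two transitive predictors and their correlation with judgments of the indirect relation can be sketched as follows. The function names are illustrative assumptions, and the judgment values are made up for demonstration only; they are not data from the experiments.

```python
from math import sqrt


def product_predictor(p_c_given_b, p_b_given_a):
    """Transitive estimate of P(C | A) from the path through B only."""
    return p_c_given_b * p_b_given_a


def symmetric_predictor(p_c_given_b, p_b_given_a):
    """Transitive estimate of P(C | A) that also includes the path through not-B.

    Assumes symmetry, P(C | B) = P(not C | not B), so that
    P(C | not B) = 1 - P(C | B).
    """
    return (p_c_given_b * p_b_given_a
            + (1 - p_c_given_b) * (1 - p_b_given_a))


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Made-up judgments of three hypothetical participants (illustration only):
# each tuple is (judged P(C | B), judged P(B | A)).
local_judgments = [(0.80, 0.70), (0.60, 0.90), (0.75, 0.75)]
global_judgments = [0.60, 0.55, 0.65]  # judged P(C | A)

products = [product_predictor(cb, ba) for cb, ba in local_judgments]
symmetric = [symmetric_predictor(cb, ba) for cb, ba in local_judgments]

r_product = pearson_r(products, global_judgments)
r_symmetric = pearson_r(symmetric, global_judgments)
print(r_product, r_symmetric)
```

A positive correlation between either predictor and the judged indirect relation would be consistent with participants deriving their estimates from the local links.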
With 40 microbes, some conditions only approximate the Markov condition. We did not increase the number of items in order to retain comparability with Experiment 1 and to ensure that the task did not become more difficult.
Across all conditions, an average of 29% of answers concerning local relations fell into the six excluded levels of the scale (out of 11 levels).
We also tested whether the difference between the categorization types modulated the size of the distortion effect. There was no significant difference in the two relevant intransitive conditions, + + 0: F(1, 71) = 2.55, p = .11 and − − 0: F(1, 64) = 3.35, p = .07, but descriptively the distortion effect was larger in the two-dimensional–one-dimensional–two-dimensional (2D–1D–2D) condition.
We did not conduct a global ANOVA on question order because applying the selection criterion of qualitatively correct local judgments to all conditions simultaneously would have excluded too many participants.
Note that this test has a lower statistical power than the test against zero, because applying the selection criterion to both conditions lowered the number of participants involved.
However, for the preventive intransitive chain condition this difference seems to have been driven mainly by the local–global condition where the difference between positive and negative answers became significant (exact binomial test, p < .05).
Acknowledgments
The work of M. v. S. and the running of the experiments were supported by a grant from the Deutsche Forschungsgemeinschaft (DFG Sy 111/2), as part of the priority program “New Frameworks of Rationality” (SPP 1516). B. M. was supported by grant ME 3717/2 from the same program. Portions of Experiments 1a and 1b were presented at the 2009 Cognitive Science conference in Amsterdam (von Sydow, Meder, & Hagmayer, 2009). We thank Alexander Wendt, Antonia Lange, Alina Greis, Christin Corinth, and Martine Vardar for assistance in data collection and Anita Todd and Martha Cunningham for correcting the manuscript. We are grateful to Ben Newell, Dennis Hebbelmann, Klaus Fiedler, Martha Cunningham, Ralf Mayrhofer, and Michael R. Waldmann for helpful comments on this research.
Funding information
Funder name: Deutsche Forschungsgemeinschaft