The (standard and strict) material accounts of indicative conditionals have a well-known problem: the truth (or known truth, for the strict material account) of the consequent is sufficient to warrant the truth/acceptability of the conditional. As such, these theories have a hard time explaining what is wrong with a conditional like

  (1) If it is sunny today, Ajax won the Champions League in 1995.

especially given that Ajax won the Champions League in 1995. More modern theories of conditionals such as the similarity-based account (e.g., Stalnaker 1968), the information-based account (e.g., Veltman 1985), or the probabilistic account (Adams 1965; Stalnaker 1970) still have a similar problem: if the antecedent and consequent are (known to be) true, the conditional is predicted to be true, or acceptable, as well. None of these analyses account for the intuition many people have that the truth or acceptability of a conditional relies on a dependence of the consequent on the antecedent. Other theories do demand a link between antecedent and consequent. Relevance theorists (Anderson and Belnap 1962; Urquhart 1972; Restall 1996) claim that the link should be either one of overlapping aboutness, or of the use of (part of) the antecedent for the proof of the consequent, while others (Krzyżanowska et al. 2013; Douven et al. 2018) claim that for acceptability of an indicative conditional we have to be able to infer the consequent from the antecedent.

In this paper we argue that the problem of the missing link (cf. Douven 2008) is not restricted to ‘standard’ conditionals like (1). For biscuit conditionals and conditional threats and promises, for instance, there should also be a link between antecedent and consequent for the conditional to be appropriate. This already suggests that if we want a more uniform analysis of (indicative) conditionals, the above theories that account for a link are not general enough. We propose such a more uniform analysis, one that accounts not just for ‘standard’ indicative conditionals but for other types of indicative conditionals as well. The analysis builds on the notion of ‘contingency’, or ‘relevance’, that was already used by Douven (2008) to explain what is wrong with ‘missing link’ conditionals. The notion of contingency was originally introduced in learning psychology to measure the learnability of a dependence between two features. We will propose that for the appropriateness of a conditional, the conditional probability of the consequent on the antecedent has to be weighted by their contingency (basic proposal). This will be shown to solve the ‘missing link’ problem discussed above (application 1). Drawing on experimental results on learning, we will motivate an extension of contingency to what we will call the ‘representativeness’ of C for A. This extension will allow us to account for biscuit conditionals (application 2), conditional threats and promises (application 3), and perhaps even anankastic and ‘even if’ conditionals (applications 4 and 5). We will consider to what extent this proposal can serve as a general analysis of the meaning of conditional sentences. Finally, we will point out the close link the present analysis draws between conditionals and generic sentences (application 6). In the final main section, we provide more detail on how our notion of representativeness is related to learning, and explain why representativeness is often confused with probability.

From Learning to Representativeness

Within classical learning-by-conditioning psychology, learning a dependency between two events C and A is measured in terms of the contingency \(\Delta P^ {C}_{A}\) of one event on the other (cf. Rescorla 1968):

$$\begin{aligned} \Delta P^ {C}_{A} = P(C|A) - P(C|\lnot A), \quad \hbox{where}\,P\,\hbox{measures frequencies}. \end{aligned}$$

Contingency does not simply measure whether the probability of C given A is high, but whether it is high compared to the probability of C given all (contextually relevant) cases other than A (\(\lnot A\) abbreviates \(\bigcup {\textit{Alt}}(A)\)). It thus measures how representative or typical C is for A. Rescorla (1968) showed that rats learn a \(\langle {\textit{tone}}, {\textit{shock}}\rangle\) association if the frequency of shocks immediately after the tone is higher than the frequency of shocks undergone otherwise, even if shocks occur only in, say, 12% of the trials in which a tone is present. Gluck and Bower (1988) show that contingency is crucial for human associative learning as well.
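As a minimal illustration of the definition, contingency can be computed directly from trial counts. The following Python sketch uses hypothetical function names and invented counts: the 12% shock rate after the tone echoes the example above, while the 2% baseline rate is purely an assumption made for the illustration.

```python
def contingency(p_c_given_a: float, p_c_given_not_a: float) -> float:
    """Delta-P: P(C|A) - P(C|not A), with P read as relative frequency."""
    return p_c_given_a - p_c_given_not_a


def contingency_from_counts(c_and_a: int, a_total: int,
                            c_and_not_a: int, not_a_total: int) -> float:
    """Compute Delta-P from the four counts of a 2x2 trial table."""
    return contingency(c_and_a / a_total, c_and_not_a / not_a_total)


# Hypothetical counts: 100 tone trials with 12 shocks,
# 100 no-tone trials with 2 shocks.
delta_p = contingency_from_counts(12, 100, 2, 100)
print(round(delta_p, 2))  # positive, although shocks follow the tone rarely
```

On these assumed counts the contingency is positive even though the shock follows the tone in only a small minority of trials, which is the pattern the association is reported to track.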

Experiments in the aversive (i.e., fear) conditioning paradigms (e.g., Annau and Kamin 1961; Forsyth and Eifert 1998) show that the speed of acquisition and the strength of the association in rats increase with the intensity of the shock. Slovic et al. (2004) show, similarly, that people build stronger associations related to events with high emotional impact. To capture this we introduce a new measure, the representativeness \(\nabla P^{C}_{A}\), defined as follows, where V(C|A) measures the absolute value (or intensity) of C given A.

$$\begin{aligned} ({\textit{RES}})\quad \nabla P^{C}_{A} = P(C|A) \times V(C|A) - P(C|\lnot A) \times V(C| \lnot A). \end{aligned}$$

The value of C given A measures something like (the absolute value of) a conditional utility, or conditional preference. Although in some applications we assume that \(V(C|A) = V(C)\), in other applications the conditionality of the utility is important. Although somewhat unusual, conditional utilities have been used before, e.g., by Armendt (1988). For Armendt, V(C|A) measures the present utility for C under the hypothesis A, which need not be identical to the utility the agent would have if A were true, or if (s)he came to believe A. We (mostly) think of (conditional) utilities as experienced utilities, as originally conceived by Bentham (1824/1987). Although such experienced utilities were long thought to be unmeasurable and thus unscientific, this opinion has changed significantly in recent years: several measures (some involving dopamine) are nowadays used to measure experienced joy and (experienced) fear within conditioning psychology, while due to the work of Kahneman and his collaborators (e.g., Kahneman et al. 1997), experienced utility has become a respectable notion even in economics. By making use of experienced instead of revealed utilities, we propose to make a link between standard decision theory and the use of intensity in learning-by-conditioning psychology. We will assume that in many circumstances, or by default, \(V(C|A) = 1 = V(C|\lnot A)\), meaning that under normal circumstances our notion of representativeness reduces to contingency, \(\nabla P^C_A = \Delta P^C_A\).
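The definition (RES) and its default reduction to contingency can be sketched in a few lines of Python. The function name and all numerical values below are hypothetical and serve only to illustrate the arithmetic.

```python
def representativeness(p_c_a: float, p_c_not_a: float,
                       v_c_a: float = 1.0, v_c_not_a: float = 1.0) -> float:
    """(RES): P(C|A) * V(C|A) - P(C|not A) * V(C|not A)."""
    return p_c_a * v_c_a - p_c_not_a * v_c_not_a


# Default case: both values equal 1, so (RES) is just contingency.
res_default = representativeness(0.12, 0.02)

# Hypothetical intense outcome whose value does not depend on A.
res_intense = representativeness(0.12, 0.02, v_c_a=10.0, v_c_not_a=10.0)

print(res_default)  # roughly 0.10, the plain contingency
print(res_intense)  # roughly 1.0, the same contingency scaled by the value 10
```

Keeping the two values as separate parameters, rather than a single V(C), leaves room for the applications in which the value of C depends on whether A holds.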

Conditionals As Representative Inferences

\(\Delta P^ {C}_{A}\) is a measure of the probabilistic dependence between C and A. To overcome the missing link problem of approaches to indicative conditionals of the form \(A \Rightarrow C\), one might therefore suggest using \(\Delta P^ {C}_{A}\) or (RES) to check the acceptability of a conditional sentence. Indeed, Douven (2008) uses the measure \(P(C|A) - P(C)\) for these purposes, and it is easy to prove that \(P(C|A) > P(C)\) iff \(\Delta P^C_A > 0\). An advantage of using \(\Delta P^C_A\) is that this measure has its maximal value, i.e., 1, if and only if \(P(C|A) = 1\) and \(P(C|\lnot A) = 0\). But this holds exactly when ‘If A, then C’ is strengthened to ‘A if and only if C’, a strengthening often observed for indicative conditionals under the name of ‘conditional perfection’ (cf. Geis and Zwicky 1971). However, Skovgaard-Olsen et al. (2016) show that although \(\Delta P^ {C}_{A} > 0\) is a necessary condition for acceptability of indicative conditionals, it is not a sufficient one: it is also demanded that P(C|A) is high. To account for that, one can make use of the following condition:

$$\begin{aligned} ({\textit{CON}}')\quad A \Rightarrow C\,\hbox{is acceptable} \quad \hbox{iff}\quad \frac{\Delta P^C_A}{1 - P(C|\lnot A)}\,\hbox{is high.} \end{aligned}$$

This latter measure is known in the literature as the measure of relative difference (Shep 1958). Cheng (1997) uses it to measure causal strength and shows that for this measure, P(C|A) counts for more than \(P(C|\lnot A)\). This captures part of the intuition that for \(A \Rightarrow C\) to be acceptable, it should (normally) be the case that P(C|A) is high (see Sects. 4 and 5 for more on this).
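The equivalence between \(P(C|A) > P(C)\) and \(\Delta P^C_A > 0\) can be checked with the law of total probability. The sketch assumes that \(0< P(A) < 1\) and that \(\lnot A\) is treated as the complement of A, so that \(P(\lnot A) = 1 - P(A)\):

$$\begin{aligned} P(C|A) - P(C)&= P(C|A) - \bigl (P(A) \times P(C|A) + P(\lnot A) \times P(C|\lnot A)\bigr ) \\&= P(\lnot A) \times P(C|A) - P(\lnot A) \times P(C|\lnot A) \\&= P(\lnot A) \times \Delta P^C_A. \end{aligned}$$

Since \(P(\lnot A) > 0\), both differences have the same sign. In a similar spirit, one way to see why P(C|A) weighs more heavily than \(P(C|\lnot A)\) in the relative difference is to rewrite it, assuming \(P(C|\lnot A) < 1\):

$$\begin{aligned} \frac{\Delta P^C_A}{1 - P(C|\lnot A)} = 1 - \frac{1 - P(C|A)}{1 - P(C|\lnot A)}. \end{aligned}$$

The measure thus reaches 1 as soon as \(P(C|A) = 1\), whatever the value of \(P(C|\lnot A) < 1\).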

For the general case, however, we should not look only at informational value: utility, or emotional value, counts as well. Therefore, we propose the following generalization of (\({\textit{CON}}'\)) as our general condition (with \({\textit{EV}}(C|\lnot A)\) as an abbreviation for \(P(C|\lnot A) \times V(C| \lnot A)\)):

$$\begin{aligned} ({\textit{CON}})\quad A \Rightarrow C \hbox{ is acceptable } \quad \hbox{ iff } \quad \frac{\nabla P^C_A}{\max \{1, V(C|A)\} - {\textit{EV}}(C|\lnot A)}\hbox{ is high.} \end{aligned}$$

Notice that if \({\textit{Value}}\) is irrelevant (meaning that \(V(C|A) = V(C|\lnot A) = 1\)), for acceptability it is a necessary condition that \(\Delta P^C_A > 0\). Moreover, under these circumstances, \(({\textit{CON}})\) comes down to the simpler condition (\({\textit{CON}}'\)) above.
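The relation between (CON) and (CON′) can be made concrete with a small Python sketch. The function names, the choice to return NaN when the denominator vanishes, and all numbers are assumptions made for illustration; the text itself only requires that the measure be ‘high’ and fixes no threshold.

```python
import math


def acceptability(p_c_a: float, p_c_not_a: float,
                  v_c_a: float = 1.0, v_c_not_a: float = 1.0) -> float:
    """Measure in (CON): representativeness over max{1, V(C|A)} - EV(C|not A).

    Returns NaN when the denominator is zero, i.e. the measure is undefined.
    """
    ev_not_a = p_c_not_a * v_c_not_a
    numerator = p_c_a * v_c_a - ev_not_a
    denominator = max(1.0, v_c_a) - ev_not_a
    if denominator == 0:
        return math.nan
    return numerator / denominator


def relative_difference(p_c_a: float, p_c_not_a: float) -> float:
    """Measure in (CON'): Delta-P / (1 - P(C|not A))."""
    return (p_c_a - p_c_not_a) / (1.0 - p_c_not_a)


# With both values equal to 1, (CON) coincides with (CON').
general = acceptability(0.9, 0.5)
simple = relative_difference(0.9, 0.5)
print(general, simple)  # both roughly 0.8

# Same contingency (0.4) in both cases, but a higher P(C|A) scores higher.
high_pca = relative_difference(0.9, 0.5)
low_pca = relative_difference(0.5, 0.1)
print(high_pca, low_pca)  # roughly 0.8 versus roughly 0.44
```

The last two calls illustrate the point attributed to Cheng (1997): two conditionals with identical contingency are ranked differently once P(C|A) itself differs.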


Application 1: The Missing Link Problem

Contingency already accounts for conditionals like (1) (cf. Douven 2008; Skovgaard-Olsen 2016; Skovgaard-Olsen et al. 2016). If antecedent and consequent are probabilistically independent, we get \(\Delta P^ {C}_{A} = P(C|A) - P(C|\lnot A) = 0\). If \({\textit{Value}}\) doesn’t count, it follows from independence that \(\Delta P^ {C}_{A} = \nabla P^C_A = 0\). Hence, we predict that even in case \(P(C)=1\) (and perhaps \(P(A) = 1\)) and, therefore, \(P(C|A) = 1\), the conditional (1) is not acceptable. As noted above, we believe that contingency by itself is not the appropriate measure to account for indicative conditionals: P(C|A) should count for more than \(P(C|\lnot A)\). For this reason, (\({\textit{CON}}'\)) seems preferable to contingency. But there is more that speaks in favor of (\({\textit{CON}}'\)): as shown by Cheng (1997) and Pearl (2000), \(\frac{\Delta P^C_A}{1- P(C|\lnot A)}\) follows from a causal analysis (under some natural assumptions). Cheng calls the measure ‘causal strength’, while Pearl (2000) refers to it as the ‘probability of causal sufficiency’. Seen in this way, what is missing in missing-link conditionals is a causal connection between antecedent and consequent, or so van Rooij and Schulz (2019a) argue. van Rooij and Schulz (2019a) also use the causal view behind the measure \(\frac{\Delta P^C_A}{1- P(C|\lnot A)}\) to show that under various natural circumstances (e.g., if A is (thought to be) the only cause of C, or if the potential causes of C are mutually inconsistent), acceptability of conditionals can be measured by conditional probability, suggesting that the original proposals of Adams (1965) and Stalnaker (1970) were not far off.

Application 2: Biscuit-Conditionals

To account for missing-link conditionals we argued that the value of P(C|A) should be higher than that of \(P(C|\lnot A)\) (or of P(C)). But there are obvious exceptions to this, most prominently Austin’s (1961) biscuit conditionals:

  (2)

    a.   There are biscuits on the sideboard, if you want some.

    b.   If you are interested, there’s a good documentary on BBC tonight.

    c.   If you need help, my name is Sue.

Iatridou (1991) and others claim that in a biscuit conditional, the if-clause specifies the circumstances in which the consequent is relevant. DeRose and Grandy (1999) seek to account for this by proposing a conditional assertion analysis of biscuit conditionals. According to such an analysis (cf. de Finetti 1936/1995; Belnap 1970), the conditional ‘If A, C’ states that C is true if A holds, and doesn’t say anything otherwise. Belnap (1970) himself, however, already argued against such an analysis for biscuit conditionals:

But I do know that “There are biscuits on the sideboard if you want some” is not generally used as a conditional assertion; for if there are no biscuits, even if you don’t want any, it is plain false, not nonassertive. (Belnap 1970, p. 11).

We agree with Belnap’s intuition. Franke (2007) argues that semantically speaking, biscuit conditionals could just be analyzed as material or strict implications. He proposes to use pragmatics (using a qualitative or quantitative notion of independence), instead, to explain why (2-a), for instance, entails that there are biscuits on the sideboard. This proposal is certainly appealing, but as noted by Lauer (2015), this analysis by itself still leaves open what it is that makes the antecedent relevant to the consequent. Indeed, what we need is (i) epistemic independence (e.g., \(P(C|A) = P(C)\) and thus \(\Delta P^C_A = P(C|A) - P(C| \lnot A) = 0\)), without giving up that (ii) the antecedent is still of value to the consequent. Our analysis (CON) captures this.

To see this, notice that in the relevant situation the biscuits are on the sideboard, independently of whether you want some or not. Thus \(\Delta P^C_A = P(C|A) - P(C|\lnot A) = 0\). What makes the antecedent still of value for the consequent? Right, high V(C|A)! If you want biscuits, it is important to know that the biscuits are easy to take: they are just there on the sideboard. Similarly for (2-b)–(2-c). Thus, for biscuit conditionals the \({\textit{Value}}\) in the definition of representativeness \(\nabla P^{C}_{A}\) matters. In (2-a)–(2-c), learning the truth of the consequent is of little or no value if the antecedent is false, but this value is high if the antecedent is true. Hence, \(V(C| A) \gg V(C| \lnot A) \approx 0\). As a result, \(\nabla P^C_A = P(C|A) \times V(C|A) - P(C|\lnot A) \times V(C|\lnot A)\) will be high, and this explains the appropriateness of the conditional.

Notice that in (CON) we used \(\max \{1, V(C|A)\} - {\textit{EV}}(C|\lnot A)\) in the denominator, and not simply \(1 - P(C|\lnot A)\). Although the former comes down to the latter in natural circumstances (i.e., when \(V(C|A) = V(C|\lnot A) = 1\)), it is crucial for biscuit conditionals that we used the more general formula. The reason is that for biscuit conditionals \(P(C|\lnot A) = 1\), meaning that \(1 - P(C|\lnot A) = 0\) and thus that the fraction would not be defined if we used \(1 - P(C|\lnot A)\) as denominator. As we noticed above, for biscuit conditionals it might be that \(V(C| \lnot A) = 0\) and \(V(C|A) = 1\), meaning that \(\frac{\nabla P^C_A}{\max \{1, V(C|A)\} - {\textit{EV}}(C|\lnot A)}\) reduces in those cases to \(\frac{P(C|A) \times V(C|A)}{V(C|A)}= P(C|A)\), which, in turn, will typically have value 1 for a good biscuit conditional. We have seen already that if \(V(C|A) = V(C|\lnot A)\), it will be the case that \(\nabla P^C_A = 0\), because for biscuit conditionals \(P(C|A) = P(C|\lnot A)\), and thus that the conditional is unacceptable.
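As a worked instance with purely hypothetical values, take \(P(C|A) = P(C|\lnot A) = 1\), \(V(C|A) = 3\) and \(V(C|\lnot A) = 0.1\). Then

$$\begin{aligned} \nabla P^C_A&= 1 \times 3 - 1 \times 0.1 = 2.9, \\ \max \{1, V(C|A)\} - {\textit{EV}}(C|\lnot A)&= 3 - 1 \times 0.1 = 2.9, \end{aligned}$$

so the measure in (CON) equals 1, whereas the simpler denominator \(1 - P(C|\lnot A) = 0\) would leave the fraction undefined.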

Application 3: Conditional Threats and Promises

Our analysis works, or so we think, also for conditional threats, promises and warnings:

  (3)

    a.   If you won’t give me your wallet, I will kill you.

    b.   If you give me 10.000 euros, I will destroy the (for you hazardous) tapes.

    c.   If you go to New York, watch out for the taxi drivers.

We take it (following Schelling 1960 and many others) that conditional threats and promises are used strategically in order to influence the hearer’s behaviour: the speaker wants the addressee to give him (or her) the wallet or the 10.000 euros, and the threat or promise states what the speaker will ‘offer’ in return. What needs to be explained for such conditionals is that addressees often ‘accept’ them, although these threats and promises are not very credible (cf. Schelling 1960; Hirschleifer 1991). Would it really be rational for the threatener to kill the addressee if the latter doesn’t give the former his or her wallet? And once (s)he has the 10.000 euros in his or her pocket, why would the promiser still destroy these valuable tapes? Thus, although the speaker of (3-a) and (3-b) seems to commit him- or herself to a particular action conditional on the antecedent, why should (s)he stick to this commitment?

Indeed, for the addressee P(C|A) is typically not very high. However, for both (3-a) and (3-b), the probability of the consequent given \(\lnot A\), \(P(C|\lnot A)\), will certainly be lower than given A (certainly if the speaker is, or pretends to be, desperate or irrational enough). As a result, \(P(C|A) - P(C|\lnot A) > 0\). On our analysis this is not enough for the conditionals to be acceptable. What we need for that is that the value of C (given A), V(C|A), is high. It is natural to assume that in these conditionals, the emotional impact of the consequent is independent of the antecedent. Thus, representativeness reduces to \(\Delta P^{C}_{A} \times V(C)\). Given that in these cases V(C) is extremely high for the addressee, it follows that \(\nabla P^C_A\) will be high, even if \(P(C|A) - P(C|\lnot A)\) is low. Thus, these conditional threats/promises are accepted, as long as the stakes communicated in the consequent are high enough.
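The reduction of representativeness used here is a one-line consequence of (RES) once \(V(C|A) = V(C|\lnot A) = V(C)\):

$$\begin{aligned} \nabla P^C_A = P(C|A) \times V(C) - P(C|\lnot A) \times V(C) = \Delta P^C_A \times V(C). \end{aligned}$$

With purely hypothetical numbers, say \(P(C|A) = 0.1\), \(P(C|\lnot A) = 0.01\) and \(V(C) = 100\), contingency is only 0.09, while representativeness is \(0.09 \times 100 = 9\).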

The reader must have noticed that for our analysis of conditional threats and promises, it is the addressee’s probabilities and utilities that count, not those of the speaker, as is normally assumed for analyses of indicative conditionals. Indeed, we think that in contrast to standard (indicative) conditionals, the addressee’s attitudes are crucial to account for the acceptability of conditional threats and promises. One might wonder to what extent one can then still speak of a (more) ‘uniform’ analysis. We think that our analysis of conditional threats and promises is still part of a uniform analysis, if we take seriously the use of ‘you’ in the antecedent of the conditionals. What this indicates, or so we would like to propose, is that the perspective is shifted from the speaker to the addressee. We don’t have a worked-out theory of when and how such a shift of perspective will take place, but it seems natural to us that such a shift is needed to account for conditional speech acts like (3-a) and (3-b).

What about conditional warnings? For these, it seems it is the difference between V(C|A) and \(V(C|\lnot A)\) that counts. The speaker of (3-c) seems to intend to communicate that it is useful for the addressee to know that taxi drivers are more dangerous in New York City than in the addressee’s hometown.

Applications 4 and 5: Anankastic and Even-if Conditionals

According to Kratzer (1991) (following Lewis 1975), conditional sentences of the form ‘If A, then C’ should be represented logically by ‘Quantifier + if A, C’. Logical forms like ‘Most + If A, then C’ and ‘Must + If A, then C’ are then interpreted roughly as follows: ‘for most of the (selected) worlds in which A is true, C is true as well’, and ‘in all (selected) worlds in which A is true, C is true as well’, respectively. One serious challenge for this analysis is posed by so-called ‘anankastic’ conditionals like the following:

  (4)

    a.   If you want to go to Harlem, take the A-train.

    b.   If you want sugar in your soup, ask the waiter.

Intuitively, (4-a) is true, or appropriate, just in case taking the A-train is the best, or most useful, way to go to Harlem. This intuition is captured to a large extent by saying that in all (selected) worlds in which you go to Harlem, you take the A-train. Unfortunately, this is not what Kratzer’s analysis predicts. Her theory predicts that (4-a) is true just in case in all (selected) worlds in which you want to go to Harlem, you take the A-train. Thus, for a Kratzer-like analysis of conditionals, the problem is one of compositionality: how to ‘get rid’ of the ‘want’ in the antecedent of the conditional (cf. Saebo 2001)? There is no shortage of proposals for how this should be done, but only seldom, if ever, is the similarity between anankastic conditionals, on the one hand, and biscuit conditionals like (2-a), on the other, observed or made use of.

Our (admittedly rather provisional) analysis is different from Kratzer’s. We would analyse anankastic conditionals in the same way as we treated biscuit conditionals: the consequent is relevant for the hearer only in case the antecedent holds, i.e., if you want to go to Harlem, or want sugar in your soup. Thus V(C|A) should be high, or at least much higher than \(V(C|\lnot A)\). Of course, on our analysis we should take P(C|A) and \(P(C|\lnot A)\) into account as well. But then, anankastic conditionals are typically, if not always, used to give advice. Typically, (4-a) is given as an answer to a question like ‘Which train should I take if I want to go to Harlem?’ A questioner like that has little or no idea which train is best to take, so the difference between P(C|A) and \(P(C| \bigcup {\textit{Alt}}(A))\) is rather small, where \({\textit{Alt}}(A)\) are the alternative destinations. Thus, \(P(C|A) - P(C|\lnot A)\) is small, that is, not high enough to make the conditional acceptable. What makes the conditional acceptable is the difference between V(C|A) and \(V(C|\lnot A)\), just as in the case of biscuit conditionals.

The experiments of Skovgaard-Olsen et al. (2016) suggest that relevance, or positive \(\Delta P^C_A\), is necessary for ‘ordinary’ indicative conditionals, but not for so-called ‘even if’-conditionals like

  (5) Mary comes, even if John comes.

According to them, the acceptability of ‘even if’ conditionals ‘goes with’ the corresponding conditional probability. We have argued above that under specific conditions our general measure \(\frac{\nabla P^C_A}{\max \{1, V(C|A)\} - {\textit{EV}}(C|\lnot A)}\) comes down to the conditional probability P(C|A). The most relevant case for our purposes seems to be the one where \(V(C|\lnot A) = 0\) (and \(V(C|A) = 1\)). Perhaps this is what is going on in ‘even if’-conditionals like (5): we don’t care whether Mary comes if John doesn’t come, presumably because we know already that she would come in that case anyway. The only interesting case is the one where John comes. Thus, under this proposal, ‘even if’-conditionals have a lot in common with biscuit conditionals, although it doesn’t have to be the case that \(P(C|A) = 1\).

Application 6: Generics

Generics and conditionals are much alike. They both have at least the following purposes: (i) to state (inductive) generalizations (‘Tigers are striped’, ‘If you push this button, the lamp will light’); (ii) to express (perhaps desired) norms (‘Boys don’t cry’, ‘If you see a general, you salute him’); and (iii) to express threatening cases (‘Pit bulls are dangerous dogs’, ‘If you don’t give me your wallet, I will kill you’). This suggests that they should be given very similar analyses. Indeed, just as there exists the missing-link problem for conditionals, generics of the form ‘As are C’ also seem to be acceptable (under normal conditions) only if being an A is relevant for having feature C. To illustrate, the following generic is generally taken to be inappropriate, because Germans are not special in terms of right-handedness:

  (6) ?Germans are right-handed.

As it turns out, in van Rooij (2017) and van Rooij and Schulz (2019b) (building on Cohen (1999) and Leslie (2008)) an analysis of generic sentences in terms of representativeness was indeed proposed: a generic sentence of the form ‘As are C’ was proposed to be true, or acceptable, iff C is a representative feature of As. It is shown that in terms of this analysis quite a number of examples can be accounted for that are problematic for more standard semantic analyses of generic sentences making use of conditional probability or normality. For instance, this analysis immediately accounts for generics like ‘Ticks carry the Lyme disease’ or ‘Sharks attack swimmers’ that are problematic for default-based approaches (e.g., Asher and Morreau 1995) and called ‘striking generics’ by Leslie (2008), who notes that ‘striking’ often means ‘horrific or appalling’. Observe that in case all features are equally important, it is predicted that a generic of the form ‘As are C’ is true iff \(\Delta P^C_A\) is high, from which it follows that \(\Delta P^C_A > 0\), which is exactly what Cohen (1999) demands for so-called ‘relative generics’ (e.g., ‘Dutchmen are good sailors’) to be true. Making use of \(\Delta P^C_A\) one can explain, for instance, why the generic ‘Ducks lay eggs’ is predicted to be ok, although the majority of ducks don’t lay eggs, and why (6) is a questionable generic, although most Germans are right-handed.

However, this analysis accounts as well for the intuition that standard generics like ‘Birds fly’ and ‘Birds lay eggs’ are acceptable and true (because ‘flying’ and ‘laying eggs’ are among the most distinguishable features for birds). Our weak analysis of generics also explains examples that are paradoxical for many other theories. First, although only (adult) male lions have manes, ‘Lions have manes’ is an accepted generic, but ‘Lions are male’ is not. Our analysis thus correctly predicts that ‘As are C’ can be true and ‘As are D’ false, although \(P(D|A) > P(C|A)\) and \(P(C|A) < \frac{1}{2}\). Second, it explains why ‘Peacocks lay eggs’ and ‘Peacocks have beautiful feathers’ are both considered true, although no peacock both lays eggs (female) and has beautiful feathers (male). Both generics are predicted to be true simply because relative to other animals (in general), many peacocks have the relevant features.
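The contrast between these generics can be illustrated numerically. The Python sketch below assumes that all features are equally important (so that only contingency matters); the frequencies are invented for the sake of the example and are not empirical estimates.

```python
def contingency(p_c_a: float, p_c_not_a: float) -> float:
    """Delta-P: P(C|A) - P(C|not A)."""
    return p_c_a - p_c_not_a


# Hypothetical frequencies: (P(feature | kind), P(feature | other kinds)).
generics = {
    "Lions have manes": (0.40, 0.01),
    "Lions are male": (0.50, 0.50),
    "Germans are right-handed": (0.90, 0.90),
}

scores = {sentence: contingency(*probs) for sentence, probs in generics.items()}
for sentence, score in scores.items():
    print(f"{sentence}: {score:+.2f}")
```

On these assumed numbers ‘Lions have manes’ receives a clearly positive score although fewer than half of the lions have the feature, while ‘Lions are male’ and (6) score zero despite a higher conditional frequency.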

Thus, our proposal provides a uniform analysis of all types of examples discussed in this paper, including various types of indicative conditionals and generics. What this analysis of generics does not yet explain is why people typically interpret generics of the form ‘As are C’ as saying that (almost) all As are C. In van Rooij and Schulz (2019b) it was argued that this is due to the fact that people confuse representativeness with conditional probability, and this was accounted for by making use of Tversky and Kahneman’s (1974) ‘heuristics and biases’ program. At this point, however, we think that the strong interpretation of generics can better be explained in terms of how we learn generalizations.

Representativeness As Expectation

Even if hearers accept conditionals of the form ‘If A, then C’ due to our proposed weak acceptance rules, hearers still interpret conditionals typically in a much stronger way: the likelihood of C given A is high (Adams 1965). Why? We think it has something to do with how we learn generalizations.

In behavioral psychology, the learning of generalizations, or expectations, was studied in classical conditioning (or Pavlovian conditioning). What is the expectation that the \(n + 1\)th cue a will be accompanied with consequence c? Perhaps the most natural idea is that it is just the number of times that cue a was accompanied with consequence c divided by the number of times that cue a was given at all. If we say that \(O_i(c|a) = 1\) if at the ith exposure cue a is accompanied with consequence c, and that \(O_i(c|a) = 0\) if at the ith exposure cue a is not accompanied with consequence c, the expectation that the \(n + 1\)th cue a will be accompanied with consequence c, i.e., \(P^*_{n + 1} (c|a)\), can be stated as follows:

$$\begin{aligned} ({\textit{RF}}) \quad P^*_{n + 1} (c|a)&= \frac{O_1(c|a) + \cdots + O_n(c|a)}{n} \\ &= \frac{1}{n} \sum _{i =1}^{n} O_i(c|a) \end{aligned}$$

It can be shown, however, that for the calculation of \(P^*_{n + 1} (c|a)\) it is not necessary to maintain a record of all cases where cue a was accompanied with consequence c. One can calculate \(P^*_{n + 1} (c|a)\) incrementally as well, by constantly updating the expectation. This can be shown as follows (adapted from a very similar proof by Sutton and Barto (2016)):

$$\begin{aligned} P^*_{n + 1} (c|a)&= \frac{1}{n} \sum _{i =1}^{n} O_i(c|a)\\ &= \frac{1}{n} \left( O_n(c|a) + \sum _{i =1}^{n - 1} O_i(c|a) \right) \\ &= \frac{1}{n} \left( O_n(c|a) + (n - 1) \frac{1}{n - 1} \sum _{i =1}^{n - 1} O_i(c|a) \right) \\ &= \frac{1}{n} \left( O_n(c|a) + (n - 1) P^*_n(c|a) \right) \\ &= \frac{1}{n} \left( O_n(c|a) + n P^*_n(c|a) - P^*_n (c|a) \right) \\ &= P^*_n(c|a) + \frac{1}{n} \left( O_n(c|a) - P^*_n(c|a) \right) \end{aligned}$$

Notice that the last incremental learning rule always gives rise to the relative frequency observed, with small demands on memory and computational power. It turns out that the form of this incremental learning rule is very common. It is known as learning by expected error minimization and underlies many modern methods of machine learning. The general form of such rules is as follows:

$$\begin{aligned} {\textit{NewExpectation}} \quad \longleftarrow \quad {\textit{OldExpectation}} + {\textit{Stepsize}} [{\textit{Target}} - {\textit{OldExpectation}}] \end{aligned}$$

The \({\textit{Stepsize}}\) is also known as the learning rate. In the case above this was \(\frac{1}{n}\), but it is often taken to be a small constant. The \({\textit{Target}}\) is the value of the new observation, \(O_i(c|a)\). Above, the target was 1 or 0, but in general it could be any value. In particular, it could depend on the intensity of the consequence. Indeed, because \(P^*_{n + 1} (c|a) = \frac{1}{n} \sum _{i =1}^{n} O_i(c|a)\), if \(O_i(c|a)\) is high for each \(i \le n\) where a is accompanied with c, \(P^*_{n + 1} (c|a)\) will clearly be high as well, and in particular much higher than the conditional frequency.
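The general update rule can be sketched in Python as follows. The function name `learn`, the observation record and the intensity value 5 are hypothetical; the sketch only illustrates that a step size of 1/n reproduces the relative frequency, that intensity-valued targets raise the learned expectation above that frequency, and that a constant step size gives a different result.

```python
from typing import Callable, Sequence


def learn(targets: Sequence[float], step_size: Callable[[int], float],
          initial: float = 0.0) -> float:
    """Apply New <- Old + Stepsize * (Target - Old) once per observation."""
    expectation = initial
    for n, target in enumerate(targets, start=1):
        expectation += step_size(n) * (target - expectation)
    return expectation


# Hypothetical record: cue a is followed by c on 3 of 5 exposures.
observations = [1, 0, 1, 1, 0]

# Step size 1/n: the result is the relative frequency, without storing a sum.
running_mean = learn(observations, lambda n: 1 / n)

# Targets scaled by a hypothetical intensity of 5 per occurrence of c.
intense_mean = learn([5 * o for o in observations], lambda n: 1 / n)

# Constant learning rate: recent observations weigh more than early ones.
constant_rate = learn(observations, lambda n: 0.1)

print(running_mean, intense_mean, constant_rate)
```

Passing the step size as a function of the trial number keeps both the 1/n schedule and the constant learning rate within the same rule.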

As we saw in Sect. 2, Rescorla (1968) observed that rats learn a tone (cue/cause)-shock (outcome/consequence) association if the frequency of shocks immediately after the tone is higher than the frequency of shocks undergone otherwise. This holds even if a shock actually follows the tone in only a minority of cases. Gluck and Bower (1988) and others show that humans learn associations between the representations of certain cues (properties or features) and consequences (typically another property or a category prediction) in a very similar way. Thus, we associate consequence c with cue a, not so much if P(c|a) is high, but rather if \(\Delta P^c_a = P(c|a) - P(c|\lnot a)\) is high. How can this be explained? Rescorla and Wagner (1972) show that this can be explained by an error-based learning rule very similar to the one above. The only thing that really changes is that this time the learning rule is also competition-based. The idea is that a cue can also be taken as a combination of separate cues: if \(a_1\) and \(a_2\) are cues, \(a_1a_2\) is taken to be a cue as well, and they all could be accompanied with the same outcomes. According to Rescorla and Wagner (1972), we should keep track of expectations, or associations, for cue-outcome pairs for all primitive cues, i.e., \(a_1\) and \(a_2\). For the calculation of the expectation \(E^*_{n+1}(c|a_1)\) after the nth trial, however, we should also look at \(E^*_{n}(c|a_2)\) in case the actual cue at the nth trial is the combined cue \(a_1a_2\). The famous Rescorla–Wagner learning rule (RW) for each primitive cue \(a_i\) is stated as follows, if at the nth exposure a (perhaps complex) cue \(a^*\) is given of which \(a_i\) is ‘part’ (where \(j \preceq a^*\) holds if \(a_j\) is part of the (perhaps complex) cue \(a^*\)):

$$\begin{aligned} ({\textit{RW}}) \quad E^*_{n+1}(c|a_i) = E^*_{n}(c|a_i) + \lambda \left( O_n(c|a^*) - \sum _{j \preceq a^*} E^*_n(c|a_j)\right) \end{aligned}$$

Here, \(E^*_{n+1}(c|a_i)\) is the agent’s expectation after n observations that the \(n+1\)th primitive cue \(a_i\) has outcome c, where \(\lambda\) is a learning rate (typically very small) and where \(O_n(c|a^*)\) measures the magnitude of the reinforcement at the nth trial where cue \(a_i\) was involved. Notice that the cue at the nth trial could be just a primitive cue, but it could be a combined cue as well. If the nth cue is a combined cue like \(a_1a_2\), \(\sum _j E^*_n(c|a_j) = E^*_{n}(c|a_1) + E^*_{n}(c|a_2)\) will obviously be larger than \(E^*_{n}(c|a_i)\), and this has interesting consequences. For instance, if our learner is conditioned with the cue-outcome/consequence pairs \(a_1a_2 \rightarrow c\) and \(a_2 \rightarrow \lnot c\) in alternation, in the long run it will be that \(E^*(c|a_1) = 1\) and \(E^*(c|a_2) = 0\). Thus, \(a_1\) is associated with consequence c, and cue \(a_2\) is not associated with this consequence at all, although in half of the cases in which cue \(a_2\) was involved, consequence c appeared. The opposite is predicted if the learner is conditioned with the cue-consequence pairs \(a_1a_2 \rightarrow c\) and \(a_2 \rightarrow c\) in alternation. In that case it will be that in the long run \(E^*(c|a_1) = 0\) and \(E^*(c|a_2) = 1\). Notice that these predictions are in accordance with what is predicted by the contingency rule, insofar as in the first case \(\Delta P^c_{a_1} = 1\), while in the second case \(\Delta P^c_{a_1} = 0\).
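The two alternating designs can be simulated with a direct transcription of rule (RW). This is a minimal Python sketch: the function name, the learning rate of 0.05, the number of repetitions and the zero starting values are assumptions, and the outcome magnitude is set to 1 when c occurs and 0 otherwise.

```python
def rescorla_wagner(trials, cues, rate=0.05, epochs=2000):
    """Run rule (RW) over a repeating list of (present_cues, outcome) trials."""
    strength = {cue: 0.0 for cue in cues}
    for _ in range(epochs):
        for present, outcome in trials:
            # One prediction error, shared by all cues present on the trial.
            error = outcome - sum(strength[c] for c in present)
            for c in present:
                strength[c] += rate * error
    return strength


# Design 1: a1a2 -> c alternates with a2 -> not c.
design_1 = [(("a1", "a2"), 1.0), (("a2",), 0.0)]
# Design 2: a1a2 -> c alternates with a2 -> c.
design_2 = [(("a1", "a2"), 1.0), (("a2",), 1.0)]

result_1 = rescorla_wagner(design_1, ["a1", "a2"])
result_2 = rescorla_wagner(design_2, ["a1", "a2"])
print(result_1)  # a1 close to 1, a2 close to 0
print(result_2)  # a1 close to 0, a2 close to 1
```

Because the summed prediction of all present cues enters the error term, the cue that already predicts the outcome on its own leaves nothing for the other cue to learn; this is the competition referred to above.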

More generally, Cheng (1997) shows that if the alternative cues for c are incompatible with a, \(E^*_{n+1}(c|a)\) converges to the actual conditional probability (or relative frequency). If alternative cues are compatible with a, however, \(E^*_{n+1}(c|a)\) instead yields \(\Delta P^c_a = P(c|a) - P(c|\lnot a)\) in the long run (see also Danks 2003). If the value O(c|a) is higher than 1 (in terms of the previous sections, this means that \(V(c|a) > 1\)), \(E^*_{n+1}(c|a)\) converges to the actual average conditional impact, \({\textit{EV}}(c|a) = P(c|a) \times O(c|a)\), if cues are mutually incompatible, and to something closer to \({\textit{EO}}(c|a) - {\textit{EO}}(c|\lnot a)\) otherwise. Thus, in many cases expectations, or associations, as generated by rule (RW) do not really measure probabilities; they measure something quite different. Still, it is only natural that people take this ‘something quite different’, i.e., the associations, to be the conditional likelihood. In fact, according to, e.g., Newell et al. (2007), we can explain many of the problematic probability judgements found in, e.g., Tversky and Kahneman (1974) by the assumption that people confuse probabilities with associations as established via associative learning mechanisms like (RW).
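The convergence to \(\Delta P^c_a\) for compatible cues can be illustrated with a small sketch. It rests on two simplifying assumptions that are not part of the text above: the alternative cues are modelled as a single background cue that is present on every trial (and hence compatible with a), and the trial-by-trial rule (RW) is replaced by its expected update, so that the result is deterministic. The function name and the probabilities used are hypothetical.

```python
def expected_rw(p_a, p_c_given_a, p_c_given_not_a, lam=0.1, steps=5000):
    """Iterate the expected (RW) update for cue a plus an always-present
    background cue. Returns (expectation for a, expectation for background).

    Assumptions: reinforcement is 1 when c occurs and 0 otherwise; both cues
    share the learning rate lam; trials with a occur with probability p_a.
    """
    e_a, e_bg = 0.0, 0.0
    for _ in range(steps):
        # Mean prediction error on trials with a (background also present).
        err_with_a = p_c_given_a - (e_a + e_bg)
        # Mean prediction error on trials with the background cue alone.
        err_without_a = p_c_given_not_a - e_bg
        # Cue a is only updated on trials where it is present.
        e_a += lam * p_a * err_with_a
        # The background cue is updated on every trial.
        e_bg += lam * (p_a * err_with_a + (1 - p_a) * err_without_a)
    return e_a, e_bg


# Hypothetical frequencies: c follows a in only a minority of cases.
e_a, e_bg = expected_rw(p_a=0.5, p_c_given_a=0.4, p_c_given_not_a=0.1)
print(round(e_a, 6), round(e_bg, 6))  # close to Delta P = 0.3 and P(c|not a) = 0.1
```

In this sketch the expectation attached to a settles near \(P(c|a) - P(c|\lnot a)\) rather than near \(P(c|a)\), while the background cue absorbs \(P(c|\lnot a)\).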

Rule (RW) is only the simplest associative learning rule, and many variants have been proposed over the years (for instance, with time- and cue-dependent learning rates, or where uncertainty of cues is taken into account), variants that give rise to (sometimes slightly) different convergence results. Yuille (2006), for instance, shows that there is a learning rule closely related to (RW) that converges to \(\frac{\Delta P^C_A}{1 - P(C|\lnot A)}\), i.e., the measure we used in acceptability rule (\({\textit{CON}}'\)). Most of these alternative learning rules have in common, however, that although they measure expectation, or association, in the long run they do not end up with the relative frequency P(C|A) if there is competition between cues for the same outcome. In this way we explain why hearers accept conditionals of the form ‘If A, then C’ on relatively weak conditions, while still typically interpreting them in a much stronger way: the expectation of C given A is high.
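As a worked illustration of how far these limits can lie from the relative frequency, take purely hypothetical values (an assumption for the sake of the example, not data from any of the cited studies): \(P(C|A) = 0.6\) and \(P(C|\lnot A) = 0.5\). Then

$$\begin{aligned} \Delta P^C_A = 0.6 - 0.5 = 0.1, \qquad \frac{\Delta P^C_A}{1 - P(C|\lnot A)} = \frac{0.1}{1 - 0.5} = 0.2 \end{aligned}$$

so both measures fall well below the relative frequency \(P(C|A) = 0.6\).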


In this paper we have proposed a uniform analysis of conditionals making use of a notion of ‘representativeness’. We have suggested that the proposed analysis can account for many examples standard analyses have problems with. The proposed analysis gives rise to rather weak acceptability conditions. We have suggested that the feeling that conditionals are typically interpreted in a stronger way is due to the way expectations are formed, via something like the Rescorla and Wagner (1972) competition-based learning rule. In this way, the intuition that the acceptance of the conditional ‘goes by’ conditional expectation can be explained as well.