1 Introduction

Conditional independence of sequential observations is a common assumption made to justify relatively simple estimators, such as maximum likelihood estimators that posit no dynamic relationship between current observations and past ones. For naturally occurring data (such as most survey data), there is a well-founded concern that sequences of decisions made by the same observational unit are probably not conditionally independent of one another, and much econometric innovation addresses this problem for the kinds of econometric models commonly used for naturally occurring data.

Laboratory choice data might also display conditional dependence. Experimenters (both in economics and psychology) frequently estimate choice functions from such data and (to my knowledge, and I am no exception) universally assume conditional independence of choices in choice sequences when they build likelihood functions to do this. Again, to my knowledge, there have been no direct tests for conditional dependence in any purpose-designed experiment. I provide such a test here, testing conditional independence against a relatively simple restricted conditional dependence alternative hypothesis.

Conditional independence is a statistical and econometric assumption about conditional probabilities. Over the last four decades, laboratory methods have been developed that are based closely on decision-theoretic “independence” axioms of various types. The decision-theoretic design of laboratory mechanisms, such as the random problem selection or RPS mechanism, can be justified by the “compound independence axiom” (Segal, 1990) or other decision-theoretic assumptions. When an experimenter employs such mechanisms, she means to make a choice now “independent” of decision problems her subject has already encountered in the laboratory session, but in a decision-theoretic sense of the word “independence” rather than any statistical or econometric meaning of the word “independence.”

A long history of experimental work (beginning perhaps with Starmer and Sugden (1991) and continuing through Brown and Healy (2018)) examines these decision-theoretic senses of independence (or as Brown and Healy wish to frame this, the statewise monotonicity axiom discussed by Azrieli et al. (2018)). In statistical terms, that long experimental literature focused on the behavior of marginal choice probabilities within a mechanism, asking whether the presence or absence of other decision problems (within the mechanism) affected observed choice proportions in a given decision problem.

The econometric and statistical sense of the term “conditional independence” concerns conditional, not marginal, choice probabilities. Yet decision-theoretic axioms such as the compound independence axiom or CIA of expected utility and other theories (or the statewise monotonicity axiom) may suggest that, in sequences of decision problems embedded within the RPS mechanism, a choice now should not only be independent of previous decision problems but also independent of previous choices. Therefore, I ask whether or not conditional independence (in its econometric and statistical sense) appears to be satisfied in a certain kind of decision-making experiment that is common in contemporary work on choice function estimation. Such experiments employ the RPS mechanism as well as other features that seem necessary to obtain an empirical version of decision theoretic independence (which I define shortly and call behavioral incentive compatibility or BIC).

Related work by Hey and Lee (2005a, b) and Hey and Zhou (2014) tests whether subjects appear to be optimizing one grand function of all (or some) decisions in a sequence (not clearly a sufficient condition for conditional dependence), and those tests suggest that subjects do not do that. But conditional dependence could arise from other sources, such as autocorrelated random preference parameter processes. The tests of Hey and Lee, and Hey and Zhou, depend on assumed structural models of risk preference. My test here depends on an identifying restriction but makes no assumptions concerning any specific underlying preference structure.

Within the limits of this experiment’s identifying restriction and designed power to detect deviations from conditional independence, conditional independence is not rejected. A substantial number of scholars may see this as good news since it has been very common practice to assume conditional independence when constructing likelihood functions for the estimation and analysis of choice functions from laboratory choice data (e.g., Andersen et al., 2008; Hey & Orme, 1994; Loomes et al., 2002; Rieskamp, 2008; Wilcox, 2008, 2011). My experimental results here suggest this has not been mistaken practice.

2 Definition of an experiment and its common contemporary features

In this article the term experiment refers to a specific class of choice experiments pioneered by Hey and Orme (1994) and common in experimental decision research. Here an experiment \(\mathcal{E}=\langle\Omega_1^i,\Omega_2^i,\dots,\Omega_J^i\rangle\) means a sequence of trials \(j\in \{1,2,\dots ,J\}\) where each subject \(i\in \{1,2,\dots ,I\}\) chooses from a pair \({{\Omega }}_{j}^{i}=\bigl\{{R}_{j}^{i},{S}_{j}^{i}\bigr\}\) of lotteries. A lottery \({R}_{j}\) means a one-stage tabled probability distribution \(({r}_{lj},{r}_{mj},{r}_{hj})\) over a vector \(({l}_{j},{m}_{j},{h}_{j})\) of three possible money outcomes \(z\in {\mathbb{R}}^{+}\) where \({l}_{j}<{m}_{j}<{h}_{j}\). A one-stage tabled probability distribution is a probability measure over three mutually exclusive and exhaustive events, determined by one (and only one) simple random device such as a single throw of a six-sided die (as employed in my experiment). This rules out resolution of uncertainty by means of a sequence of two or more simple random devices (it rules out multi-stage probability distributions), a restriction that seems empirically relevant (more on this shortly). Finally, neither \({R}_{j}^{i}\) nor \({S}_{j}^{i}\) first order stochastically dominates the other in any pair \({{\Omega }}_{j}^{i}\). The choice situation is a “forced choice” in which no indifference response is permitted, so let \({c}_{j}^{i}=1\) if subject \(i\) chooses \({R}_{j}^{i}\) from \({{\Omega }}_{j}^{i}\) and \({c}_{j}^{i}=0\) if she chooses \({S}_{j}^{i}\) from \({{\Omega }}_{j}^{i}\).

Within each pair \({{\Omega }}_{j}=\bigl\{{R}_{j},{S}_{j}\bigr\}\), \({R}_{j}\) is relatively risky compared to the relatively safe \({S}_{j}\), meaning \({s}_{mj}>{r}_{mj}\), \({r}_{lj}>{s}_{lj}\), and \({r}_{hj}>{s}_{hj}\): \({R}_{j}\) has higher probabilities of the low and high outcomes \({l}_{j}\) and \({h}_{j}\), while \({S}_{j}\) has a higher probability of the middle outcome \({m}_{j}\). This conventional terminology is only descriptive (“safe” does not mean better than “risky”) and should not be confused with mean-preserving spreads (“risky” may or may not be a mean-preserving spread of “safe”).

In an experiment, each page (in the case of a physical booklet presentation) or each screen (in the case of a computer presentation) presents exactly one pair: Call this feature separated decisions or SED. An experiment also features the RPS mechanism meant to motivate subjects without creating unwanted portfolio or wealth effects across the trial sequence (Grether & Plott, 1979). After all \(J\) choices have been made by subject \(i\), a random device selects just one trial \({j}^{*}\) (every trial has an equal \({J}^{-1}\) chance of selection). Then subject \(i\) plays out only her chosen lottery in trial \({j}^{*}\) using a second random device, and this is her sole payment from her choices. Subjects may also receive a fixed payment simply for showing up on time for an experiment but this is not connected to the choices they make.

Under either the CIA of expected utility and other theories (Segal, 1990), the isolation effect of prospect theory (Kahneman & Tversky, 1979), or the statewise monotonicity axiom defined by Azrieli et al. (2018), experiments featuring RPS should achieve what I call BIC. Consider a \(J\) pair experiment \(\mathcal{E}=\langle\Omega_1^i,\Omega_2^i,\dots,\Omega_j^i,\dots,\Omega_J^i\rangle\) and a one pair experiment \(\mathcal{E}^{\circ}=\langle\Omega_{1^\circ}^{i}\rangle\) where \({{\Omega }}_{{1}^{^\circ }}^{i}\equiv {{\Omega }}_{j}^{i}\) are the same pair: BIC holds if \(P\bigl({c}_{j}^{i}=1\bigr)=P\bigl({c}_{{1}^{^\circ }}^{i}=1\bigr)\). Put differently, BIC holds when the probability of choosing \({R}_{j}^{i}\) in a \(J\) pair experiment equals the probability of choosing \({R}_{{1}^{^\circ }}^{i}\) in an experiment presenting only that pair. Notice that BIC concerns marginal probabilities rather than conditional probabilities. I doubt whether a one pair experiment would ever elicit true preference from a completely novice subject (Wilcox, 2021), but this belief is widespread and shapes many experimental studies of incentive compatibility.

Brown and Healy (2018) fail to reject BIC in experiments as defined here when the experiment features both SED and RPS. Baltussen et al. (2012) show that BIC can fail when trials are not choices from lottery pairs (in particular, where each trial is a sequence of decisions in a multi-stage risky choice game). Harrison and Swarthout (2014) and Cox et al. (2015) find evidence against BIC (with both SED and RPS) using some structural tests. Brown and Healy contend that most such evidence is best interpreted as framing effects rather than a failure of BIC. With some caution, since scholars disagree, I follow Brown and Healy’s view: My new experiment uses both SED and RPS and no multi-stage lotteries (in deference to Baltussen et al.’s findings). More generally, this experiment is not about achieving BIC, which is defined in terms of marginal choice probabilities and concerns measurement of “true” risk preferences, but rather about conditional independence, which concerns conditional choice probabilities.

3 Purpose of the new experiment

The simplest model of choice probabilities \(P\bigl({c}_{j}^{i}=1\bigr)={h}^{i}\bigl({{\Omega }}_{j}^{i}\bigr)\) does condition on the offered pair \({{\Omega }}_{j}^{i}\) and subject \(i\)’s choice function \({h}^{i}(\cdot)\) (and sometimes a population choice function \(h(\cdot)\)). A choice function maps the offered pair and subject \(i\)’s preference order into a probability of observing \({c}_{j}^{i}=1\). The absence of previous choices in \({h}^{i}(\cdot)\) is the relevant conditional independence assumption that greatly simplifies construction of the likelihood of a choice sequence \({c}^{i}=({c}_{1}^{i},{c}_{2}^{i},\dots ,{c}_{J}^{i})\) and minimizes the number of parameters to be estimated. Behavioral economists (and psychologists) widely make this assumption for likelihood-based analysis of choice sequences (e.g., Andersen et al., 2008; Hey & Orme, 1994; Loomes et al., 2002; Rieskamp, 2008; Wilcox, 2008, 2011). In general choices may be conditionally dependent: True choice probabilities would then be \(P\bigl({c}_{j}^{i}=1\bigr)={g}^{i}\bigl({{\Omega }}_{j}^{i}|{c}_{j-1}^{i},{c}_{j-2}^{i}{,\dots ,c}_{1}^{i}\bigr)\not\equiv {h}^{i}\bigl({{\Omega }}_{j}^{i}\bigr)\).

Here, I test the null hypothesis of conditional independence against a restricted conditional dependence given by \(P\bigl({c}_{j}^{i}=1\bigr)={f}^{i}\bigl({{\Omega }}_{j}^{i}|{c}_{j-1}^{i}\bigr)\). This restricts conditional dependence to the immediately preceding choice \({c}_{j-1}^{i}\). I view this as the most likely place to find conditional dependence if it is present, and so a good place to start with experimental tests. This form of conditional dependence informs my experimental design, data analysis, and power planning detailed in the Appendix.
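To make the econometric contrast concrete, the following is a minimal Python sketch (entirely illustrative; the probabilities and the five-trial sequence are hypothetical numbers, not estimates or data from this experiment) of how a log-likelihood of a binary choice sequence is built under conditional independence, \(h^{i}(\Omega_j^i)\), versus under the restricted first-order dependence \(f^{i}(\Omega_j^i|c_{j-1}^i)\):

```python
import math

def loglik_independent(choices, p):
    """Log-likelihood of a 0/1 choice sequence under conditional
    independence: trial j is Bernoulli with risky-choice probability
    p[j], regardless of any earlier choices."""
    return sum(math.log(p[j] if c == 1 else 1.0 - p[j])
               for j, c in enumerate(choices))

def loglik_first_order(choices, p_after_risky, p_after_safe, p_first):
    """Log-likelihood under restricted first-order dependence:
    trial j's risky-choice probability depends only on the
    immediately preceding choice c_{j-1}."""
    ll = math.log(p_first if choices[0] == 1 else 1.0 - p_first)
    for j in range(1, len(choices)):
        p = p_after_risky[j] if choices[j - 1] == 1 else p_after_safe[j]
        ll += math.log(p if choices[j] == 1 else 1.0 - p)
    return ll

# Hypothetical illustration: five trials with constant probabilities.
choices = [1, 1, 0, 1, 0]
ll_h = loglik_independent(choices, [0.6] * 5)
ll_f = loglik_first_order(choices, [0.7] * 5, [0.5] * 5, 0.6)
```

Note that the first-order likelihood collapses to the conditionally independent one whenever `p_after_risky` equals `p_after_safe`, which is exactly the restriction the conditional independence assumption imposes.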

Henceforth I suppress explicit conditioning on \({{\Omega }}_{j}^{i}\), taking it as implicit that all choice probabilities are conditioned on the offered pair and subject \(i\)’s specific choice function \({f}^{i}(\cdot)\). Subsequently, then, \(P\bigl({R}_{j}^{i}|{R}_{j-1}^{i}\bigr)\) will mean \({f}^{i}\bigl({{\Omega }}_{j}^{i}|1\bigr)\) and \(P\bigl({R}_{j}^{i}|{S}_{j-1}^{i}\bigr)\) will mean \({f}^{i}\bigl({{\Omega }}_{j}^{i}|0\bigr)\), while \(P\bigl({R}_{j}^{i}\bigr)\) written without any condition will mean the marginal probability that \({c}_{j}^{i}=1\) given that subject \(i\) chooses from pair \({{\Omega }}_{j}^{i}\).

4 Experimental design

Let \(t\) and \(\tau \in \{1,2,\dots ,50\}\) index two distinct sequences of 50 choice pairs, the \(t\) sequence and the \(\tau\) sequence. The design presents each subject with these two sequences, for \(J=100\) total choice pairs. The order of presentation of the \(t\) and \(\tau\) sequences is varied across subjects: Let \({\mathcal{O}}_{1}\) and \({\mathcal{O}}_{2}\) denote sets of subjects \(i\) who receive the \(t\) sequence or \(\tau\) sequence first, respectively. The two sequences are separated by a short unpaid survey (as described below, the survey just gives subjects a short break between the sequences that may make the experiment’s identifying restriction more plausible; responses to survey questions are of no interest here).

The two sequences hold the same 12 target pairs within them in exactly the same order. The target pairs have indices \(t\) and \(\tau\) in \(\mathcal{T}=\left\{10,13,16,19,22,25,28,31,34,37,40,43\right\}\). This yields a test and (fifty pairs later) a subsequent retest of choice from each target pair. Target pairs are identical across the two sequences. For example, target pairs \(t=10\) and \(\tau =10\) are exactly the same choice pair.

Conditioning pairs with indices \(t\) and \(\tau\) in \(\mathcal{C}=\left\{9,12,15,18,21,24,27,30,33,36,39,42\right\}\) immediately precede each target pair. These pairs differ across the \(t\) and \(\tau\) sequences. For example, conditioning pairs \(t=9\) and \(\tau =9\) (presented just before their common target pair \(t=\tau =10\)) are different choice pairs: In pair \(t=9\), \({R}_{t}\) is meant to be more attractive than \({S}_{t}\) (call this a high conditioning pair) for most subjects, while in pair \(\tau =9\), \({S}_{\tau }\) is meant to be more attractive than \({R}_{\tau }\) (call this a low conditioning pair) for most subjects. Judgments concerning this were based on a previous experiment sampling the same subject population. This manipulation makes it likely that any subject comes to the two presentations of identical target pair \(t=\tau =10\) with two different choice histories (different choices at \(t=\tau =9\)). Similarly, for each \(t=\tau \in \mathcal{T}\), a high conditioning pair immediately precedes \(t\) or \(\tau\) while a low conditioning pair immediately precedes the other matched target pair. Table 1 shows that this manipulation was largely successful.

Table 1 Choice percentages (of 204 subjects) in conditioning pairs

Figure 1 illustrates the overall trial sequence in the experiment. Notice the additional presence of buffer pairs, which serve several design purposes. First, a buffer pair separates each pair of conditioning and target pairs from the next such pair of pairs (see panel B of Fig. 1). Second, both the \(t\) and \(\tau\) sequences begin and end with seven buffer pairs. This gives subjects a (short) warm-up prior to presentation of pairs of conditioning and target pairs and additionally keeps these pairs away from the ends of sequences (when subjects might begin relaxing their concentration). Appendix Table A1 lists all of the choice pairs.

Fig. 1
figure 1

The experiment sequence for the subjects \(i\in {\mathcal{O}}_{1}\) (receiving the \(t\) sequence first)

5 Hypotheses and data analysis

In Eqs. 1, 2, 3 and 4 below I index pairs by locations \(k=l=m\in \mathcal{T}\), with exactly one of \(k\) or \(l\) in the \(t\) sequence and the other in the \(\tau\) sequence. That is, both \(k\) and \(l\) are the same target pair location \(m\), one in the \(t\) sequence and the other in the \(\tau\) sequence and, for the time being, which is which remains unspecified. The experimental design implies that one of \(k\) or \(l\) follows a high conditioning pair while the other follows a low conditioning pair. With all this in mind, conditionally independent and identically distributed trials imply that

$$\begin{array}{c}P\left({R}_{k}^{i}\cap {R}_{k-1}^{i}\right)/P\left({R}_{k-1}^{i}\right)\equiv P\left({R}_{k}^{i}|{R}_{k-1}^{i}\right)=P\left({R}_{l}^{i}|{S}_{l-1}^{i}\right)\equiv P\left({R}_{l}^{i}\cap {S}_{l-1}^{i}\right)/P\left({S}_{l-1}^{i}\right)\end{array}.$$
(1)

Rearrange the left-most and right-most terms of Eq. 1 to get the null hypothesis

$$\begin{array}{c}{H}_{0}\mathrm{:}\; P\left({R}_{k}^{i}\cap {R}_{k-1}^{i}\right)P\left({S}_{l-1}^{i}\right)-P\left({R}_{l}^{i}\cap {S}_{l-1}^{i}\right)P\left({R}_{k-1}^{i}\right)=0.\end{array}$$
(2)

To test this null, define these twelve data-derived within-subject differences for each subject \(i\):

$$\begin{array}{c}{y}_{m}^{i}=1\left({c}_{k}^{i}=1\cap {c}_{k-1}^{i}=1\right)\cdot 1\left({c}_{l-1}^{i}=0\right)-1\left({c}_{l}^{i}=1\cap {c}_{l-1}^{i}=0\right)\cdot 1\left({c}_{k-1}^{i}=1\right).\end{array}$$
(3)

Adopt the indexing convention that, when it is possible to do so, the target pair indices \(k\) and \(l\) are assigned to the \(t\) and \(\tau\) sequences so that \({c}_{k-1}^{i}=1\) and \({c}_{l-1}^{i}=0\). (Notice that whenever this is not possible, \({y}_{m}^{i}=0\) regardless of the assignment of those indices.) The design’s conditioning pair features are meant to make \(\left({R}_{k-1}^{i}\cap {S}_{l-1}^{i}\right)\) a likely event in the data for \(k=l=m\in \mathcal{T}\). Table 2 shows the experiment’s joint distributions of safe and risky choices in pairs of high and low conditioning pairs: The sum of the off-diagonal cells in these tables gives the percentage of subjects for whom \(k\) and \(l\) can be assigned such that events \(\left({R}_{k-1}^{i}\cap {S}_{l-1}^{i}\right)\) occur, and shows that these events are common in the data, as intended.
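The within-subject difference of Eq. 3 and the indexing convention just described can be sketched in Python (a minimal illustration of the definitions above, not the author’s analysis code; variable names are mine):

```python
def y_m(c_k, c_km1, c_l, c_lm1):
    """Eq. 3 for one matched target-pair location m: c_k and c_l are
    the two target choices (1 = risky, 0 = safe), c_km1 and c_lm1 the
    choices in the immediately preceding conditioning pairs."""
    term1 = int(c_k == 1 and c_km1 == 1) * int(c_lm1 == 0)
    term2 = int(c_l == 1 and c_lm1 == 0) * int(c_km1 == 1)
    return term1 - term2

def y_m_with_convention(c_t, c_tm1, c_tau, c_taum1):
    """Apply the indexing convention: assign (k, l) across the t and
    tau sequences so that c_{k-1} = 1 and c_{l-1} = 0 when possible.
    When it is not possible, y_m = 0 under either assignment."""
    if c_tm1 == 1 and c_taum1 == 0:
        return y_m(c_t, c_tm1, c_tau, c_taum1)
    if c_taum1 == 1 and c_tm1 == 0:
        return y_m(c_tau, c_taum1, c_t, c_tm1)
    return 0  # conditioning choices equal: both terms of Eq. 3 vanish
```

Averaging twelve such values over \(m\in \mathcal{T}\) gives the subject-level observation \(y^{i}\); under the convention, a positive value records a risky target choice after a risky conditioning choice but not after a safe one.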

Table 2 Empirical joint distribution of choices in high and low conditioning pairs (percentages of 204 subjects)

To know the expected value of each \({y}_{m}^{i}\), I need an identifying restriction:

Identifying Restriction: \({R}_{k}^{i}\cap {R}_{k-1}^{i}\) and \({S}_{l-1}^{i}\) are conditionally independent, and \({R}_{l}^{i}\cap {S}_{l-1}^{i}\) and \({R}_{k-1}^{i}\) are conditionally independent.

This identifying restriction is implied by both the null and alternative hypotheses. Beyond the specifics of the null and alternative hypotheses, the restriction requires that at a remove of fifty trials there is no dependence between the test and the retest of the same target pair and the conditioning pairs preceding them. The design’s survey break between the \(t\) and \(\tau\) sequences is meant to enhance the plausibility of this “no memory” assumption between the two sequences. Under this assumed restriction,

$$\begin{array}{c}E\left[{y}_{m}^{i}\right]=P\left({R}_{k}^{i}\cap {R}_{k-1}^{i}\right)P\left({S}_{l-1}^{i}\right)-P\left({R}_{l}^{i}\cap {S}_{l-1}^{i}\right)P\left({R}_{k-1}^{i}\right).\end{array}$$
(4)

Therefore, defining the observation from each subject \(i\) as \({y}^{i}=\frac{1}{12}{\sum }_{m\in \mathcal{T}}{y}_{m}^{i}\), a one-sample test against a zero location of the \({y}^{i}\) tests the null of Eq. 2 against the alternative of conditional dependence.
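Such a one-sample test statistic is straightforward; here is a minimal plain-Python sketch (illustrative only), taking the per-subject averages \(y^{i}\) as input:

```python
import math

def one_sample_t(ys):
    """One-sample t-statistic for H0: E[y] = 0, given the vector of
    per-subject averages y^i. Returns (mean, standard error, t)."""
    n = len(ys)
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / (n - 1)  # sample variance
    se = math.sqrt(var / n)
    return mean, se, mean / se
```

With a sample of 204 subjects, this statistic is compared to a t (or, to close approximation, standard normal) critical value for a two-tailed test.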

Given the construction of \({y}^{i}\) detailed above (especially the indexing convention), nonzero values of \({y}^{i}\) are evidence favoring one of two alternatives. When \({y}^{i}>0\), relatively risky choices are more common (for subject \(i\)) when preceded by a relatively risky choice than when preceded by a relatively safe choice: On average we observe persistence of the choices of subject \(i\). When \({y}^{i}<0\), relatively risky choices are less common when preceded by a relatively risky choice than when preceded by a relatively safe choice: On average we observe alternation of the choices by subject \(i\).

A simple one-parameter odds ratio model of conditional dependence (e.g., Carey et al., 1993; Lipsitz et al., 1991) captures both possibilities (persistence or alternation), and this model motivated the experimental design and informed my power analysis. The Appendix contains that power analysis, which is for a two-tailed t-test against the null hypothesis of Eq. 2, at a size of 5%, given effect sizes described in the Appendix. To obtain power of 90%, the analysis recommends a sample size of 200 subjects. The actual sample size is 204 subjects, with half in the \({\mathcal{O}}_{1}\) pair ordering and the other half in the \({\mathcal{O}}_{2}\) ordering.
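The Appendix derives effect sizes from the odds ratio model; as a generic sketch of the sample-size arithmetic for a two-tailed test at size 5% and power 90%, one can use the standard normal approximation \(n \approx \bigl((z_{1-\alpha/2}+z_{\text{power}})/d\bigr)^{2}\). The effect size 0.23 below is my illustrative assumption (the Appendix reports effect sizes never quite reaching 0.25), under which the approximation lands near the recommended 200 subjects:

```python
from math import ceil
from statistics import NormalDist

def n_for_power(effect_size, alpha=0.05, power=0.90):
    """Normal-approximation sample size for a two-tailed one-sample
    test of H0: mean = 0 at the given size and power:
    n = ((z_{1-alpha/2} + z_{power}) / d)^2, rounded up."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    return ceil(((z(1 - alpha / 2) + z(power)) / effect_size) ** 2)
```

For example, `n_for_power(0.23)` gives 199, consistent with the recommended sample of 200.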

The above construction of the null hypothesis and the observation \({y}^{i}\) for testing it assumes not only conditional independence but also identically distributed trials of target pair choices across the \(t\) and \(\tau\) sequences. Several studies have shown a simple drift across trials toward more risk aversion (e.g., Ballinger & Wilcox, 1997; Hey & Orme, 1994; Loomes & Sugden, 1998), so to check for it define the observation

$$x^i=\frac1{12}\biggl[\mathbf{1}\left(i\in {\mathcal O}_1\right)\sum\limits_{t=\tau\in\mathcal T}\left(c_t^i-c_\tau^i\right)+\mathbf{1}\left(i\in {\mathcal O}_2\right)\sum\limits_{t=\tau\in\mathcal T}\left(c_\tau^i-c_t^i\right)\biggr],$$
(5)

which is just the difference between subject \(i\)’s observed risky choice proportions in her first and second trials of the target pairs. Figure 2 displays the empirical cumulative distribution function of \({x}^{i}\) across the experiment’s 204 subjects. The sample mean of \({x}^{i}\) and that mean’s standard error are −0.0041 and 0.0093, respectively, suggesting an absence of significant simple drift in the new experiment.
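Because both indicator terms of Eq. 5 reduce to a first-presented minus second-presented difference, the drift observation can be sketched as (illustrative only; the function takes the target choices in presentation order, so it applies to \({\mathcal{O}}_{1}\) and \({\mathcal{O}}_{2}\) subjects alike):

```python
def drift_x(first_seq_targets, second_seq_targets):
    """Eq. 5: difference between a subject's risky-choice proportions
    in the first-presented and second-presented target pairs, matched
    by target location (1 = risky choice, 0 = safe choice)."""
    n = len(first_seq_targets)
    assert len(second_seq_targets) == n  # matched target locations
    return sum(a - b for a, b in
               zip(first_seq_targets, second_seq_targets)) / n
```

A positive \(x^{i}\) records more risky choices early than late, so a drift toward risk aversion would show up as a positive sample mean.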

Fig. 2
figure 2

Empirical CDF of differences between risky choice proportions across the first and second sequences presented (204 subjects)

Figure 3 displays the empirical cumulative distribution function of \({y}^{i}\) across the experiment’s 204 subjects. The sample mean of \({y}^{i}\) and that mean’s standard error are 0.0069 and 0.0084, respectively, yielding a \(t\)-statistic with absolute value less than one, so there is no significant violation of conditional independence in the new experiment. The statistic is positive, suggesting that if there is any conditional dependence here, it is perhaps a bit of persistence of choice.

Fig. 3
figure 3

Empirical cumulative distribution function of \({y}^{i}\) across 204 subjects, the critical test observation against the null of conditional independence

6 Conclusions

It appears that when an experimenter uses certain contemporary experimental mechanisms and features in an individual choice experiment, conditional independence of observed choices is an acceptable assumption. To my knowledge, the new experiment reported here is the first direct test of conditional independence, though the tests reported by Hey and Lee (2005a, b), and Hey and Zhou (2014), may weigh in favor of conditional independence as well. And perhaps this does not need emphasis, but neither my evidence nor that of Hey and his co-authors says anything at all about other decision experiments where choice problems are not pairs of one-stage lotteries, or RPS and SED are not features of the experiment. Nor does this evidence say anything about other sorts of experiments such as repeated games or repeated markets. Other scholars could investigate the status of conditional independence in these other kinds of experiments.

My data analysis and experimental design depended on two things: First, no drift in choice probabilities across the two trials of my target choice pairs (which appears to be empirically acceptable); and second, an identifying restriction—in essence that at a remove of about fifty intervening trials there is no conditional dependence. I have no test of that assumption, but believe it is defensible. The two oldest facts from human memory research are the primacy and recency effects. The recency effect suggests that if there is any conditional dependence, we should probably expect to detect it in recently past choices (say one or two trials ago) rather than at a remove of fifty trials past. The primacy effect is that the earliest events or stimuli in a sequence are more likely to be remembered. My experimental design pads the front end of each choice sequence (the earliest trials, most exposed to any primacy effect) with seven buffer pairs not used in my test. However, I accept that there is room for doubt about my identifying restriction.

A helpful referee pointed out that my test is a population-level test. It is certainly possible that some members of the population exhibit persistence while others exhibit alternation, and that in the aggregate of a population test these cancel each other out. This could be examined by a structural estimation looking for significant variance of the odds ratio parameter \(\gamma\) (see the Appendix) across subjects. I did not choose the experimental design here to yield a high-power test against zero variance of \(\gamma\) across subjects, and my goal in this work has been to carry out a test without any structural estimation. But obviously this is an interesting question since many scholars estimate models one subject at a time (e.g., Hey & Orme, 1994). Accordingly it is certainly worth exploring whether some subjects display either persistence or alternation in our subject populations.

Behavioral econometricians and psychometricians frequently assume conditional independence when they construct their likelihood functions for structural estimation of choice functions from discrete choice sequences observed in laboratory experiments. They may take some comfort from my results—assuming, of course, that their experiment employs RPS, and SED, and that their subjects’ choices are from pairs of one-stage lotteries. For the rest, we await new experiments testing conditional independence in other experimental situations.

7 Appendix

An odds ratio model (Carey et al., 1993; Lipsitz et al., 1991) of restricted conditional dependence guided my power analysis for designing the experiment:

Constant odds ratio of four joint probabilities parameterized by the constant \(\gamma >0\):

$$\begin{array}{c}\gamma =P\bigl({R}_{j}^{i}\cap {R}_{j-1}^{i}\bigr)\cdot P\bigl({S}_{j}^{i}\cap {S}_{j-1}^{i}\bigr)/\bigl[P\bigl({R}_{j}^{i}\cap {S}_{j-1}^{i}\bigr)\cdot P\bigl({S}_{j}^{i}\cap {R}_{j-1}^{i}\bigr)\bigr]>0.\end{array}$$
(A.1)

The four joint probabilities add up to unity (probability theory identity):

$$\begin{array}{c}1=P\bigl({R}_{j}^{i}\cap {R}_{j-1}^{i}\bigr)+P\bigl({S}_{j}^{i}\cap {S}_{j-1}^{i}\bigr)+ P\bigl({R}_{j}^{i}\cap {S}_{j-1}^{i}\bigr)+P\bigl({S}_{j}^{i}\cap {R}_{j-1}^{i}\bigr).\end{array}$$
(A.2)

Pairs of joint probabilities add up to marginal probabilities (probability theory identities):

$$\begin{array}{c}P\bigl({R}_{j}^{i}\bigr)=P\bigl({R}_{j}^{i}\cap {R}_{j-1}^{i}\bigr)+P\bigl({R}_{j}^{i}\cap {S}_{j-1}^{i}\bigr),\end{array}$$
(A.3)
$$\begin{array}{c}P\bigl({R}_{j-1}^{i}\bigr)=P\bigl({R}_{j}^{i}\cap {R}_{j-1}^{i}\bigr)+P\bigl({S}_{j}^{i}\cap {R}_{j-1}^{i}\bigr).\end{array}$$
(A.4)

With given values of \(\gamma\), \(P\bigl({R}_{j}^{i}\bigr)\), and \(P\bigl({R}_{j-1}^{i}\bigr)\) in hand, Eqs. A.1, A.2, A.3 and A.4 imply the following quadratic equation in \(P\bigl({R}_{j}^{i}\cap {R}_{j-1}^{i}\bigr)\):

$$\begin{array}{c}\left(\gamma -1\right){\bigl[P\bigl({R}_{j}^{i}\cap {R}_{j-1}^{i}\bigr)\bigr]}^{2}+{\alpha }_{j}^{i}P\bigl({R}_{j}^{i}\cap {R}_{j-1}^{i}\bigr)+{\beta }_{j}^{i}=0,\end{array}$$
(A.5)

where \({\alpha }_{j}^{i}=\left(1-\gamma \right)\bigl[P\bigl({R}_{j}^{i}\bigr)+P\bigl({R}_{j-1}^{i}\bigr)\bigr]-1\) and \({\beta }_{j}^{i}=\gamma P\bigl({R}_{j}^{i}\bigr)\cdot P\bigl({R}_{j-1}^{i}\bigr)\).

When \(\gamma \ne 1\), the quadratic formula gives the roots of this equation. Only one root is well-behaved, in the sense that the solution \(P\bigl({R}_{j}^{i}\cap {R}_{j-1}^{i}\bigr)\) is always in \([0,1]\) \(\forall\; \gamma \ne 1\): It is

$$P\bigl({R}_{j}^{i}\cap {R}_{j-1}^{i}\bigr)=- 0.5\cdot \left({\alpha }_{j}^{i}+{\bigl[{\bigl({\alpha }_{j}^{i}\bigr)}^{2}-4\bigl(\gamma -1\bigr){\beta }_{j}^{i}\bigr]}^{0.5}\right){\bigl(\gamma -1\bigr)}^{-1}\; \forall\; \gamma \ne 1,$$
(A.6)

(and \(P\bigl({R}_{j}^{i}\cap {R}_{j-1}^{i}\bigr)= P\bigl({R}_{j}^{i}\bigr)\cdot P\bigl({R}_{j-1}^{i}\bigr)\) for \(\gamma =1\)).

The solution from Eq. A.6 allows a sequential solution for the other three joint probabilities using Eqs. A.2, A.3 and A.4:

$$\begin{array}{c}P\bigl({R}_{j}^{i}\cap {S}_{j-1}^{i}\bigr)=P\bigl({R}_{j}^{i}\bigr)-P\bigl({R}_{j}^{i}\cap {R}_{j-1}^{i}\bigr),\end{array}$$
(A.7)
$$\begin{array}{c}P\bigl({S}_{j}^{i}\cap {R}_{j-1}^{i}\bigr)=P\bigl({R}_{j-1}^{i}\bigr)-P\bigl({R}_{j}^{i}\cap {R}_{j-1}^{i}\bigr),\; \mathrm{and}\end{array}$$
(A.8)
$$\begin{array}{c}P\bigl({S}_{j}^{i}\cap {S}_{j-1}^{i}\bigr)=1-P\bigl({R}_{j}^{i}\cap {R}_{j-1}^{i}\bigr)-P\bigl({R}_{j}^{i}\cap {S}_{j-1}^{i}\bigr)-P\bigl({S}_{j}^{i}\cap {R}_{j-1}^{i}\bigr).\end{array}$$
(A.9)
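Equations A.5 through A.9 can be implemented directly. The following Python sketch (an illustration, not the author’s code) solves for the four joint probabilities from \(\gamma\) and the two marginals, using the well-behaved root of Eq. A.6 and the identities A.7 through A.9:

```python
import math

def joint_from_odds_ratio(gamma, p_j, p_jm1):
    """Solve Eq. A.5 for P(R_j and R_{j-1}) given the odds ratio gamma
    and marginals p_j = P(R_j), p_jm1 = P(R_{j-1}), taking the
    well-behaved root of Eq. A.6 (and the product p_j * p_jm1 when
    gamma = 1). Returns the four joint probabilities
    (p11, p10, p01, p00) for (R,R), (R,S), (S,R), (S,S)."""
    if gamma == 1.0:
        p11 = p_j * p_jm1  # conditional independence
    else:
        alpha = (1.0 - gamma) * (p_j + p_jm1) - 1.0
        beta = gamma * p_j * p_jm1
        disc = alpha ** 2 - 4.0 * (gamma - 1.0) * beta
        p11 = -0.5 * (alpha + math.sqrt(disc)) / (gamma - 1.0)
    # Eqs. A.7-A.9: remaining joints from the marginal identities.
    p10 = p_j - p11      # P(R_j and S_{j-1})
    p01 = p_jm1 - p11    # P(S_j and R_{j-1})
    p00 = 1.0 - p11 - p10 - p01
    return p11, p10, p01, p00
```

For instance, with \(\gamma =2\), \(P\bigl({R}_{j}^{i}\bigr)=0.5\), and \(P\bigl({R}_{j-1}^{i}\bigr)=0.85\), the odds ratio of the four solved joints recovers 2, and dividing the appropriate joints by the conditioning marginals gives the conditional probabilities graphed in the upper panels of Figs. A1 and A2.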

In turn, with given values of \(P\bigl({R}_{j}^{i}\bigr)\) and \(P\bigl({R}_{j-1}^{i}\bigr)\) in hand, Eqs. A.6 and A.7 then give solutions for the key conditional probabilities \(P\bigl({R}_{j}^{i}|{R}_{j-1}^{i}\bigr)\) and \(P\bigl({R}_{j}^{i}|{S}_{j-1}^{i}\bigr)\) given any value of \(\gamma\) one wishes to specify as an interesting alternative hypothesis. The upper panels of Figs. A1 and A2 graph these conditional probabilities for a \(t=\tau \in \mathcal{T}\) target pair where \(t-1\) is a high conditioning pair with \(P\left({R}_{t-1}^{i}\right)=0.85\) and \(\tau -1\) is a low conditioning pair with \(P\left({S}_{\tau -1}^{i}\right)=0.85\) (approximately reflecting the average results for conditioning pairs shown in Table 1). Figure A1 assumes that \(\gamma =2\) yielding persistence so that \(P\left({R}_{t}^{i}|{R}_{t-1}^{i}\right)-P\left({R}_{\tau }^{i}|{S}_{\tau -1}^{i}\right)>0\) at any common marginal probability \(P\left({R}_{t}^{i}\right)=P\left({R}_{\tau }^{i}\right)\) (shown on the horizontal axis). Figure A2 instead assumes that \(\gamma =0.5\) yielding alternation so that \(P\left({R}_{\tau }^{i}|{S}_{\tau -1}^{i}\right)-P\left({R}_{t}^{i}|{R}_{t-1}^{i}\right)>0\) at any common marginal probability \(P\left({R}_{t}^{i}\right)=P\left({R}_{\tau }^{i}\right)\).

The lower panels of Figs. A1 and A2 graph corresponding effect sizes. For example, to draw the lower panel of Fig. A1, one divides the difference \(P\left({R}_{t}^{i}|{R}_{t-1}^{i}\right)-P\left({R}_{\tau }^{i}|{S}_{\tau -1}^{i}\right)\) under the alternative hypothesis \(\gamma =2\) by the standard deviation \({\left(2P\left({R}_{t}^{i}\right)\left[1-P\left({R}_{t}^{i}\right)\right]\right)}^{0.5}\) of that difference under the null hypothesis that \(P\left({R}_{t}^{i}|{R}_{t-1}^{i}\right)=P\left({R}_{\tau }^{i}|{S}_{\tau -1}^{i}\right)= P\left({R}_{t}^{i}\right)\) (which is \(\gamma =1\)). The figures reveal that these effect sizes are on the small side. Cohen’s (1988) convention for these kinds of effect sizes calls 0.2 and 0.5 small and medium effect sizes, and those in the figures never quite reach 0.25 regardless of the common marginal probability \(P\left({R}_{t}^{i}\right)=P\left({R}_{\tau }^{i}\right)\). This is one reason for the repeated measurement of the design (that is, why there are twelve pairs of target and conditioning pairs in each sequence, providing twelve values \({y}_{m}^{i}\) which are then averaged within each subject to yield the overall within-subject measure \({y}^{i}\) of deviations from conditional independence for each subject \(i\)).

Fig. A1
Conditional probabilities and effect size implied by \(P\left({R}_{t-1}^{i}\right)=P\left({S}_{\tau -1}^{i}\right)=0.85\) and persistence (\(\gamma =2\))

The figures also reveal an asymmetry relevant to the experimental design. Under the alternative hypothesis of persistence (\(\gamma =2\)), the range of marginal probabilities achieving effect sizes of at least 0.2 is about 0.30 to 0.85. But under the alternative hypothesis of alternation (\(\gamma =0.5\)), the range of marginal probabilities achieving effect sizes of at least 0.2 is about 0.15 to 0.70. A compromise serving both alternative hypotheses is to (try to) choose target pairs with marginal probabilities in the range from about 0.30 to 0.70. On the other hand, some marginal probabilities outside this range are among those most useful for estimation of preferences (Manski & McFadden, 1981; Kanninen, 2002). In this design, I attempted to choose target pairs which, on the basis of past experimental results with the population I sample from in this experiment, would have population mean probabilities falling across most of the unit interval. Half of the twelve target pair tests and retests fall within the range from 0.30 to 0.70 mentioned above, with the other half more extreme.

Fig. A2
Conditional probabilities and effect size implied by \(P\left({R}_{t-1}^{i}\right)=P\left({S}_{\tau -1}^{i}\right)=0.85\) and alternation (\(\gamma =0.5\))

I want to choose a sample size for this experiment based on a previous experiment with the same subject population. There were 501 undergraduate subjects in this previous experiment, choosing from 72 lottery pairs on the outcome range $8 to $48, using a four-sided die as the chance device. This unpublished experiment was completed in January 2010 in collaboration with the late John Dickhaut. Using these data and assuming conditional independence in constructing the likelihood, I estimated a random parameters Rank Dependent Utility or RDU model (Quiggin, 1982, 1993). RDU is essentially the same as Tversky and Kahneman’s (1992) Cumulative Prospect Theory limited to lotteries over gains (positive outcomes).

This yields an estimated distribution of preference parameter vectors \(\theta\) in the population of likely subjects at the university where the current experiment was done. I sample from this estimated distribution to choose a sample size with desired power 0.9 to reject the null of conditional independence when there is in fact conditional dependence of either strength \(\gamma \le 0.5\) or \(\gamma \ge 2\). The estimated distribution gives me the needed marginal choice probabilities while the Appendix Eqs. A.6, A.7, A.8 and A.9 give me the necessary joint choice probabilities under the alternative of conditional dependence of this strength.

With this estimation completed, I draw 1000 simulated subjects indexed by \(n\in \{\text{1,2},\dots ,1000\}\) from my estimated distribution of RDU parameters, and for design planning purposes I regard these 1000 simulated subjects as “the population” I sample from when I run an experiment. Each simulated subject is a vector \({\theta }^{n}=({\kappa }^{n},{\mu }^{n},{\omega }^{n},{\lambda }^{n})\) of four probabilistic RDU model parameters described below.

The parameter \({\kappa }^{n}\in \mathbb{R}\) is utility curvature in this HARA utility function:

$$\begin{array}{c}u\left(z|{\kappa }^{n}\right)={\left(1-{\kappa }^{n}\right)}^{-1}\left[-1+{\left(1+z\right)}^{1-{\kappa }^{n}}\right]\; \mathrm{for}\; {\kappa }^{n}\ne 1,\; \mathrm{and}\; u\left(z|{\kappa }^{n}\right)=\mathrm{ln}\left(1+z\right)\; \mathrm{for}\; {\kappa }^{n}=1.\end{array}$$
(A.10)
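Eq. A.10 translates directly into code; the only care needed is the logarithmic limit at \({\kappa }^{n}=1\). A minimal sketch:

```python
import math

def hara_utility(z, kappa):
    """HARA utility of Eq. A.10:
    u(z|kappa) = (1-kappa)^(-1) * [(1+z)^(1-kappa) - 1] for kappa != 1,
    with the logarithmic limit u(z|1) = ln(1+z) at kappa = 1."""
    if math.isclose(kappa, 1.0):
        return math.log(1.0 + z)
    return ((1.0 + z) ** (1.0 - kappa) - 1.0) / (1.0 - kappa)
```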

The parameters \({\mu }^{n}\in \left(\text{0,1}\right)\) and \({\omega }^{n}\in (0,\infty )\) are elevation and curvature parameters of this Beta weighting function:

$$\begin{array}{c}w\left(G|{\mu }^{n},{\omega }^{n}\right)=B\left(G|{a}^{n},{b}^{n}\right)\; \mathrm{where}\; {a}^{n}={\mu }^{n}{\omega }^{n}\; \mathrm{and}\; {b}^{n}=\left(1-{\mu }^{n}\right){\omega }^{n},\end{array}$$
(A.11)

where \(G\) is the decumulative probability distribution function of a lottery, and \(B\left(x|{a}^{n},{b}^{n}\right)\) is the cumulative distribution function of the Beta distribution.
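In practice one would evaluate the Beta CDF with a library incomplete-beta routine; the self-contained numerical stand-in below (Simpson-rule integration of the Beta density) is adequate whenever \({a}^{n},{b}^{n}\ge 1\) and serves to make Eq. A.11 concrete.

```python
import math

def beta_weight(G, mu, omega, n=2000):
    """Eq. A.11's weighting function: w(G) = Beta CDF at G with
    a = mu*omega and b = (1-mu)*omega.  Simpson-rule integration of the
    Beta density -- a stand-in for a library incomplete-beta routine,
    adequate when a, b >= 1 (n must be even)."""
    a, b = mu * omega, (1.0 - mu) * omega
    if G <= 0.0:
        return 0.0
    if G >= 1.0:
        return 1.0
    density_norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))  # 1/B(a, b)
    f = lambda x: x ** (a - 1.0) * (1.0 - x) ** (b - 1.0)  # unnormalized density
    h = G / n
    s = f(0.0) + f(G)
    for i in range(1, n):
        s += (4.0 if i % 2 else 2.0) * f(i * h)
    return density_norm * s * h / 3.0
```

With \({\mu }^{n}=0.5\) and \({\omega }^{n}=2\) the weight collapses to the identity \(w(G)=G\), a convenient sanity check.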

The parameter \({\lambda }^{n}\in (0,\infty )\) is a precision or sensitivity parameter of the probabilistic RDU model of choice I use in the random parameters estimation. The RDU model of marginal probabilities is then

$$\begin{array}{c}P\bigl({R}_{j}^{n}|{\theta }^{n}\bigr)={\Lambda } \bigl({\lambda }^{n}{\Delta }{RDU}_{j}^{n}\bigr), \mathrm{where}\;{\Delta }{RDU}_{j}^{n}=RDU\bigl({R}_{j}^{n}\bigr)-RDU\bigl({S}_{j}^{n}\bigr)\end{array}.$$
(A.12)

\({\Lambda }\left(x\right)={\left[1+\mathrm{exp}\left(-x\right)\right]}^{-1}\) is the logistic cumulative distribution function, where:

$$RDU\bigl({R}_{j}^{n}\bigr)= {\pi }_{hj}({\mu }^{n},{\omega }^{n})+{\pi }_{mj}({\mu }^{n},{\omega }^{n}){v}_{j}\left({m}_{j}|{\kappa }^{n}\right),$$
$${\pi }_{hj}\left({\mu }^{n},{\omega }^{n}\right)=w\left({r}_{hj}|{\mu }^{n},{\omega }^{n}\right), {\pi }_{mj}\left({\mu }^{n},{\omega }^{n}\right)=w\left({r}_{hj}+{r}_{mj}|{\mu }^{n},{\omega }^{n}\right)-w\left({r}_{hj}|{\mu }^{n},{\omega }^{n}\right),$$

and

$${v}_{j}\left({m}_{j}|{\kappa }^{n}\right)=\left[u\left({m}_{j}|{\kappa }^{n}\right)-u\left({l}_{j}|{\kappa }^{n}\right)\right]/\left[u\left({h}_{j}|{\kappa }^{n}\right)-u\left({l}_{j}|{\kappa }^{n}\right)\right].$$

This specification of marginal RDU choice probabilities employs the contextual utility probabilistic choice model of Wilcox (2008, 2011) which is appropriate for three-outcome pairs of lotteries. These marginal probabilities, calculated for all \(J=100\) pairs in the design for each of the 1000 simulated subjects \(n\), are the choice probabilities under the null hypothesis \(\gamma =1\). Conditional choice probabilities may be calculated from them by way of Eqs. A.6 and A.7 for any assumed value of \(\gamma \ne 1\), providing choice probabilities under any alternative hypothesis.
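The pieces of Eqs. A.10 through A.12 combine as sketched below for a single three-outcome pair. For brevity the weighting and utility functions default to the identity and a linear utility (so the default collapses to expected utility with logit noise); in the full model, Eq. A.11's Beta weight and Eq. A.10's HARA utility would be passed in their place.

```python
import math

def _rdu(probs, v_m, w):
    """RDU value of a three-outcome lottery given as probabilities
    (r_h, r_m, r_l) on common outcomes h > m > l, with the contextual
    utility normalization v(h) = 1 and v(l) = 0."""
    r_h, r_m, _ = probs
    pi_h = w(r_h)                 # decision weight on the high outcome
    pi_m = w(r_h + r_m) - w(r_h)  # decision weight on the middle outcome
    return pi_h * 1.0 + pi_m * v_m

def rdu_choice_prob(outcomes, probs_R, probs_S, lam,
                    w=lambda G: G, u=lambda z: z):
    """Marginal probability of choosing R over S (Eq. A.12):
    P(R) = Logistic(lam * [RDU(R) - RDU(S)]), using the contextual utility
    normalization of the middle outcome's utility (v_j in the text)."""
    l, m, h = outcomes
    v_m = (u(m) - u(l)) / (u(h) - u(l))  # contextual utility normalization
    delta = _rdu(probs_R, v_m, w) - _rdu(probs_S, v_m, w)
    return 1.0 / (1.0 + math.exp(-lam * delta))
```

For example, on outcomes (8, 24, 48) a lottery shifting probability from the middle to the high outcome has higher RDU under the defaults, so its choice probability exceeds one half, and at \(\lambda =0\) the probability is exactly one half.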

Monte Carlo simulation can check the size of potential test statistics using the marginal probabilities (i.e., those that apply when the null hypothesis \(\gamma =1\) is true) as true choice probabilities. I draw 10,000 samples, each with \(N=200\) simulated subjects, from my population of simulated subjects. For each of those simulated subjects, I draw 100 Bernoulli variates \({c}_{j}^{n}\) based on their marginal choice probability as given by Eq. A.12. Then \({y}^{n}\) may be computed for each of the 200 simulated subjects in each sample, and then one may compute (in each sample) the p-values of test statistics against the null hypothesis in Eq. 2. For a nominal size of 5%, the actual sizes of t-tests, signed-rank tests, and sign tests from this Monte Carlo simulation are 5.06%, 5.15%, and 4.12%, respectively. In terms of size, both the t-tests and the signed-rank tests look quite good, whereas the sign tests appear to be somewhat conservative.
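The logic of such a size check can be miniaturized as follows. This is a toy stand-in, not the paper's simulation: \({y}^{n}\) is drawn directly as a mean-zero Normal variate under the null rather than constructed from simulated Bernoulli choice sequences, far fewer samples are drawn, and only the t-test is checked.

```python
import math
import random

def t_test_size(n_samples=2000, N=200, seed=1):
    """Monte Carlo size check of a two-sided one-sample t-test at nominal 5%.
    TOY STAND-IN: y^n is drawn as mean-zero Normal under the null instead of
    being built from simulated choice sequences as in the paper."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_samples):
        y = [rng.gauss(0.0, 1.0) for _ in range(N)]
        mean = sum(y) / N
        var = sum((v - mean) ** 2 for v in y) / (N - 1)
        t = mean / math.sqrt(var / N)
        if abs(t) > 1.96:  # normal critical value; accurate enough at N = 200
            rejections += 1
    return rejections / n_samples  # should land near the nominal 0.05
```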

Monte Carlo simulation can also check the power of potential test statistics, at various sample sizes, using the conditional probabilities (i.e., those that apply when the alternative hypotheses with \(\gamma \ne 1\) are true) as true choice probabilities. I draw 10,000 samples, each with \(N=200\) simulated subjects, from my population of simulated subjects. For each of those simulated subjects, Eq. A.12 is first used to compute marginal probabilities, and then Eqs. A.6 and A.7 are used to convert these into conditional probabilities with some \(\gamma \ne 1\). I draw 100 Bernoulli variates \({c}_{j}^{n}\) based on those conditional choice probabilities and the previous draw at \(j-1\) (each draw \({c}_{j-1}^{n}\) determines the conditional choice probability used to draw \({c}_{j}^{n}\)). Then \({y}^{n}\) may be computed for each of the 200 simulated subjects in each sample, and then one may compute (in each sample) the p-values of test statistics against the null hypothesis in Eq. 2.
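The sequential draw described here amounts to simulating a two-state Markov chain over choices. A sketch with fixed conditional probabilities (illustrative only: in the paper these probabilities vary pair by pair via Eqs. A.6 and A.7):

```python
import random

def draw_choice_sequence(J, p_given_R, p_given_S, p_first, rng):
    """Draw a binary choice sequence c_1..c_J in which each draw's success
    probability depends on the previous draw: the first draw uses the
    marginal p_first, and each later draw uses P(R_j|R_{j-1}) = p_given_R
    when c_{j-1} = 1 or P(R_j|S_{j-1}) = p_given_S when c_{j-1} = 0.
    ILLUSTRATIVE: the paper's conditionals vary across pairs."""
    c = [1 if rng.random() < p_first else 0]
    for _ in range(1, J):
        p = p_given_R if c[-1] == 1 else p_given_S
        c.append(1 if rng.random() < p else 0)
    return c
```

With persistence (for example \(p\_given\_R=0.7>0.3=p\_given\_S\)), the empirical frequency of a safe-to-risky repeat visibly exceeds that of a switch in a long simulated sequence.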

When \(\gamma =2\) (the value of \(\gamma\) I specify for the alternative hypothesis of persistence), at a nominal size of 5% and with \(N=200\) simulated subjects per sample, t-tests, signed-rank tests, and sign tests reject the null hypothesis in 89.71%, 89.20%, and 81.20% of the 10,000 samples, respectively. These power figures show that both the t-tests and the signed-rank tests get very close to 90% power with \(N=200\), whereas the sign tests are noticeably less powerful than that. The alternative hypothesis of alternation (I specify \(\gamma =0.5\) for this) produces very similar results. The t-tests, signed-rank tests, and sign tests reject the null hypothesis in 90.37%, 90.13%, and 81.78% of the 10,000 samples, respectively. Again, both the t-tests and the signed-rank tests get very close to 90% power with \(N=200\), whereas the sign tests are noticeably less powerful than that.

I made the same calculations above for progressively larger samples (beginning at \(N=100\) and stepping this up in increments of 10) until the sample size produced roughly 90% power for both \(\gamma =2\) and \(\gamma =0.5\), which first occurs at \(N=200\). This is how the sample size was chosen.

Table A1 The lottery pairs