Axiomatic principles of decision-making have been at the forefront of psychological research over the last five decades, as a large body of work has questioned their empirical basis. However, past research has shown that it is crucial to use proper statistical methods when analyzing the results of tests of axiomatic principles. For instance, Regenwetter, Dana, and Davis-Stober (2011) found little evidence against the choice principle of transitivity when reanalyzing past published results with proper methods. The goal of the present study is to identify the methodological pitfalls of a frequently used measure of context effects and to propose new and more robust statistical alternatives for identifying context effects.

Past empirical research has shown that people’s preference for one target option depends on the choice set in which it is presented (e.g., Debreu, 1960; Tversky, 1972; Tversky & Russo, 1969; Rumelhart & Greeno, 1971; Busemeyer, Gluth, Rieskamp, & Turner, 2019; Simonson & Tversky, 1992; Tversky & Simonson, 1993; Roe, Busemeyer, & Townsend, 2001; Huber, Payne, & Puto, 1982; Trueblood, Brown, Heathcote, & Busemeyer, 2013; Dhar & Simonson, 2003; Mishra, Umesh, & Stem, 1993; O’Curry & Pitts, 1995; Wedell, 1991; Choplin & Hummel, 2005). Here, we consider three of the most studied context effects: (1) the similarity effect, which is the finding that people prefer a target option when it is presented in a choice set with two other dissimilar options compared to when it is presented in a set with one similar and one dissimilar option (Tversky, 1972); (2) the attraction effect, which is the finding that people prefer a target option when it is presented in a choice set with a similar but inferior option (Huber et al., 1982); and (3) the compromise effect, which is the finding that people prefer a target option when it is in between two more extreme options in the attribute space (Simonson, 1989). Crucially, these findings violate the independence from irrelevant alternatives (IIA) principle (Luce, 1959), which assumes that the relative preference for two options should not be affected by the presence of other available options (for an overview, see Rieskamp, Busemeyer, & Mellers, 2006).

These context effects have been observed across different domains and tasks: from perceptual decisions (e.g., what is the largest stimulus? e.g., Trueblood et al., 2013; Choplin & Hummel, 2005) to likelihood judgments (e.g., how likely is a runner to win a race? e.g., Windschitl & Chambers, 2004) to preferential decisions (e.g., which consumer product is most preferable? Amir & Levav, 2008; Wedell & Pettibone, 1996; O’Curry & Pitts, 1995; Mishra et al., 1993; Farmer, Warren, El-Deredy, & Howes, 2017; Tversky, 1972; Simonson & Tversky, 1992) to decisions under risk (Mohr, Heekeren, & Rieskamp, 2017; for reviews on context effects see, e.g., Busemeyer, Barkan, Mehta, & Chaturvedi, 2007; Busemeyer et al., 2019; Bettman, Luce, & Payne, 1998; Heath & Chatterjee, 1995; Neumann, Böckenholt, & Sinha, 2016). Moreover, a number of cognitive theories have been developed to explain how and why context effects arise (e.g., Tversky, 1972; Simonson & Tversky, 1992; Wedell, 1991; Tversky & Simonson, 1993; Roe et al., 2001; Usher & McClelland, 2004; Bhatia, 2013; Trueblood, Brown, & Heathcote, 2014; Wollschläger & Diederich, 2012; Noguchi & Stewart, 2018; Soltani, Martino, & Camerer, 2012; Louie, Khaw, & Glimcher, 2013; Howes, Warren, Farmer, El-Deredy, & Lewis, 2016; Spektor, Gluth, Fontanesi, & Rieskamp, 2019; for a recent review, see Wollschlaeger & Diederich, 2020; for systematic comparisons of models see Evans, Holmes, & Trueblood, 2019; Hotaling & Rieskamp, 2018; Turner, Schley, Muller, & Tsetsos, 2018).

Despite the large effort to explain context effects through the development of cognitive models, there has been relatively little effort to develop a statistically sound approach for testing whether the effects exist in the first place. Past work has already highlighted the challenges of a robust statistical analysis of context effects. For example, Hutchinson, Kamakura, and Lynch (2000) showed that context effects may arise from latent classes of participants who have strong attribute preferences but do not exhibit context effects within each participant class: Context effects can emerge at the aggregate level because of the different latent choice patterns of participants, which can remain unaccounted for by popular statistical tests. Liew, Howe, and Little (2016) provided further evidence for different latent classes of participants in context-effect experiments using a Bayesian clustering method. These studies thus suggest that looking only at aggregate descriptions of the data can provide a misleading picture of context effects.

We propose a Bayesian approach to identifying context effects. Via simulations, we show that our approach is resistant to biases due to different numbers of observations per choice set, in contrast to a frequently used alternative approach. In addition, we reanalyze the data of five published experiments, showing that with our proposed method, the evidence for the existence of context effects partly differs from that reported in the original publications.

The decision problem

In traditional context-effect experiments, a pair of similar options (say options A and B) is embedded in two different choice sets (i.e., the choice contexts). In some studies, participants initially express their preferences for options A and B when presented as pairs, which is considered a baseline condition, and later the same options are embedded in triplets (e.g., Tversky, 1972; Malkoc, Hedgcock, & Hoeffler, 2013; Mishra et al., 1993; Huber et al., 1982; Simonson & Tversky, 1992; Dhar & Simonson, 2003; Wedell, 1991 among others).

In contrast to the traditional context-effect experiments, we focus on studies that use two triplets to measure context effects, an approach that has been used more often in recent work (e.g., Berkowitsch, Scheibehenne, & Rieskamp, 2014; Trueblood et al., 2013; Trueblood, Brown, & Heathcote, 2015; Farmer et al., 2017; Trueblood et al., 2014, among others). The use of two triplets provides the advantage that any effects of the contexts cannot be confounded with the number of options presented. In experiments with two triplets, two core stimuli (A and B) are embedded in two different choice sets consisting of three options each (i.e., {A,B,C} and {A,B,D}), whereby only the attribute values of the third option (C or D) change across contexts. To illustrate the point, consider two cars that trade off on two attributes (see Table 1): Car A is highly fuel efficient but is expensive, whereas car B is cheaper but less fuel efficient. These two baseline options are embedded in two triplets: In one triplet, car C is more fuel efficient than car A and more expensive (Set 1), and in the other triplet, car D is less fuel efficient than car B but is also less expensive (Set 2). This is an example of how one can elicit the compromise effect, according to which adding extreme options in the choice set makes average options seem like compromises: The relative preference for car A compared to B should increase in the first described set and decrease in the second set. Therefore, car A is the target option and car B is the competitor option with respect to the compromise effect in the first set, whereas car B is the target and car A is the competitor with respect to the compromise effect in the second set.

Table 1 Example of a multiattribute choice situation representing the compromise effect

Wedell (1991) introduced the two-triplet paradigm as a means to increase the statistical power to detect the attraction effect, since the third option affects the two core stimuli (A and B) differently in the two choice sets (thereby providing two opportunities for the effect to emerge). A recent meta-analysis on the compromise effect confirmed that the two-triplet paradigm elicits the effect more strongly than the one-pair-one-triplet paradigm (Neumann et al., 2016). However, in analyzing the attraction effect, Wedell (1991) used ANOVAs on proportions, for which the assumption of homogeneity of variance is violated by default (cf. Jaeger, 2008), so his analysis was not optimal. Since then, the question of which statistical approach is robust for testing context effects has not been revisited. We suggest an answer to this question by validating and introducing a Bayesian approach to estimating context effects.

The relative choice share of the target

To measure the effect of context, we first determine the choice frequency of the target in the first context C1 (i.e., \(n_{\mathrm{t},\mathrm{C}1}\)) and divide it by the summed choice frequency of the target and the competitor in the same context (i.e., \(n_{\mathrm{t},\mathrm{C}1} + n_{\mathrm{c},\mathrm{C}1}\)), a measure called the relative choice share of the target (RST; cf. Berkowitsch et al., 2014). The RST for one context/triplet will deviate from .50 when either the target or the competitor is preferred by the decision-maker. In a second step, the RST is determined for the second context C2 as well, that is, \(n_{\mathrm{t},\mathrm{C}2}\) relative to \(n_{\mathrm{t},\mathrm{C}2} + n_{\mathrm{c},\mathrm{C}2}\). In the second context, the options representing the target and the competitor switch roles. Therefore, in the absence of a context effect, the average RST across both contexts is equal to .50. If the total frequencies with which the target and competitor are chosen are identical across both contexts (i.e., if \(n_{\mathrm{t},\mathrm{C}1} + n_{\mathrm{c},\mathrm{C}1} = n_{\mathrm{t},\mathrm{C}2} + n_{\mathrm{c},\mathrm{C}2}\)), the RST can be determined as

$$RST_{\;\text{UW}} = \frac{n_{\mathrm{t},\mathrm{C}1} + n_{\mathrm{t},\mathrm{C}2}}{n_{\mathrm{t},\mathrm{C}1} + n_{\mathrm{t},\mathrm{C}2} + n_{\mathrm{c},\mathrm{C}1} + n_{\mathrm{c},\mathrm{C}2}},$$
(1)

following the definitions of Berkowitsch et al. (2014). However, when the total frequencies with which the target and competitor are chosen differ across choice sets, the above procedure runs into problems. For example, assume that a participant chose Target1 30 times, Competitor1 20 times, Target2 10 times, and Competitor2 15 times (the indices correspond to Choice Sets 1 and 2, respectively) in the hypothetical car scenario presented above (cf. Table 1). Here, the RST1 in Context 1 is .60 and the RST2 in Context 2 is .40, so that the average is .50. However, following Equation 1, RST\(_{\text {UW}} = \frac {30+10}{30+20+10+15} = .53\), indicating a small compromise effect. Note that in this case, IIA was not violated, since the average RST was .50.

Collapsing choice observations across the two sets (as in Eq. 1) is mathematically equivalent to calculating the RST based on the weighted average between the two within-set RST proportions, where the weights are the sample sizes of the two choice sets (see Appendix A for more details). For this reason, we call this method of measurement RSTUW, because it allows for unequal weights.

If the total frequencies with which the target and competitor are chosen are different across the two choice contexts (i.e., if \(n_{\mathrm{t},\mathrm{C}1} + n_{\mathrm{c},\mathrm{C}1} \neq n_{\mathrm{t},\mathrm{C}2} + n_{\mathrm{c},\mathrm{C}2}\)), the RST should be determined as

$$RST_{\;\text{EW}} = 0.5*\left(\frac{n_{\mathrm{t},\mathrm{C}1}} {n_{\mathrm{t},\mathrm{C}1} + n_{\mathrm{c},\mathrm{C}1}}+ \frac{n_{\mathrm{t},\mathrm{C}2}} {n_{\mathrm{t},\mathrm{C}2} + n_{\mathrm{c},\mathrm{C}2}}\right).$$
(2)

Equation 2 is the simple average of the two within-set RSTs (cf. Spektor et al., 2019). Because the simple average weights each set equally, regardless of sample size, it is denoted RSTEW. In our car example, RST\(_{\text {EW}} = .5*(\frac {30}{30+20} + \frac {10}{10+15}) = .50\), which now correctly shows no compromise effect. RSTEW is therefore equally informed by the uncertainty of both within-set RST ratios, namely RST1 and RST2. Note that when the total frequency of choosing the target and competitor is equal in both contexts, RSTEW and RSTUW are identical.
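To make the computational difference concrete, the following R snippet (a minimal sketch; the function names are ours and not taken from any published code) computes both measures for the hypothetical car example above.

# Relative choice share of the target under unequal weights (pooling raw counts, Eq. 1)
rst_uw <- function(t1, c1, t2, c2) {
  (t1 + t2) / (t1 + c1 + t2 + c2)
}

# Relative choice share of the target under equal weights (averaging within-set proportions, Eq. 2)
rst_ew <- function(t1, c1, t2, c2) {
  0.5 * (t1 / (t1 + c1) + t2 / (t2 + c2))
}

# Hypothetical car example: Target1 = 30, Competitor1 = 20, Target2 = 10, Competitor2 = 15
rst_uw(30, 20, 10, 15)  # = .53, suggesting a small compromise effect
rst_ew(30, 20, 10, 15)  # = .50, correctly indicating no effect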

Several studies have used the RSTUW in recent years (e.g., Trueblood et al., 2015; Trueblood et al., 2014; Trueblood, 2012; Trueblood et al., 2013; Berkowitsch et al., 2014; Spektor, Kellen, & Hotaling, 2018; Liew et al., 2016; Evans, Holmes, Dasari, & Trueblood, 2021). Importantly, none of these studies has examined the assumption that the choice frequencies of both target and competitor are equal across different choice sets. Although a few studies have used variants of RSTEW (e.g., Spektor et al., 2019; Turner et al., 2018; Molloy, Galdo, Bahg, Liu, & Turner, 2019), they did so by citing Berkowitsch et al. (2014), who used the RSTUW measure instead. None of the aforementioned studies has questioned or examined the difference between RSTUW and RSTEW and their implications for statistical inference in a systematic way. In the next section, we use a simulation study to address this issue. Crucially, we propose a Bayesian formulation of the RSTEW for the first time.

A simulation study of RST measures

In the previous section, we showed that the simplified RSTUW measure can lead to incorrect inferences about possible violations of IIA, so that RSTEW should always be preferred. However, because RSTUW has been used often in past work, it is worth asking under which conditions the RSTUW approximation leads to substantially biased inferences. We addressed this question via simulations. In the following, we show (i) how large the choice-frequency differences between the two choice sets have to be for RSTUW to yield biased inferences, and (ii) whether RSTUW’s bias is affected by the strength of the underlying context effect. To do this, we simulated a population of participants under different target/competitor sample-size and effect-size manipulations. We then tested whether RSTUW and RSTEW identify the true underlying context effect in the population.

We employed Bayesian and frequentist hypothesis tests. From the frequentist family, we used the t test (since this test has been used for RST in the literature before), which evaluates whether the mean RST of participants is equal to 50%. As a Bayesian test, we used a hierarchical version of the binomial distribution based on previous work (Trueblood, 2015). We simulated different scenarios, varying the presence or absence of the context effects and the sample-size difference between the two choice contexts.

Specifically, for the RSTUW version, we assumed that each participant’s RST is represented by a binomial (success) rate parameter 𝜃 and that all individual 𝜃 are sampled, at the group level, from a beta distribution with mean parameter μ (which itself has a beta distribution) and a concentration parameter κ (which has a gamma distribution). We used a similar parameterization and the same prior distributions as in Trueblood (2015) and Trueblood et al. (2015). The mean and concentration parameters relate to the alpha and beta parameters of the RST beta distribution as follows: a = μκ and b = (1 − μ)κ. For the RSTEW version, we estimated separate individual- and group-level 𝜃s for the two sets (i.e., 𝜃1 and 𝜃2 at the individual level, with means and concentrations μ1, μ2, κ1, and κ2 at the group level). For a graphical representation of the structure of both hierarchical models, see Appendix B.

To test the context effects in the Bayesian framework, we used Bayes factors (BFs), which quantify how much more likely the data are under the alternative hypothesis than under the null (or vice versa). Crucially, and unlike p values, BFs can quantify evidence in favor of the null hypothesis, and not just in favor of the alternative hypothesis (Kass & Raftery, 1995; Lee & Wagenmakers, 2014; Wagenmakers et al., 2018; Aczel et al., 2018). In practice, to calculate BFs, we separately fitted two models: the alternative-hypothesis model, in which all previously described parameters were free to vary and were therefore estimated from the data, and the null-hypothesis model, which is a constrained version of the former. In particular, in the case of RSTUW, μ was fixed to 0.50 rather than estimated, and in the case of RSTEW, μ2 = 1 − μ1, which is equivalent to constraining their average to 0.50 [(μ1 + μ2)/2 = 0.50].
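To illustrate, a minimal R/Stan sketch of the alternative-hypothesis model for the RSTUW measure could look as follows; the Beta(2, 2) and Gamma(0.001, 0.001) priors correspond to those described for the related models in this article, and the data and variable names are ours. The null model is identical except that μ is fixed to 0.50 (e.g., declared in the transformed data block) rather than estimated.

library(rstan)

# Alternative-hypothesis model for RST_UW: hierarchical beta-binomial.
# (Older array syntax; newer Stan releases prefer `array[S] int`.)
stan_code_alt <- "
data {
  int<lower=1> S;                       // number of participants
  int<lower=0> n_total[S];              // target + competitor choices (pooled over sets)
  int<lower=0> n_target[S];             // target choices (pooled over sets)
}
parameters {
  real<lower=0, upper=1> mu;            // group-level mean RST
  real<lower=0> kappa;                  // group-level concentration
  vector<lower=0, upper=1>[S] theta;    // individual RSTs
}
model {
  mu ~ beta(2, 2);                      // assumed prior (see text)
  kappa ~ gamma(0.001, 0.001);          // assumed prior (see text)
  theta ~ beta(mu * kappa, (1 - mu) * kappa);
  n_target ~ binomial(n_total, theta);
}
"

fit_alt <- stan(model_code = stan_code_alt,
                data = list(S = length(n_total), n_total = n_total, n_target = n_target),
                chains = 3, iter = 1500, warmup = 500)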

We simulated different data sets, varying (1) the generating group-level binomial-rate mean parameters and (2) the magnitude of the sample-size difference between the two choice sets. For each sample-size-difference level, we simulated 100 data sets and performed 100 independent hypothesis tests. In total, there were 59 sample-size-difference levels: For one, both sets had the same number of observations (i.e., 60 in one and 60 in the other), and for the rest we kept the sample size of one set fixed (at 60 observations) and reduced the sample size of the other by one observation at a time, down to a single observation (i.e., 59, 58, ..., 1). To use realistic generating parameter values, we took the mean posterior concentration parameter κ of the RSTEW measure applied to the data of Trueblood et al. (2015), a recent study on context effects (also included in our reanalysis study). We also assumed a maximum of 60 observations per set, which was the average number of observations per set and participant in Trueblood et al. (2015). Specifically, we simulated 55 participants and fixed the concentration parameter of the parent beta distribution to 5. We further fixed the mean of the parent beta distribution to different RST levels, such as 50, 55, and 60%, to simulate different effect-size scenarios of context effects (see Fig. 1 for more details). Finally, to create realistic scenarios, we assumed that the sample-size difference was not the same across all participants but was drawn from a truncated normal distribution with the intended sample-size difference as its mean and a standard deviation of 5.
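As an illustration, one such data set could be generated with the following R sketch (variable names are ours; for simplicity, the same within-set rate is assumed to govern both sets, and rounded, clipped normal draws stand in for the truncated normal).

set.seed(1)
n_subj    <- 55
kappa     <- 5
mu_true   <- 0.55   # group-level mean RST; 0.50 under the null hypothesis
size_diff <- 30     # intended sample-size difference between the two sets

# individual within-set RSTs drawn from the parent beta distribution
theta <- rbeta(n_subj, mu_true * kappa, (1 - mu_true) * kappa)

# per-participant sample sizes: Set 1 fixed at 60, Set 2 reduced by a noisy difference
diff_i <- pmin(59, pmax(0, round(rnorm(n_subj, mean = size_diff, sd = 5))))
n1 <- rep(60, n_subj)
n2 <- 60 - diff_i

# observed target choices in each set
t1 <- rbinom(n_subj, n1, theta)
t2 <- rbinom(n_subj, n2, theta)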

The simulation was performed in R. The Bayesian models were estimated using rstan through the No-U-Turn sampler (Carpenter et al., 2017). For the sampling procedure, we ran three independent chains of 1500 posterior samples each, 500 of which were used as warm-up and therefore discarded (Carpenter et al., 2017; Gelman, Carlin, Stern, Dunson, & Vehtari, 2013). For the RSTUW measure, we adopted the same prior distributions as proposed by Trueblood et al. (2015) and Trueblood (2015). The marginal likelihoods of the models were estimated with the bridge-sampling method (Gronau, Singmann, & Wagenmakers, 2020; Gronau et al., 2017). The marginal likelihoods are the normalizing constants of the joint posterior distributions and are necessary to calculate BFs; they often involve integrals that lack analytical solutions. Bridge sampling can approximate BFs more accurately than other methods, such as the (naive version of the) Savage–Dickey density ratio (e.g., see Heck, 2019).
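Assuming the null and alternative models have been fitted with rstan (e.g., as fit_null and fit_alt, as in the sketch above), the BF can then be approximated with the bridgesampling package roughly as follows.

library(bridgesampling)

# Approximate marginal likelihoods of both models via bridge sampling,
# then form the Bayes factor BF10 (alternative over null).
ml_alt  <- bridge_sampler(fit_alt)
ml_null <- bridge_sampler(fit_null)
bf(ml_alt, ml_null)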

Fig. 1 Results of the simulation. The log Bayes factors (BFs) are shown for the two RST methods (RSTUW and RSTEW). Black dots indicate means per unit N difference; gray dots indicate raw BFs. Means are given with confidence intervals. The dashed lines indicate the thresholds BF10 = 3 and BF10 = 1/3. RST = relative choice share of the target (the index indicates Set 1 or 2, respectively); EW = equal weights; UW = unequal weights. Upper panel: null hypothesis is true. Bottom panel: alternative hypothesis is true

In the following, we report the results of the Bayesian analyses (for the results of the frequentist analyses, see supplementary materials). Figure 1 shows the results of the simulation where the two sets have unequal sample sizes and, therefore, where RSTEW and RSTUW diverge.

As expected, when the null hypothesis was true (i.e., when there was no violation of IIA in the generating data), RSTUW showed a bias in favor of the alternative hypothesis, and this bias grew with larger differences in sample size between the two choice sets. The bias was exacerbated when the binomial rate parameters of the two sets were closer to 0 or 1. On the other hand, RSTEW showed no bias toward the alternative hypothesis, irrespective of sample-size differences between the two context sets. The RSTUW measure was also biased toward the null hypothesis when the alternative hypothesis was true (i.e., when there was a violation of IIA in the generating data), whereas the RSTEW measure remained unbiased. Crucially, the RSTUW measure is not always biased toward the alternative hypothesis: It can, in fact, be biased either way, depending on which choice set has the larger sample size. For example, if the set with the larger sample size has a within-set RST close to .50, it will drag RSTUW toward .50, biasing the result toward the null hypothesis. Conversely, if the set with the larger sample size has a within-set RST far from .50, it will push RSTUW away from .50, biasing the result toward the alternative hypothesis. The larger the sample-size differences across sets, the more biased the results of the RSTUW measure were. In sum, our simulations showed that the RSTUW measure produces false negatives or false positives when the total number of target and competitor choices is unequal across the two choice contexts.

Reanalyses of past studies

The simulations showed that RSTUW is a biased measure of context effects whereas RSTEW proves to be a robust measure. To examine whether RSTUW might have led to inaccurate conclusions in previous studies, we reanalyzed the data of five published articles using both RSTUW and RSTEW measures and evaluated their agreement.

To select which studies to reanalyze, we searched for original research articles examining the attraction, similarity, and compromise effects among all issues of four major psychological journals (Journal of Experimental Psychology: General, Psychological Review, Psychological Science, Psychonomic Bulletin & Review) published in the past decade (2010–2019). First, we identified 811 articles that contained the following keywords in their title and/or abstract: context (effect), attraction (effect), compromise (effect), similarity (effect). From this set of articles, we selected only the 17 in which options were characterized by more than one attribute. From this list, we selected five (i.e., Berkowitsch et al., 2014; Liew et al., 2016; Trueblood et al., 2015; Trueblood et al., 2014; Cataldo & Cohen, 2019) that had original data examining all three effects using a within-subject design (see supplementary materials for details). The within-subject design was necessary to examine correlations between context effects, given their hypothesized theoretical importance (e.g., Berkowitsch et al., 2014; Trueblood et al., 2015).

The study of Berkowitsch et al. (2014) and Study 2 from Liew et al. (2016) involved preferential tasks where individuals could choose between consumer products (e.g., notebook computers) with different attributes (e.g., weight in kilograms and battery life in hours). Although the study of Liew et al. (2016) is a replication of that of Berkowitsch et al. (2014), the former has a considerably larger pool of subjects (i.e., 134, compared to 48 in the original study). Cataldo and Cohen (2019) tested the effect of the presentation format on context effects in preferential tasks as well. We focus here on their results from the condition of by-alternative (vs. by-attribute) presentation format of stimuli, because this is more comparable with the rest of the studies included in our reanalysis. The study of Trueblood et al. (2015), on the other hand, involved a perceptual decision-making task where participants had to correctly indicate which rectangle had the largest area, given their different length and height attributes. Finally, in the study of Trueblood et al. (2014), participants were asked to indicate the likely murderer from a triplet of suspects.

The five studies included for reanalysis are of empirical importance because two of them (Trueblood et al., 2015, 2014) have served as the basis for the development of cognitive models, and one (Berkowitsch et al., 2014) rigorously tested different psychological models. In addition, the studies cover context-effect research in different domains: perceptual discrimination, suspect judgment, and consumer decisions. Finally, four of them (i.e., Berkowitsch et al., 2014; Trueblood et al., 2015, 2014; Liew et al., 2016) used the RSTUW as a measure of context effects.

To preprocess the data of the five studies, we applied the procedures described in the respective published articles. Thus, we performed our analyses on the same data sets that were originally analyzed. Our reanalyses relied on the same two measures that we used in the simulations: RSTUW and RSTEW. Specifically, we estimated the BF in favor of the null (i.e., no IIA violation) and the alternative (i.e., IIA violation) hypotheses, separately for both RST measures. We estimated all Bayesian models separately for each context effect and study. Overall, we thus fitted 60 Bayesian models (3 context effects × 5 studies × 2 RST measures × 2 hypotheses). All parameters of the Bayesian models were estimated in Stan (Carpenter et al., 2017) through the No-U-Turn sampler with six independent chains, each consisting of 20,000 posterior samples of which the first 1000 were discarded as warm-up (all other settings of model structure and fitting were the same as in our simulation study; for details see Appendix B). The prior distributions of the Bayesian models were the same as in Trueblood et al. (2015). All models converged, with \(\hat {R}\) always lower than 1.01.

In our simulations, the two RST measures led to the same inferences when there was no sample-size difference across the two core option sets, but they diverged when these differences were substantial. However, by reanalyzing previously published data, we aimed at understanding how large these differences are in empirical data and whether such differences can also lead to biased RST conclusions. Substantial differences might especially occur in the similarity and compromise effect conditions. This is because the third added option could represent an attractive option (as it is more extreme on one attribute’s scale) and the attractiveness of the third option could vary across sets because of different attribute preferences. This may lead to different sample sizes of the core options. In contrast, in the attraction effect, the dominated decoy option is not chosen very often, leading to similar sample sizes for the two core options. Therefore, we predicted that the two RST measures would not substantially diverge in the attraction effect condition but would disagree in the similarity and compromise effect conditions.

We examined the sample-size differences between the two sets across all studies and context effects (Fig. 2). Densities of sample-size differences centered at zero indicate no substantial sample-size differences, whereas non-zero-centered distributions indicate that one set has more observations than the other on average. As expected, the distributions for the attraction effect were mostly centered around zero. In contrast, the density distributions for the similarity and compromise effects in the Liew et al. (2016) and Trueblood et al. (2015, 2014) studies were shifted away from zero.

Fig. 2 Empirical density distributions of the relative percentage change in sample size across the two choice sets. Ticks below the distributions indicate raw observations. Zero is marked by a dotted line

Results from highest density intervals

To perform hypothesis testing, we examined the 95% highest density intervals (HDI95%) of the posterior distributions of the hierarchical RST mean. Unlike in the simulation study, where the goal was to evaluate the strength of evidence in favor of or against the null/alternative hypotheses under RSTEW and RSTUW, we focus on the HDIs of the RST measures in this section because HDIs provide information about effect size. Specifically, HDIs not only indicate whether there is an effect but also convey its strength and direction (i.e., whether the RST is below or above 50%). Therefore, we report the hypothesis tests based on HDIs in the main text and, for simplicity, the BFs in Appendix C (we followed the same procedures to derive BFs as in our simulation study; cf. Lee & Wagenmakers, 2014; Kass & Raftery, 1995; Gelman et al., 2013; Kruschke & Liddell, 2018; Wagenmakers et al., 2018; Dienes, 2016).
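For illustration, assuming an rstan fit of the RSTEW model with group-level mean parameters named mu1 and mu2 (names are ours), the HDI of the hierarchical mean RST can be obtained from the posterior draws along the following lines; the HDInterval package is one convenient option.

library(rstan)
library(HDInterval)

# posterior draws of the two set-specific group-level means
draws <- rstan::extract(fit_ew)

# posterior of the hierarchical mean RST_EW: average of the two set-specific means
rst_ew_mean <- 0.5 * (draws$mu1 + draws$mu2)

hdi(rst_ew_mean, credMass = 0.95)   # an effect is indicated if the interval excludes .50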

Figure 3 shows the posterior distributions of the group-level mean of the RST measures. Table 2 provides their respective HDIs. For the attraction effect, both RSTUW and RSTEW led to the same qualitative conclusions in all five studies. Four of the five studies supported the presence of an attraction effect (i.e., the HDI95% of the group-level mean of both RST measures did not include 50%); the study of Cataldo and Cohen (2019) did not.

Fig. 3 Posterior distributions of the hierarchical mean RST in the five reanalyzed studies. The posteriors of both the RSTUW and the RSTEW measure are shown. For the RSTUW measure, the hierarchical mean is plotted directly; for the RSTEW measure, the hierarchical mean is computed as the arithmetic average of the hierarchical posterior means of the two sets. RST = relative choice share of the target; EW = equal weights; UW = unequal weights

As expected, disagreements between the two RST measures were observed for the compromise and similarity effects: In the study by Trueblood et al. (2014), a compromise effect was identified when relying on RSTUW but not when using the unbiased RSTEW measure, whereas in Liew et al. (2016), RSTEW established the compromise effect and RSTUW did not. In the other three studies, identical conclusions regarding the compromise effect were drawn with either measure. Concerning the similarity effect, in the studies by Cataldo and Cohen (2019) and Trueblood et al. (2014), a similarity effect was identified when relying on RSTUW but not when using the accurate RSTEW measure. In the other three studies, identical conclusions regarding the similarity effect were drawn with either measure (Fig. 4).

In sum, the RSTUW measure can lead to incorrect inferences. Overall, we tested three effects in five studies across a variety of decision domains. Both measures led to the same qualitative results in 11 of the 15 cases. The two measures disagreed in one-fourth of the cases (i.e., four cases): the similarity effect in Cataldo and Cohen (2019) and Trueblood et al. (2014), and the compromise effect in Liew et al. (2016) and Trueblood et al. (2014). Generally, the comparison based on the HDIs reveals that the RSTEW measure tends to produce posterior RST means with higher uncertainty. Moreover, in the similarity and compromise effect conditions, the RSTEW measure led to mean RST posterior distributions that included 50% more often than RSTUW did. Only in a few cases did the RSTEW measure conclude that the mean RST posterior differed from 50% when the RSTUW measure did not.

Table 2 Upper and lower bounds of the 95% HDIs of the posterior distributions of the hierarchical mean RST, by study, method (unequal vs. equal weights), and context effect

So far, we have examined the extent to which RSTEW and RSTUW lead to the same inferences when applied, within our Bayesian approach, to the data of the five articles included in the reanalysis. However, even when the accurate RSTEW measure is used, the qualitative results of the statistical inference might still coincide with the conclusions reported in the original papers. For this reason, we compared the inferential results of the RSTEW measure to the originally reported statistics (cf. Table 3).

To make the comparison more direct, we used the same statistical test and framework (i.e., frequentist or Bayesian) as the original studies while switching from RSTUW to RSTEW. Specifically, for Trueblood et al. (2014) we used a one-sample frequentist t test on RSTEW, for Liew et al. (2016) we used a Bayesian one-sample t test on RSTEW, and for Berkowitsch et al. (2014) and Trueblood et al. (2015) we used our proposed Bayesian RSTEW measure (because the latter two studies used a Bayesian formulation of RSTUW). In all Bayesian analyses, we employed the same prior distributions as reported in the original articles. Finally, we did not include Cataldo and Cohen (2019) in this comparison for the following reason: The authors used a regression model with several main-effect and interaction terms. Because we looked only at a subset of their data (i.e., the by-alternative presentation condition), and because regression coefficients are conditional on the data and on the other terms included in the model, we excluded the study from the comparison.

Table 3 Comparison of RSTEW reanalysis results and originally reported test results

Table 3 shows that in the studies that used RSTUW (i.e., Berkowitsch et al., 2014; Trueblood et al., 2015, 2014; Liew et al., 2016), no evidence for the alternative hypothesis (i.e., that the RST is not equal to 50%) was found under the RSTEW measure in five of the 12 cases. Specifically, the attraction effect was supported in all studies (i.e., Berkowitsch et al., 2014; Trueblood et al., 2015, 2014; Liew et al., 2016), whereas the compromise effect was observed only in Berkowitsch et al. (2014) and Trueblood et al. (2014). The similarity effect was found only in the study of Trueblood et al. (2015). Overall, the unbiased RSTEW measure led to qualitative conclusions different from those originally reported in two of the 12 cases (i.e., the compromise effect of Liew et al., 2016, and the similarity effect of Trueblood et al., 2014).

Correlations among context effects

To understand the underlying cognitive mechanisms driving the effects, research on context effects has also examined the correlation between effects. In previous studies, a positive correlation between the attraction and compromise effects has been observed, along with negative correlations between the compromise and similarity effects, and between the similarity and attraction effects (e.g., Berkowitsch et al., 2014). These specific correlations play a significant role in the evaluation of psychological theories, because their existence might imply that similar cognitive mechanisms cause certain context effects (for a recent large-scale replication of these correlations, see Dumbalska, Li, Tsetsos, & Summerfield, 2020). For this reason, we determined the correlations among effects for the five studies we reanalyzed (see Table 4 for HDIs of correlation coefficients). We found that the correlation coefficients were mostly similar between the RSTUW and RSTEW measures with the only exception being the correlation between the similarity and compromise effects in Trueblood et al. (2015).

Thus, in contrast to the analysis of the population RST mean presented above, the two RST measures largely agreed in their qualitative conclusions regarding the correlations between context effects. This happened for two reasons. First, the correlation of two variables is modeled according to the multivariate normal distribution (which has marginal means and a variance–covariance matrix as parameters; marginal refers to the parameter or distribution of one of the two variables entering the correlation). Correlations are not affected by changes in the location of the marginal means (i.e., the means of the two marginal distributions whose correlation we estimate). Therefore, although RSTEW and RSTUW might disagree on the group RST mean, such disagreement need not affect the correlation coefficients.

Second, correlations can be affected by differences in the marginal variance of RST distributions (i.e., larger variance differences across marginal distributions render correlations more and more difficult to find). We evaluated the marginal variance of the group-level RST distributions for both RSTEW and RSTUW (for more details see the supplementary materials) and we found that the two measures produced similar marginal variances for each context effect, thus preserving the effect covariation. This explains why the two RST measures produced similar qualitative results regarding correlations. We can, therefore, conclude that context-effect correlations are generally a more robust pattern than RSTs, even in the presence of sample-size differences across choice sets.

Table 4 Coefficients for context-effect correlations by study and RST method

A note on detecting violations of the regularity principle

So far, we have discussed two ways of identifying violations of the IIA principle in a sample, namely RSTEW and RSTUW. We showed that RSTEW has advantages over RSTUW and that the use of RSTUW in published studies can sometimes lead to erroneous inferences.

The regularity principle (Luce, 1977) is conceptually related to (but not logically implied by) the IIA principle. According to regularity, adding an option to a choice set should never increase the choice probability of any option from the original set (for a review, see Rieskamp et al., 2006). The attraction effect has historically been taken as an empirical illustration of a violation of the regularity principle (Huber et al., 1982). In contrast to the proposed RSTEW measure, which tests only for violations of the IIA principle, we now introduce a measure appropriate for testing violations of the regularity principle. This measure should be used specifically to test for the presence of the attraction effect.

The absolute choice share of the target and competitor

Formally, the regularity principle states that for any option x that is part of the option sets X and Y, it should hold that when \(X \subseteq Y\), \(P_{X}(x) \geq P_{Y}(x)\). A direct test of the regularity principle is to use a one-pair-one-triplet experimental design, in which participants initially express preferences between two options and then again after a third option is added to the choice set. In this design, a direct violation of the regularity principle occurs if the choice probability of either option from the pair set increases in the triplet set. Many studies have used this design, including the original attraction-effect study (i.e., Huber et al., 1982).

However, the focus of the present work is the two-triplet design. In this design, two options A and B are embedded in two different triplets. Each triplet is made with the addition of a decoy option: D1, which is close to the target option A (Context 1; C1) and D2, which is close to the target option B (Context 2; C2). Although in this design the choice probabilities of A and B in the pair {A,B} are never observed, we can deduce relations between the two-triplet choice sets if regularity holds (as shown in Appendix D in more detail). Therefore, the two-triplet design can indirectly test for violations of the regularity principle.

Specifically, we propose the absolute choice share of the target (AST) and absolute choice share of the competitor (ASC) as measures for the attraction effect and the reversed attraction effect (for details about their derivation see Appendix D):

$$AST = 0.5*\left(\frac{n_{\mathrm{t},\mathrm{C}1}} {n_{\mathrm{t},\mathrm{C}1} + n_{\mathrm{c},\mathrm{C}1} + n_{\mathrm{d},\mathrm{C}1}}+ \frac{n_{\mathrm{t},\mathrm{C}2}} {n_{\mathrm{t},\mathrm{C}2} + n_{\mathrm{c},\mathrm{C}2}+ n_{\mathrm{d},\mathrm{C}2}}\right),$$
(3)
$$ASC = 0.5*\left(\frac{n_{\mathrm{c},\mathrm{C}1}} {n_{\mathrm{t},\mathrm{C}1} + n_{\mathrm{c},\mathrm{C}1} + n_{\mathrm{d},\mathrm{C}1}}+ \frac{n_{\mathrm{c},\mathrm{C}2}} {n_{\mathrm{t},\mathrm{C}2} + n_{\mathrm{c},\mathrm{C}2}+ n_{\mathrm{d},\mathrm{C}2}}\right),$$
(4)

where \(n_{\mathrm{t}}\), \(n_{\mathrm{c}}\), and \(n_{\mathrm{d}}\) refer to the choice frequencies of the target (t), competitor (c), and decoy (d), respectively. Regularity is satisfied if both AST and ASC are below or equal to 50%. AST > 50% indicates the presence of an attraction effect, whereas ASC > 50% indicates the presence of the reverse of the attraction effect. Note that AST ≤ 50% or ASC ≤ 50% alone does not necessarily imply that regularity is not violated. Therefore, if one is agnostic about the hypothesized direction of the regularity violation, one should examine both AST and ASC to see whether either of them exceeds 50%.
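As a minimal illustration, the sample-level AST and ASC of Equations 3 and 4 can be computed from raw choice counts as follows (argument names are ours; the counts are hypothetical).

# Absolute choice shares of the target (Eq. 3) and competitor (Eq. 4).
# t, c, d are the target, competitor, and decoy counts; the suffix denotes the choice set.
ast <- function(t1, c1, d1, t2, c2, d2) {
  0.5 * (t1 / (t1 + c1 + d1) + t2 / (t2 + c2 + d2))
}
asc <- function(t1, c1, d1, t2, c2, d2) {
  0.5 * (c1 / (t1 + c1 + d1) + c2 / (t2 + c2 + d2))
}

# Example: a participant with a pronounced preference for the target in both sets
ast(t1 = 35, c1 = 20, d1 = 5, t2 = 33, c2 = 22, d2 = 5)  # about .57; > .50 suggests an attraction effect
asc(t1 = 35, c1 = 20, d1 = 5, t2 = 33, c2 = 22, d2 = 5)  # about .35; no reversed attraction effect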

Reanalyses of past studies

Although the RST differs from the AST (the former evaluates violations of the IIA principle in the similarity and compromise effects, whereas the latter evaluates violations of the regularity principle in the attraction effect), many studies that employed two-triplet experimental designs have instead used the RST to analyze attraction-effect trials (e.g., Spektor et al., 2018; Trueblood et al., 2013; Berkowitsch et al., 2014; Trueblood et al., 2015; Trueblood, 2012; Spektor et al., 2019; Evans et al., 2021, among others). In this section, we propose a Bayesian formulation of AST and ASC and apply the AST to the data of published studies.

We created a Bayesian model to infer AST and ASC from a sample, which is an extension of the Bayesian formulation of the RSTEW. Specifically, we modeled each participant’s choice probabilities as a multinomial simplex vector \(\vec {\theta }\). All participants’ \(\vec {\theta }\) were drawn from a group-level Dirichlet distribution with a simplex mean vector \(\vec {\mu }\) and a concentration parameter κ. \(\vec {\mu }\) and κ followed a Dirichlet and a gamma distribution, respectively. We used a Dirichlet prior of (2,2,2) on \(\vec {\mu }\), which is the multinomial equivalent, for three alternatives, of the prior we used for the RSTEW measure. For κ, we employed a gamma prior of (0.001,0.001), the same as for the RSTEW measure. Crucially, as with the RSTEW measure, we estimated separate hierarchical and individual-level parameters for the two choice sets. Therefore, the AST is derived as the average of the posterior of the target’s entry in \(\vec {\mu _{1}}\) (i.e., from Set 1) and in \(\vec {\mu _{2}}\) (i.e., from Set 2), and the ASC is derived analogously from the posteriors of the competitor’s entries.
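For illustration, assuming an rstan fit of this Dirichlet model with set-specific simplex means named mu1 and mu2 (names are ours), each ordered as target, competitor, decoy, the posteriors of the AST and ASC can be assembled from the draws roughly as follows.

# posterior draws of the two set-specific group-level simplex means
draws <- rstan::extract(fit_dir)

ast_post <- 0.5 * (draws$mu1[, 1] + draws$mu2[, 1])  # average target share across sets
asc_post <- 0.5 * (draws$mu1[, 2] + draws$mu2[, 2])  # average competitor share across sets

HDInterval::hdi(ast_post, credMass = 0.95)  # attraction effect if the interval lies above .50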

Table 5 presents the results of the AST reanalysis of the studies that were used in the reanalysis of RSTEW in the previous section (i.e., Berkowitsch et al., 2014; Trueblood et al., 2015, 2014; Liew et al., 2016). In one of the four cases (i.e., Trueblood et al., 2015), no evidence for the attraction effect was found under the AST, whereas the alternative hypothesis (i.e., that the AST is higher than 50%) was supported in the original study. For all other studies, the qualitative results of the AST measure corresponded to those originally reported. Generally, under the AST, the attraction effect was weaker (i.e., the mean posterior distributions were closer to the null hypothesis).

Table 5 Comparison of AST reanalysis results in attraction effect trials and originally reported test results

Discussion

The current work examines the statistical analysis of context effects in multiattribute decision making. In particular, when determining the effect of a context in triplet designs, it is important to be aware of biases caused by differences in the choice frequencies of the target and competitor options across choice sets. First, the often-used RST measure of context effects, which pools choices across choice sets (i.e., RSTUW), is not robust to such biases, in contrast to an RST that averages the within-set proportions with equal weights (i.e., RSTEW). Second, the RST measures are not appropriate for the attraction effect, which concerns the regularity principle and for which the AST should be used instead. Furthermore, the conclusions of previously published studies changed in one-fourth of the cases when reanalyzed with robust and appropriate methods. Our results emphasize the importance of devising and evaluating statistical tests before empirically testing axiomatic principles of decision making.

Specifically, we first showed through a simulation study that the two measures can lead to different conclusions whenever the choice frequencies for the two core options differ substantially between contexts. When the within-set RST is closer to 0 or 1, even a sample-size difference of half the set size between the two choice sets can bias the RSTUW approximation. When the within-set RST is closer to .50, larger sample-size differences are required to bias RSTUW. Second, we examined whether using the accurate RSTEW would change the conclusions of past studies that had used RSTUW. For this, we reanalyzed the data of five published studies on context effects. The results showed substantial differences: The two RST methods disagreed in 25% of the cases when considering the HDIs. In cases of disagreement, RSTEW mostly (but not always) favored the null hypothesis, whereas RSTUW indicated an effect that the unbiased measure did not support. The disagreements concerned the similarity and compromise effects, for which the choice frequencies can differ substantially across contexts. In the similarity and compromise effects, the third option can be attractive, so it might be chosen frequently. This can lead to large sample-size differences across contexts (cf. Fig. 2). In contrast, in the attraction effect, the third option is dominated and therefore rarely chosen, so it does not modify the overall choice frequencies of the two core options as much as in the similarity and compromise effects.

We further looked at the differences of BFs between RSTEW and RSTUW when applied to the reanalysis of past studies (see Appendix C for more details). The BFs showed that the RSTEW and RSTUW measures disagreed in 40% of the cases, which indicates an increased disagreement rate compared to the HDIs of the two RST measures. Generally, we found that BFs were more conservative than HDIs in supporting the alternative hypothesis (cf. Wagenmakers, Lee, Rouder, & Morey, 2019).

Interestingly, when relying on the RSTEW measure, we observed less evidence for context effects. According to the BF analysis, at least moderate evidence for the existence of context effects was observed in only 26% of the cases, and the HDIs indicated an effect in only 46% of the cases. Therefore, our results corroborate the finding that context effects can be hard to find at the aggregate level (sometimes called “the fragile nature” of context effects; Trueblood et al., 2015). In sum, our results show that it is important to use the accurate RSTEW measure to identify context effects, because the RSTUW measure is prone to biased conclusions.

The question of how to collapse choices across different sets of options to compute the RST is also relevant in experimental designs where a baseline condition with the two core options is added to the condition with two triplet sets. For example, Turner et al. (2018) used a modified version of the RSTEW that adjusted for the baseline probabilities of the two core options. In addition, experimental designs that employ only a binary and a ternary choice set may avoid the question of collapsing observations since there is no target option in the binary set. Future research should thus examine and compare existing methods of hypothesis testing in these different experimental designs.

Crucially, we also make the novel contribution of showing that the RST measures are not suitable for identifying violations of the regularity principle. Instead, in the case of the attraction effect, the AST and ASC should be used. Unlike the RST measures, which test only for violations of the IIA principle (i.e., the similarity and compromise effects), the AST and ASC represent proper tests of the regularity principle. For the purpose of hypothesis testing, we proposed a Bayesian formulation of the AST, which is a generalization of the Bayesian model of the RSTEW measure. In addition, after reanalyzing past studies, we observed that the attraction effect was estimated to be smaller with the AST than with the RST measure; in one case (i.e., Trueblood et al., 2015), the effect even disappeared. These results highlight the importance of employing the unbiased RSTEW in the case of IIA violations and the AST/ASC in the case of regularity violations.

Throughout our analyses we used the Bayesian framework for hypothesis testing. We did so because we believe this framework has advantages over traditional null-hypothesis significance testing (cf. Wagenmakers et al., 2018; Lee & Wagenmakers, 2014). However, the bias of the RSTUW measure persists even if one resorts to frequentist statistics, as we showed in our simulation (see supplementary materials). Therefore, our results are informative also for researchers who wish to implement their analyses in the frequentist framework instead.

The measures we proposed apply not only to the three popular context effects (i.e., the attraction, similarity, and compromise effects) but also to additional context effects that are elicited through two ternary choice sets. As a proof of concept, we reanalyzed the data of Spektor et al. (2018), who investigated the emergence of the reversal of the attraction effect (i.e., the so-called repulsion effect) under different incentivization schemes with perceptual stimuli (for details and results, see supplementary materials). Interestingly, the authors found a repulsion effect in both the gain and the loss domain of their Experiment 1. Although Spektor et al. (2018) used the RSTUW measure, the proper tests for violations of the regularity principle are the AST and the ASC. Our reanalysis with these absolute measures indicated that, in contrast to the authors’ conclusions for their Experiment 1, there was no repulsion effect in either the loss or the gain domain.

In our simulation study, we showed that RSTEW circumvents the problem of unequal attribute preferences in context-effect experiments by modeling the RST of each choice set separately. However, we employed RSTEW only as a measurement tool and not as a cognitive explanation of how attribute preferences arise (in contrast to cognitive process models such as Trueblood et al., 2014; Roe et al., 2001; Bhatia, 2013; Usher & McClelland, 2004; Noguchi & Stewart, 2018; Howes et al., 2016; Spektor et al., 2019). Researchers interested in explaining the cognitive underpinnings of human behavior could use the RSTEW measure as a starting point to empirically establish the presence of context effects before building more complex (cognitive) models to better understand participants’ behavior. In this way, our work also contributes to theory advancement.

Our work is in line with recent calls to revisit the assumptions of traditional statistical methods to achieve higher levels of reproducibility and statistical clarity in the field of psychology (e.g., Wagenmakers, 2007; Ioannidis, 2005; Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012; Munafò et al., 2017; Cumming, 2014; Gigerenzer & Marewski, 2015; Nuzzo, 2014). Although the field of decision making has recently seen a steep increase in cognitive models, much less attention has been paid to the methodological challenges that characterize the statistical analysis of the effects these models aim to explain. As shown in the present work, these challenges are nontrivial, since they may lead to biased conclusions if not adequately dealt with. We believe that developing robust statistical tests able to establish the presence or absence of psychological effects should be given high priority.