We face many choices every day, each of which may be guided by subtle but powerful forces. One such force lies in the way information is presented. Consider the following thought experiment: Suppose you have invested $6,000 in stocks of a green energy company. As the economy is currently experiencing a downturn, you have two alternatives, one will save $2,000 of your investment, while the other one saves everything with a 33% chance and nothing with a 67% chance. Which of these strategies would you prefer?

Now, consider that you instead learned that one strategy will make you lose $4,000 of your investment, while the other one makes you lose nothing with a 33% chance and everything with a 67% chance (adapted from Bruine de Bruin et al., 2007). How would you choose based on this information?

Although the two descriptions are objectively equivalent, individuals typically choose the safe option (i.e., saving $2,000) when framed as gains rather than losses, and the risky option (i.e., losing nothing with a 33% chance and everything with a 67% chance) when framed as losses rather than gains. This well-established phenomenon is known as the risky-choice framing effect, whereby systematically different responses are elicited by the way the risk (gains vs. losses) is presented (Levin et al., 1998; Tversky & Kahneman, 1981). A related phenomenon occurs when characteristics of objects (e.g., treatment) are described in a positive (e.g., 50% success rate) or negative (e.g., 50% failure rate) way, which is known as the attribute framing effect (Levin et al., 1998; Levin & Gaeth, 1988).

Resistance to risky-choice and attribute framing

The framing effect and its underlying mechanisms have been studied in various experimental contexts (e.g., Bloomfield, 2006; Fagley & Miller, 1997; Finn, 2008; Levin & Gaeth, 1988; Whitney et al., 2008) since it had first been demonstrated (Tversky & Kahneman, 1981). Thereby, competing claims about the framing effect and its mechanisms (e.g., Chick et al., 2016, Mandel & Kapler, 2018) are often based on a single task (e.g., variations of the Disease problem, Tversky & Kahneman, 1981; see Piñon & Gambara, 2005), limiting the reliability and validity of the findings. Different framing tasks in different domains (e.g., money or lives) are used to support these claims, which makes it difficult to compare the evidence for or against the competing viewpoints because the framing effect is sensitive to task content (Fagley & Miller, 1997). In addition, framing tasks are conceptualized for between-participants administration. This does not allow for calculating individual difference scores and thus investigating individual differences of framing effects (but see Mandel & Kapler, 2018).

The Resistance to Framing scale, a subscale of the Adult Decision-Making Competence Index (A-DMC; Bruine de Bruin et al., 2007), solves the above problems by using 14 positively and negatively framed tasks for both types of framing which vary by topic (e.g., money, human and animal lives, health). The scale thus potentially provides a unified (standard) measure of Resistance to Framing that is relatively content-independent. The within-participants administration of the scale allows for investigating individual differences by calculating individual scores. This within-participants administration is often used to identify cognitive predictors of Resistance to Framing, such as numeracy (Del Missier et al., 2012), cognitive ability (e.g., Raven’s standard progressive matrices; Bruine de Bruin et al., 2007), working memory (Hoffmann et al., 2020), cognitive and executive functions (e.g., cognitive flexibility and inhibitory control; Mäntylä et al., 2012, Piryaei et al., 2017), as well as cognitive reflection (Allred, 2018).

Although the scale is widely used, its performance has rarely been extensively tested. Previous research has exclusively tested the scale’s performance in terms of internal consistency and predictive validity in a US-American sample (Bruine de Bruin et al., 2007). Besides this research, several studies have validated the A-DMC as a whole and reported the internal consistency of the Resistance to Framing scale, which varied from Cronbach’s ɑ = .30 to .72 (Bavolar, 2013; Del Missier et al., 2012; Feng et al., 2013, as cited in Liang & Zou, 2018; Parker & Fischhoff, 2005). From this previous work, researchers have concluded that the Resistance to Framing Scale as part of the A-DMC is a valid and reliable instrument without extensively investigating its performance.

The factor structure underlying the Resistance to Framing scale remains subject to debate. When using the scale, researchers classically average over the absolute difference of the ratings in the positive and negative frame in each scenario (Bruine de Bruin et al., 2007). They thereby implicitly assume that the scale is unidimensional. This assumption is supported by evidence showing that risky-choice framing can be reduced to attribute framing, as systematically different choices in risky-choice framing scenarios result from framing effects of the riskless rather than the risky option (Kühberger & Gradl, 2013). Contrasting with this view, the original typology introduced by Levin et al. (1998) posits that risky-choice and attribute framing are distinct types based on what is framed (options with varying levels of risk vs. characteristics of an object), what is affected (risk preference vs. object evaluation), and how the effect is assessed (comparing choices of options vs. attractiveness of an object)—all of which are expected to influence the effect. Our reanalyses of previous data (Bavolar, 2013; Bruine de Bruin et al., 2007) before conducting this study supported two distinct types (Supplement 1).

The factor structure underlying the scale has not yet been investigated, although such an investigation is important because the standard practice of averaging over all absolute differences is only warranted as long as a one-factor model is empirically supported.

The present study

The present study extensively tested the performance of the Resistance to Framing scale in samples from Bulgaria and North America. This additional assessment contributes to existing studies in three key ways: First, it goes beyond the internal consistency of the overall scale by using exploratory item response theory models. These models allow us to examine each item’s performance and the scale’s measurement precision across different levels of susceptibility to framing. Second, the present study examines the structural and cross-country validity of the scale using confirmatory factor analyses, as recommended in guidelines for evaluating measurement instruments (Chan, 2014). Last, the current results suggest that the scale performs poorly in university student samples and should not be used in its current version—a conclusion that challenges previous assumptions that the scale is valid and reliable.

Method

To assess the performance of the Resistance to Framing scale, we conducted an online survey among Bulgarian and North American university students. We had preregistered our research questions, sampling plan, study design, and data analyses on the Open Science Framework (https://osf.io/uj5rz, September 11, 2020) before data collection. We assure that we reported how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study (as suggested by Simmons et al., 2012). Deviations from the preregistration are reported in Supplement 2.

Sampling plan

Our preregistered target sample size was 250 complete responses per site after excluding responses categorized as duplicate, inattentive, or too fast (see Data exclusions). The data were collected as part of a larger research project to evaluate the dual-process account in Bulgaria and North America (Rachev et al., 2022). The sample size has been determined based on the fit and misfit of the confirmatory factor analysis (CFA) models and expected effect sizes used within this larger project. A priori power analyses for the present study with the R package semTools (Version 0.5-3; Jorgensen et al., 2020) indicated that 250 participants per site allowed us to detect fit and misfit of a one- (Resistance to Framing) and two-factor (risky-choice and attribute framing) model with more than 95% power (Supplement 1).Footnote 1

Participants

Participants were required to be at least 18 years old, citizens and residents of either Bulgaria or North America, native Bulgarian or English speakers, and university students with a non-psychological major. Characteristics of both samples are in Table 1.

Table 1 Sample characteristics (Bulgaria and North America)

Materials

The Resistance to Framing scale (Bruine de Bruin et al., 2007) consists of seven risky-choice and seven attribute framing items, with one positively and one negatively framed version per item. An example item of risky-choice framing is deciding between two investment strategies, as in the introduction of this paper. Participants indicate the strength of their preference for either strategy on a six-point scale with 1 Definitely would choose A and 6 Definitely would choose B. Resistance to Framing is then assessed by the absolute difference between the ratings for the positive and negative frame of every item (Bruine de Bruin et al., 2007), whereby lower scores represent lower susceptibility to framing.Footnote 2 Compared to the original scale, we slightly modified the phrasing of the instruction and two scenarios to make their meaning clearer.

The Bulgarian version of the Resistance to Framing scale was created from the original English version by five (risky-choice framing) and four (attribute framing) independent translators using forward and backward translation. In particular, three (risky-choice framing) and two (attribute framing) experienced translators translated the original items to Bulgarian and discussed discrepancies to arrive at a joint forward translation. After establishing the same procedure for the back translation with two additional translators, all translators and an additional adjudicator (the third author) compared the original version to the back translation and resolved discrepancies. Pilot-testing before data collection did not reveal any concerns about the items or response format.

Procedure

Participants in both sites were recruited from November 2020 to January 2021 via social media, emails, and universities. They were informed about the confidentiality and anonymity of their responses and provided written informed consent before starting the online survey. All individuals participated voluntarily and could determine their participation at any given moment. The study was approved by both local research ethics committees.

Participants completed the 30-min Qualtrics survey in their native language. They first created a withdrawal code and provided basic demographic information (i.e., age, gender, education, field and year of study, country of residence, citizenship, and native languages). Participants were then presented with either the positive or negative version of the risky-choice followed by the attribute framing block before being presented with the remaining versions. The framing problems within a block were presented in randomized order and forced-response format. Between the positive and the negative version, participants took approximately 10 min to complete the Bullshit Receptivity scale (Pennycook et al., 2015), the Actively Open-Minded Thinking scale (Baron & High II, 2019), and an attention check item (“Please select the option ‘neutral’ and proceed to the following question.”), all of which were part of the larger project. The survey ended with a debriefing and redirection to a separate survey where participants’ email addresses were recorded to enter a random draw of 165 (160 × $3 and 5 × $20) gift cards for Bulgarian participants and 90 (85 × $5 and 5 × $30) gift cards for North American participants.

Statistical analysis

All analyses were conducted using R (Version 4.0.2; R Core Team, 2020) with the statistical significance level set to α = .05.

Data exclusions

Participants were included in the analysis if they answered all items, provided informed consent, and completed the survey only once (i.e., unique combination of IP address, withdrawal code, and demographics including age, gender, major, and year of studies; Koo & Skinner, 2005). Participants were further included if they passed the attention check and did not complete the survey faster than 10 min.Footnote 3 In the Bulgarian and North American sample, 30 and 29 multivariate outliers, respectively, were excluded based on the generalized Cook’s distance using the R package influence.SEM (Version 2.2; Altoé & Pastore, 2018). Analyses including outliers are in Supplement 3.

Main analysis

We first inspected the descriptive statistics of each item, including mean, standard deviation, range, skewness, kurtosis, and inter-item as well as item-total correlations. To evaluate the factor structure, we fit a one-factor and two-factor CFA model in each site using the R package lavaan (Version 0.6-6; Rosseel, 2012).Footnote 4 The fit of each model was assessed using several fit indices (Table 5) and their preregistered standard cut-off values.Footnote 5 However, we refrained from comparing the models and from testing measurement invariance between sites, due to the low inter-item correlations (see Results). We used McDonald’s ω to assess the scale’s reliability as well as exploratory, non-preregistered item response theory (IRT) to get further insights into the scale’s measurement precision and the performance of each item. The IRT models were fit with the mirt package in R (Version 1.33.2, Chalmers, 2012) and full-information maximum likelihood estimation to account for the relatively small sample sizes (Forero & Maydeu-Olivares, 2009). We fit a graded response (Samejima, 1969) and a generalized partial credit model (Muraki, 1992), two commonly used polytomous models that apply to the data at hand. These models were compared based on the AIC and BIC as well as their sample-size adjusted counterparts (AICc and SABIC), and their fit was assessed based on multiple indices (Table 6).

Results

Descriptive statistics

Table 2 shows the descriptive statistics for both samples. Susceptibility to framing was generally low, especially for the attribute framing items. This was also supported by the skewness which indicated that all items were highly positively skewed (skewness > 1). Most inter-item correlations were small and non-significant (Tables 3 and 4). Item-total correlations (i.e., correlations between each item and the total score of all items excluding this item) ranged from .09 to .23 in Bulgaria and .04 to .32 in North America, which is below the recommended .30 (Boateng et al., 2018).

Table 2 Descriptive statistics, reliability, and factor loadings of the Resistance to Framing Scale (Bulgaria and North America)
Table 3 Inter-item and item-total correlations (Bulgaria)
Table 4 Inter-item and item-total correlations (North America)

The network plots in Fig. 1 additionally visualize that few inter-item correlations are larger than r = .10. Moreover, we would expect either that all items of the scale are substantially positively correlated (which would point toward a one-factor model) or that the items form two clusters, risky-choice and attribute framing (which would point toward a two-factor model). However, neither pattern seems to be supported.

Fig. 1
figure 1

Network plots of the bivariate correlations between the items of the Resistance to Framing scale for A Bulgaria and North America. Only bivariate correlations larger than r = .10 are displayed. Red edges indicate a negative correlation between two items, whereas green edges indicate a positive correlation

Structural validity

Inspecting the bivariate correlations already revealed that there may be problems with the structure of the scale. We nevertheless proceeded with the preregistered plan and fit one- and two-factor CFA models. The models’ structure was determined based on theory (see introduction) and the classification of the items into risky-choice and attribute framing proposed in the original article (Bruine de Bruin et al., 2007).

The hypothesized one- and two-factor model fit the Bulgarian data acceptably, based on the robust χ2-test, the robust RMSEA, and robust incremental fit indices (Table 5). However, in the one-factor model, two of 14 items (i.e., risky-choice: income tax and attribute framing: cheating) did not load significantly onto the factor. This problem persisted in the two-factor model where the attribute framing item cheating still did not load significantly onto its corresponding factor. More worryingly, all standardized factor loadings (Table 2) were low and below the recommended cut-off of .45 (Tabachnick et al., 2007).

Table 5 Model fit of the one- and two-factor model in the Bulgarian and North American sample

The hypothesized one-factor model did not fit the North American data acceptably (Table 5), and five of the 14 items (i.e., risky choice: cancer, attribute framing: condom, beef, exams, and cancer) did not load significantly onto the factor. While the two-factor model fit well (Table 5), four of the 14 items (i.e., risky-choice framing: cancer, attribute framing: condom, budget, and cancer) still did not load significantly onto their corresponding factor. Once again, the factor loadings were small, and only three out of 14 met the recommended criterion. As such small factor loadings indicate problems at a basic level of conceptualization and may explain the very good model fit (reliability paradox; McNeish & Wolf, 2021), we refrained from comparing the models in each site and from testing measurement invariance between sites.

Reliability

The Bulgarian one-factor model indicated poor reliability (McDonald’s ω = .46), similar to the two-factor model with risky-choice (McDonald’s ω = .31) and attribute (McDonald’s ω = .35) framing. This corresponds to the scale’s reliability in the North American sample (McDonald’s ω = .51) and the two separate scales, risky choice (McDonald’s ω = .50) and attribute framing (McDonald’s ω = .34).

Non-preregistered IRT models

Assumption checks

To get more detailed insights into the performance of each item, we fit an exploratory polytomous graded response and generalized partial credit model. Polytomous IRT models were used, as the Resistance to Framing scale may not be unidimensional (see CFA results). In the Bulgarian sample, both assumptions for fitting IRT models were met: local independence (i.e., items are uncorrelated when conditioning on the latent traits), as evidenced by relatively low discrimination parameters and residual covariances, and monotonicity (i.e., increased probability of endorsing an item constantly goes hand in hand with an increase in the latent trait; Nguyen et al., 2014). In the North American sample, the assumption of monotonicity was violated, with serious violations for the risky-choice item Dropout (crit = 88), unclear violations for five items (40 ≤ crit ≤ 80; attribute framing: condom, beef, cheating, exams, and cancer), and no violations for the other eight items (Molenaar & Sijtsma, 2000, as cited in Crișan et al., 2019). The graphical analyses corroborated these statistical results. As monotonicity was likely violated, no IRT models were fit to the North American data.

Model comparison

Comparing both models showed that the graded response model fit the Bulgarian data better than the generalized partial credit model. As displayed in Table 6, AIC and BIC as well as their sample-size adjusted analogs were lower for the graded response model. The other fit statistics indicated that both models fit the data well, as shown by a non-significant M2* value, a low RMSEA and SRMSR, as well as a high TLI (except for the graded response model) and CFI. The S−χ2 statistic indicated no misfit for any of the items of the graded response model but misfit for the risky-choice dropout item (S−χ2(40) = 59.741, p = .023) when using the generalized partial credit model. We thus selected the graded response model to examine the reliability, test information, and standard errors of the scale.

Table 6 Fit of the graded response and generalized partial credit model (Bulgaria)

In line with the reliability based on classical test theory, the IRT reliability was poor for risky-choice (0.52) and attribute framing (0.52). The test information in Fig. 2 shows how accurately the two subscales measure the latent traits, susceptibility to risky-choice (θ1) and attribute (θ2) framing. More test information thereby means higher measurement precision and lower standard errors of the underlying trait estimate (Nguyen et al., 2014). The test information was high for high levels of susceptibility to risky-choice and attribute framing, whereas it was very low for low levels (panel A). This means that high levels of susceptibility to framing can be measured with low standard errors and thus high precision, whereas low levels cannot be measured accurately (panel B). The test information of single items showed similar patterns: high test information and low standard errors for high levels of susceptibility to risky-choice and attribute framing, while for some items (e.g., risky-choice framing: disease), the test information decreased, and the standard error increased for extremely high levels of the underlying trait. The individual analysis of the test information function curves showed that among all items, the test information was highest for the two risky-choice framing items Disease (panel C) and Soldiers as well as the attribute framing item Parking. The test information and discriminating power (as indicated by a flat slope) were the lowest for the attribute framing item Cheating (panel D).

Fig. 2
figure 2

Test information function curve and test standard error for risky-choice (θ1) and attribute (θ2) framing (Bulgaria). Panel A displays the test information function curve for all items combined. Panel B displays the corresponding test standard error. Panel C and D display the test information function curve for the risky-choice framing item Disease and the attribute framing item cheating

Discussion

This study tested the performance of the Resistance to Framing scale. In an online survey among Bulgarian (N = 245) and North American (N = 261) university students, we planned to examine the scale’s psychometric properties, structural validity, and measurement invariance between sites. However, some of these examinations were not possible because the scale displayed low and mostly non-significant inter-item correlations as well as low item-total correlations. This suggests problems with the scale at a basic level of conceptualization, namely that the items may not represent the same content domain (Piedmont, 2014).

Confirmatory factor analyses suggested that the one- (one Resistance to Framing factor) and two-factor (resistance to risky-choice framing and resistance to attribute framing) model fit the data. However, this good fit may be caused by the low inter-item correlations, low factor loadings, and low reliability (reliability paradox; McNeish & Wolf, 2021). We, therefore, cannot make any conclusions about the factor structure, except that based on the inter-item correlations neither of the structures seemed to be supported.

The reliability was low in both sites, meaning that the individual items are only weakly related to each other and that participants’ observed scores do not reflect their true susceptibility to framing. The scale’s reliability in the present study was especially low compared to previous estimates. This could have resulted from the homogeneous nature of our student samples which displayed relatively low susceptibility to framing, thereby restricting variability and reducing reliability (Hedge et al., 2018). This could have also resulted from the short time interval between the two framing blocks which may have made it easy for participants to recognize that the task pairs are objectively equivalent and respond consistently across frames (but see Aczel et al., 2018). Exploratory item response theory analyses in the Bulgarian sample suggested that the scale can measure high levels of susceptibility to framing relatively accurately, while it is inaccurate for low levels. This means that the scale should be used with caution especially in samples where low susceptibility to framing is expected, for example presumably university students.

In addition, low reliability may result in biased correlation estimates between Resistance to Framing and other measures (Nimon et al., 2012). To correct for measurement error, we recommend computing disattenuated correlations or using latent variable modeling when using the scale. However, we would like to emphasize that this solution does not capture the full effect of measurement error on results. The most optimal way would be to improve the scale’s reliability, particularly since poor reliability has been a pervading problem of the Resistance to Framing scale and also of similar measures (e.g., Stanovich et al., 2016). Recently, a new risky-choice framing scale has become available (Berthet, 2021). Berthet was able to improve the scale’s reliability (Cronbach’s α 0.15 in Study 1 vs. 0.74 in Study 2) by increasing the number of items from four to eight and by slightly tweaking the cover stories in each pair of framing tasks (e.g., an accident caused by an autonomous vehicle vs. a train), while keeping the numbers and overall topic identical. The same approach may be applied to improve the reliability of both risky-choice and attribute framing of the Resistance to Framing scales. A positive result may not only represent good news for framing researchers but also point out that using identical tasks is suboptimal for within-participants designs.

An additional problem concerns the scoring of the scale. The original scoring system, also adopted in similar framing measures (Stanovich et al., 2016), is to average over the absolute differences in ratings of paired framing tasks. However, the framing effect is directional (see also Berthet, 2021), that is, higher risk-seeking is expected in the loss versus the gain frame, and higher approval is expected when an object is described in positive versus negative terms.Footnote 6 Based on theory, it is reasonable to use directional rather than absolute differences. Directional scoring is problematic, however, because inconsistencies in opposite directions cancel each other out, which underestimates susceptibility to framing in individuals who display “negative” framing effects on certain problems. Not surprisingly, Berthet (2021) reported that absolute scoring increased the internal consistency of his risky-choice framing scale relative to directional scoring. This paradox of better theoretical justification leading to less optimal measurement should be kept in mind when using the scale. As LeBoeuf and Shafir (2003) noted, within-participants administration of the same framing tasks measures the ability to stay consistent across frames, not the ability to resist the initial influence of the frame. The choice of the scoring system may, therefore, depend on what should be measured, and which question should be answered.

Limitations

This study has several limitations. Firstly, we found no clear answers to the questions of whether the scale’s structure and measurement invariance, due to low inter-item correlations. Secondly, we did not assess the convergent and discriminant validity of the scale, which would have allowed for a more comprehensive assessment, especially of the psychometric properties. Lastly, generalizability may be limited due to homogenous samples including highly educated, young, and in Bulgaria mainly female individuals. Generalizability may be further limited due to the short time frame between the two administrations and the relatively high—although for online surveys common—dropout rate (Galesic, 2006).

Summary

In sum, this study points to three potential problems with the Resistance to Framing scale. First, low inter-item and item-total correlations suggest that the items of the scale may not represent the same content domain. Second, due to these low correlations, it remains unclear whether the scale’s factor structure is one- or two-dimensional. Last, the low reliability of the scale may prevent accurate measurement of susceptibility to framing, especially among highly resistant individuals. By providing open code and data (https://osf.io/j5n6f), we support researchers in testing the scale with other samples (e.g., general population, different languages and countries) to obtain a comprehensive picture of its performance.