The ability to accurately judge risk is an important cognitive ability with far-reaching implications for one’s well-being. One important example involves the decision to undergo genetic testing for breast cancer. Numerous costs are associated with genetic testing, including, but not limited to, the cost of the test itself, the potential for genetic discrimination by employers and insurers, possible conflicts among family members, and increased anxiety (e.g., Brewer, Richman, DeFrank, Reyna, & Carey, 2012). Given these stakes, it is necessary to accurately judge risk in order to properly weigh potential costs against potential benefits.

Probability theory serves as the normative model of risk judgment, in which judgments vary on a continuum ranging from 0 (impossible) to 1 (certain). The axioms of probability theory ensure that risk judgments are coherent (internally consistent), whereas careful measurement and sound methodology can provide reasonable calibration (external validity) (Reyna & Adam, 2003). Although there are clear benefits of using probability as a normative model, pervasive judgment biases suggest that probability theory is a poor descriptive model of judgment and may be of limited use as a prescriptive model. Some examples of judgment biases include the conjunction fallacy (Wolfe & Reyna, 2010), base-rate neglect (Barbey & Sloman, 2007; Reyna & Brainerd, 2008), and the lack of semantic coherence in conditional probability judgments (Fisher & Wolfe, 2011; Wolfe, Fisher, & Reyna, 2012). These problems are further exacerbated by poor numeracy, defined as the ability to reason with basic quantitative concepts (Reyna, Nelson, Han, & Dieckmann, 2009).

For the reasons cited above, the probability scale is a poor response mode for assessing subjective risk judgment (see also Haase, Renkewitz, & Betsch, 2013). As a prescriptive alternative, we propose a gist-based response mode consisting of the ordinal categories low, medium, and high. The use of ordinal gist categories is theoretically grounded in fuzzy trace theory (FTT; Reyna, 2008), but it may be compatible with other theoretical orientations and has clinical applications. In the remainder of this article, we briefly describe FTT as it applies to risk judgment, the development of materials defined by externally validated risk categories, and an arguably underutilized signal detection model that can be applied to more than two response categories. We end with a discussion of the approach within the larger domain of risk judgment and a discussion of the assumptions of the model.

Fuzzy trace theory

FTT is a theory of memory that has implications for risk judgment (Reyna, 2008). According to FTT, memory is multifaceted, with multiple representations ranging from verbatim to gist. In FTT, the terms “verbatim” and “gist” retain much of their colloquial meaning. Verbatim refers to exact, surface-level details of risk, whereas gist refers to its qualitative meaning, including one’s affective response. One tenet of FTT is that people prefer to reason with the most gist-like representation that is applicable to a given situation. In terms of risk judgment, a mental representation consisting of an exact numerical probability would be located at the verbatim end of a continuum, whereas risk present/absent would be located at the gist end of the continuum. FTT suggests that the ordinal gist categories “low,” “medium,” and “high” reflect a level of resolution frequently used by laypeople when assessing levels of risk (Reyna, 2012).

Signal detection theory

Signal detection theory (SDT) is a formal framework for assessing performance in discrimination and categorization, which has been successfully applied in psychophysics and memory research (Wickens, 2002). According to SDT, a quantity such as subjective risk can be represented as an underlying continuum upon which a response criterion is set to define response categories. The process is error-prone and can accordingly be represented as overlapping distributions, as is shown in Fig. 1. The benefit of using SDT is that it can disentangle two important aspects of judgment: discriminability and the judgment criterion. Discriminability, as measured by d′, refers to the ability to differentiate risk categories. The parameter d′ is defined as the standardized difference between the distributions and conceptually represents the degree of overlap between the distributions. A d′ value can be compared with chance performance and perfect performance as benchmarks to aid in interpretation. At one extreme, d′ equals 0 when performance is at chance levels. At the other extreme, d′ approaches infinity as performance increases.

Fig. 1 Depiction of signal detection with two responses

Criterion refers to the threshold that separates one response category from another. In Fig. 1, the black vertical line represents the criterion. It is often measured with respect to the intersection of the distributions, which indicates equal weighting of two possible errors: misses (e.g., failing to detect the presence of risk) and false alarms (e.g., incorrectly stating the presence of risk when it is absent). This is known as response bias, denoted c′. Values of d′ and c′ are estimated in the model through the unique combination of hit rates (correctly identifying the presence of risk) and false alarm rates.
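The way hit rates and false alarm rates jointly determine the two parameters can be sketched in Python. This is a minimal sketch that assumes the conventional equal-variance Gaussian formulas for the binary case, with response bias measured from the intersection of the two distributions. The function name `dprime_and_bias` and the example rates are hypothetical and are not taken from the authors' spreadsheet.

```python
from statistics import NormalDist


def dprime_and_bias(hit_rate, false_alarm_rate):
    """Equal-variance estimates of discriminability and response bias.

    Assumes both rates lie strictly between 0 and 1; rates of exactly
    0 or 1 have no finite z-score and would need a correction first.
    """
    z = NormalDist().inv_cdf          # inverse of the standard normal CDF
    z_hit = z(hit_rate)
    z_fa = z(false_alarm_rate)
    d_prime = z_hit - z_fa            # standardized distance between the means
    bias = -0.5 * (z_hit + z_fa)      # 0 when the criterion sits at the intersection
    return d_prime, bias


# Hypothetical rates, chosen only for illustration.
d_prime, bias = dprime_and_bias(0.80, 0.20)
print(round(d_prime, 3), round(bias, 3))
```

Because the example rates are symmetric around .50, the estimated bias is essentially zero, which corresponds to equal weighting of misses and false alarms.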

Figure 1 is an example of the two most common SDT paradigms: the yes–no experiment and two-alternative forced choice. The common feature between these paradigms is the use of binary responses. These paradigms can be generalized to a k-alternative identification paradigm with k(k – 1)/2 measures of discriminability and k – 1 measures of response bias (Wickens, 2002, p. 124). In the present article, we describe a three-alternative identification model. As is shown in Fig. 2, the three distributions correspond to the ordinal gist categories of low, medium, and high risk. As we previously noted, FTT suggests that people represent risk at the resolution of three ordinal gist categories. The model contains three parameters for discriminability, one for each of the three pairwise comparisons between distributions. These are d_lm, d_lh, and d_mh, where l, m, and h denote “low,” “medium,” and “high,” respectively; d_lm measures discriminability between the low- and medium-risk categories, d_lh measures discriminability between the low- and high-risk categories, and d_mh measures discriminability between the medium- and high-risk categories. In addition, the model has two parameters for response bias, one that separates low from medium judgments and a second that separates medium from high judgments, denoted c_lm and c_mh, respectively. For simplicity, the model assumes equal variances.

Fig. 2 Depiction of the three-alternative identification model. Gray areas represent a hit for each of the risk categories. The left vertical black line is the criterion separating low- from medium-risk responses, and the right vertical line is the criterion separating medium- from high-risk responses
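The structure of the three-alternative model can be illustrated with a short Python sketch. It assumes unit-variance normal distributions on a single risk axis with the two criteria expressed on that same axis. The function names, the means, and the criterion values are hypothetical choices for illustration, not the authors' parameterization.

```python
from statistics import NormalDist


def response_probabilities(mean, c_lm, c_mh, sd=1.0):
    """Probabilities of responding low, medium, and high for one distribution."""
    if c_lm > c_mh:
        raise ValueError("c_lm must not exceed c_mh")
    cdf = NormalDist(mu=mean, sigma=sd).cdf
    p_low = cdf(c_lm)                 # area to the left of the lower criterion
    p_high = 1.0 - cdf(c_mh)          # area to the right of the upper criterion
    p_medium = 1.0 - p_low - p_high   # area between the two criteria
    return p_low, p_medium, p_high


def parameter_counts(k):
    """Counts of pairwise discriminability and response-bias measures for k categories."""
    return k * (k - 1) // 2, k - 1


# Hypothetical values: low-risk mean fixed at 0, criteria on the same axis.
means = {"low": 0.0, "medium": 1.0, "high": 2.0}
for label, mean in means.items():
    probs = response_probabilities(mean, c_lm=0.5, c_mh=1.5)
    print(label, [round(p, 3) for p in probs])

print(parameter_counts(3))
```

Each row of printed probabilities corresponds to one row of a 3 × 3 confusion matrix, and the hit for each category is the probability of the response that matches the distribution.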

Simple binary SDT models, such as yes–no and two-alternative forced choice, have been used extensively in the literature. Tutorials (Stanislaw & Todorov, 1999) and spreadsheets (Sorkin, 1999) have been devoted to their use. By contrast, the k-alternative identification model has received considerably less attention in the literature. In this article, we build upon previous work by describing the computational details and implementation of the theoretically motivated three-alternative identification model, which can easily be extended to accommodate an arbitrary number of categories. The model can easily be implemented in standard programs such as Microsoft Excel, MATLAB, and R, using cumulative normal distribution functions and an optimization algorithm. We provide a ready-to-use spreadsheet for the three-alternative identification model that can be found in the supplementary materials available on the Behavior Research Methods website. Unlike previous efforts, our spreadsheet also includes several useful features, such as the standard errors of the parameter estimates, hierarchical model comparisons for testing differences in the parameters, and posterior probability approximations using the Bayesian information criterion (see Appendix A for computational details).
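The combination of cumulative normal distribution functions and an optimization algorithm can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' spreadsheet: it fixes the low-risk mean at 0, treats d_lm and d_lh as the free distances on a single axis, and minimizes the likelihood-ratio statistic G² with a simple compass search chosen here only because it needs no external library. With four free parameters and six free response proportions, this parameterization leaves 2 degrees of freedom. The counts and all function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def predicted(params):
    """3 x 3 matrix of response probabilities (rows: low, medium, high cases)."""
    d_lm, d_lh, c_lm, c_mh = params
    rows = []
    for mean in (0.0, d_lm, d_lh):          # low-risk mean is fixed at 0
        p_low = STD_NORMAL.cdf(c_lm - mean)
        p_high = 1.0 - STD_NORMAL.cdf(c_mh - mean)
        rows.append((p_low, 1.0 - p_low - p_high, p_high))
    return rows


def g_squared(params, counts):
    """Likelihood-ratio statistic G^2 = 2 * sum(O * ln(O / E))."""
    if params[2] >= params[3]:              # criteria must stay ordered
        return math.inf
    total = 0.0
    for obs_row, prob_row in zip(counts, predicted(params)):
        n = sum(obs_row)
        for obs, prob in zip(obs_row, prob_row):
            if obs > 0:
                expected = n * max(prob, 1e-12)   # guard against log of zero
                total += 2.0 * obs * math.log(obs / expected)
    return total


def fit(counts, start=(1.0, 2.0, 0.5, 1.5), step=0.5, tol=1e-6, max_sweeps=10000):
    """Minimize G^2 with a compass search over one parameter at a time."""
    best = list(start)
    best_g2 = g_squared(best, counts)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[i] += delta
                trial_g2 = g_squared(trial, counts)
                if trial_g2 < best_g2:
                    best, best_g2, improved = trial, trial_g2, True
        if not improved:
            step /= 2.0                     # refine the search once it stalls
            if step < tol:
                break
    return best, best_g2


# Hypothetical confusion matrix (rows: true low, medium, high; columns: responses).
counts = [[200, 80, 20], [90, 130, 80], [30, 90, 180]]
estimates, g2 = fit(counts)
d_lm, d_lh, c_lm, c_mh = estimates
print("d_lm", round(d_lm, 3), "d_lh", round(d_lh, 3), "d_mh", round(d_lh - d_lm, 3))
print("c_lm", round(c_lm, 3), "c_mh", round(c_mh, 3), "G2", round(g2, 3))
```

In this sketch d_mh is derived as d_lh minus d_lm, which follows from placing all three distributions on one axis. A gradient-based or simplex optimizer would serve equally well in place of the compass search.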

Research materials

An important prerequisite for using SDT is the development of stimuli that fall into objectively defined categories, a task that is difficult, but not impossible, outside of psychophysics. Genetic risk level in this study was validated with the Pedigree Assessment Tool (PAT; Hoskins, Zwaagstra, & Ranz, 2006). The PAT estimates genetic risk on the basis of empirically verified risk factors, including family history of breast cancer and ethnicity. Ordinal gist categories were defined using cutoff values for the PAT (see Appendix B). Importantly, a nationally recognized medical expert in women’s health and clinical decision-making also vetted the cutoff values as defining low-, medium-, and high-risk categories.

On the basis of these defined categories, we developed 12 cases of hypothetical women who varied in genetic risk for breast cancer. These included four low-risk, four medium-risk, and four high-risk cases. A complete listing of the 12 cases can be found in Appendix B. Careful attention was given to the development of tightly controlled and standardized cases. Each case included the following information: name, age, ethnicity, hometown, and family and personal health history. Given that age is a strong, nongenetic predictor of breast cancer, age was equated across the risk categories. The 12 cases were also equated in terms of word length and linguistic complexity, as measured by the Flesch–Kincaid Grade Level score and the Flesch Reading Ease score. This precluded the possibility that higher risk judgments would superficially be given to scenarios that contained more words, were more difficult to read, or were more jargon-laden.
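For readers unfamiliar with the two readability indices, the standard published formulas can be sketched in Python. The sketch takes word, sentence, and syllable counts as inputs rather than counting them from text, and the example counts are hypothetical. It is not known from the text which software the authors used to obtain their scores.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a single case description.
words, sentences, syllables = 100, 5, 150
print(round(flesch_reading_ease(words, sentences, syllables), 2))
print(round(flesch_kincaid_grade(words, sentences, syllables), 2))
```

Both indices depend only on average sentence length and average syllables per word, so equating cases on these scores controls surface complexity without constraining content.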

An empirical study

We used the SDT model with these 12 breast cancer risk cases to test the efficacy of the Breast Cancer Genetics Intelligent Semantic Tutor (BRCA Gist). BRCA Gist is an intelligent tutoring system created using AutoTutor Lite (Hu et al., 2009) to teach women about genetic breast cancer risk (see also Graesser et al., 2004). AutoTutor Lite is a Web-based instantiation of AutoTutor, which has been implemented successfully in knowledge domains as diverse as physics (Jackson, Ventura, Chewle, Graesser, & the Tutoring Research Group, 2004), computer science (Craig, Sullins, Witherspoon, & Gholson, 2006), and behavioral research methods (Arnott, Hastings, & Allbritton, 2008). AutoTutor Lite has shown some successes in improving coherence in probability judgments (Wolfe, Fisher, Reyna, & Hu, 2012). BRCA Gist consists of a human-like avatar that communicates to the user verbally and can provide information by means of a variety of multimedia channels, including spoken and written text, video, and graphics. During tutorial interactions, BRCA Gist poses questions, and the user responds by typing into a dialog box. Throughout the tutorial interaction, BRCA Gist provides feedback and encouragement, prompting the user to elaborate on her answers. In combination with content that focuses on bottom-line gist, this process of tutorial interaction is used to increase learning.

In this experiment, 200 women were randomly assigned to either BRCA Gist (N = 68) or one of two comparison conditions: the National Cancer Institute (NCI; N = 65) website, or an irrelevant nutrition control group (N = 67). To control for time on task, the control and NCI conditions had a completion time of 60 min, which is comparable to the length of the BRCA Gist tutorial. The nutrition control group received irrelevant nutrition information, and thus served as a point of reference to assess learning gains. The NCI group read information related to genetic risk of breast cancer from the NCI website. We made PDFs of 26 NCI web pages and hosted them on the experimenter’s server. Thus, participants read and saw all of the relevant content and navigated among pages with the aid of a navigation bar, but they could not follow any hyperlinks. This also prevented potential changes in content during the experiment. The NCI group allowed us to compare the efficacy of BRCA Gist to that of the NCI website, an information source developed by experts on genetic risk of breast cancer that was the major source used in the development of BRCA Gist.

BRCA Gist covered many of the same topics covered in the NCI group, but it also included graphics and videos designed to help participants develop the appropriate gist representation of key concepts such as the relationship between BRCA mutations and breast cancer in the population of American women. The BRCA Gist materials were vetted by a medical expert. An animated avatar presented information verbally, with key concepts being presented concurrently in text. Women engaged in five tutorial interactions throughout the experiment. During the interactions, BRCA Gist posed a question such as “What should someone do if she receives a positive result for genetic risk of breast cancer?” and women typed responses into a dialog box. BRCA Gist provided feedback and encouraged elaboration on the basis of the relevance and thoroughness of the answers. See Wolfe et al. (2013) for a more thorough discussion of these tutorial dialogues.

After completing the learning phase, participants judged the genetic breast cancer risk for each of 12 randomly ordered cases as a measure of distant transfer, that is, their ability to apply what they had learned to hypothetical cases. On each trial, participants read one randomly selected case and categorized it as low, medium, or high risk. Other measures were also collected, including a declarative knowledge assessment, the State–Trait Anxiety Inventory (Spielberger, Gorsuch, & Lushene, 1970), and the PAT. However, a detailed discussion of these measures is beyond the scope of this article.

Results

Responses were organized into the nine response categories formed by a 3 × 3 confusion matrix (see Appendix A) and aggregated for each group (see Table 1). Model fitting and model comparison were performed using the spreadsheet developed by the authors, which can be downloaded from the Behavior Research Methods website. The spreadsheet includes the data from the present study and a worked example. The model fits for the nutrition control, G²(df = 2) = .71, p = .70, and NCI, G²(df = 2) = .79, p = .67, conditions were good. The model departed from the data in the BRCA Gist condition, G²(df = 2) = 16.86, p < .001. However, the magnitude of departure was very small, w = .14. On this basis, the model is arguably a satisfactory fit. Figure 3 shows the best-fitting parameter estimates of d′ and c′ for each condition.
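The reported fit statistics can be checked with a few lines of Python. The sketch uses the closed-form upper-tail probability of a chi-square variate with 2 degrees of freedom, exp(−x/2). It also assumes that w is an effect size of the form sqrt(G²/N), with N taken as the total number of judgments in the BRCA Gist condition (68 women × 12 cases). That definition of w and of N is an assumption, because the computation is not spelled out in this section.

```python
import math


def chi2_upper_tail_df2(x):
    """Upper-tail probability of a chi-square variate with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


def effect_size_w(g2, n_responses):
    """Effect size computed as sqrt(G^2 / N); an assumed definition of w."""
    return math.sqrt(g2 / n_responses)


# G^2 values as reported in the text, each with df = 2.
reported_g2 = {"nutrition control": 0.71, "NCI": 0.79, "BRCA Gist": 16.86}
for condition, g2 in reported_g2.items():
    print(condition, round(chi2_upper_tail_df2(g2), 4))

# Assumption: N is the number of judgments, 68 participants x 12 cases.
n_brca_gist = 68 * 12
print(round(effect_size_w(16.86, n_brca_gist), 2))
```

Under these assumptions the sketch reproduces the reported p = .70, p = .67, p < .001, and w = .14.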

Table 1 Observed and expected response probabilities by condition

Fig. 3 Group comparisons of d′ (top panel) and c′ (bottom panel). Error bars represent standard errors. Groups marked by different letters are statistically different at p < .05

A hierarchical model comparison was used to test the pairwise differences in parameters for the BRCA Gist, NCI, and control conditions (see Appendix A for computational details). BRCA Gist showed increased discriminability for all risk categories, relative to the nutrition control. As compared with the nutrition control, NCI showed increased discriminability only for d_lm and d_lh. However, BRCA Gist was not statistically different from NCI (see Table 2). These results suggest that both BRCA Gist and NCI generally improved women’s ability to discriminate among levels of genetic breast cancer risk. The improvement was somewhat more consistent for BRCA Gist, which differed from the control on all three discriminability measures, although BRCA Gist was not statistically better than reading the NCI website. In addition, we found no differences in response bias between the groups for either measure. Taken together, these results suggest that the increases in performance for both BRCA Gist and the NCI website were due to improved risk discrimination, rather than a simple shifting of response criteria.
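The logic of a hierarchical (nested) model comparison can be sketched in Python. The sketch assumes the usual likelihood-ratio approach: a restricted model that constrains a parameter to be equal across two groups is compared with a full model that estimates it separately, and the difference in G² is referred to a chi-square distribution. It also shows a common BIC-based approximation to the posterior probability of the restricted model under equal prior odds. Whether this matches the computations in Appendix A is an assumption, and all numerical inputs below are hypothetical.

```python
import math


def chi2_upper_tail(x, df):
    """Upper-tail chi-square probability; closed forms for df = 1 and df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch supports only df = 1 or df = 2")


def compare_nested(g2_restricted, g2_full, df_diff, n_obs):
    """Likelihood-ratio test and BIC-based posterior for a restricted model."""
    delta_g2 = max(g2_restricted - g2_full, 0.0)   # guard against rounding below zero
    p_value = chi2_upper_tail(delta_g2, df_diff)
    # BIC(restricted) - BIC(full): worse fit minus the penalty saved by fewer parameters.
    delta_bic = delta_g2 - df_diff * math.log(n_obs)
    posterior_restricted = 1.0 / (1.0 + math.exp(delta_bic / 2.0))
    return delta_g2, p_value, posterior_restricted


# Hypothetical G^2 values for one equality constraint (df_diff = 1).
delta_g2, p_value, posterior = compare_nested(
    g2_restricted=9.20, g2_full=1.50, df_diff=1, n_obs=1600
)
print(round(delta_g2, 2), round(p_value, 4), round(posterior, 3))
```

A small p value indicates that the equality constraint worsens the fit, that is, that the parameter differs between the two groups. The posterior probability gives a complementary summary that penalizes the extra parameter of the full model.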

Table 2 Pairwise comparisons for each d′ and c′ parameter

Discussion

SDT is a useful formal framework for assessing performance in discrimination and categorization tasks for which objective categories can be defined. We demonstrated that it is possible to apply SDT to perceptions of risk. With the aid of the PAT and expert medical judgment, we were able to estimate the risk of individuals (presented as case-based scenarios) with reasonable accuracy. Moreover, the case materials in Appendix B were standardized in terms of several dimensions, including word length, linguistic complexity, and nongenetic risk factors for breast cancer, such as age.

One major advantage of using SDT, as opposed to simple percentages correct, is that it can disentangle risk discriminability from the response bias (i.e., decision threshold) on which judgments of risk are based. In this experiment, BRCA Gist, an intelligent tutoring system, and the NCI website both increased women’s ability to discriminate genetic risk for breast cancer, although only BRCA Gist improved discrimination on all three pairwise comparisons among low, medium, and high risk relative to the control. However, no differences in response bias emerged among any of the groups, suggesting that BRCA Gist and NCI do not appreciably alter how women weight errors (misses vs. false alarms). Thus, participants did not improve accuracy by simply adopting a stricter or more lenient decision criterion. The lack of a change in response bias is a desirable outcome, considering that it is a subjective component of risk judgment. Without a mathematical model, such as SDT, it would have been impossible to isolate these components of risk judgment.

The utility of the SDT model is illustrated by the way it has informed our ongoing efforts to develop and improve the BRCA Gist tutorial. For example, BRCA Gist improved declarative knowledge of genetic breast cancer risk relative to the NCI website (results not reported here). A model-based analysis indicated that this gain in knowledge did not necessarily transfer to the more distal task of risk judgment. Bearing these results in mind, we are modifying BRCA Gist to place greater emphasis on conveying the gist of risk. Alternatively, BRCA Gist could provide training in risk judgment with explicit feedback.

Another advantage of SDT is its flexibility. By far the most common paradigms are the yes–no and two-alternative forced choice paradigms, each involving simple binary choice. However, SDT can be generalized to a larger number of categories. As a result of this flexibility, we were able to model risk judgment using three ordinal gist categories—low, medium, and high—that were theoretically grounded in FTT.

One potential issue is that the assumptions of the model may not hold perfectly, a common problem in mathematical modeling. In particular, the model assumes normality, equal variances, and the unidimensionality of risk. We have several grounds for believing that these assumptions are reasonable, even if not fully satisfied. First, the model provided a good fit to the data in the nutrition control and NCI groups. Adding more parameters would have diminishing returns, and would likely decrease the generalizability of the model to replications of the same experiment (Pitt & Myung, 2002). Although the model did not fit the BRCA Gist group as well, the magnitude of the discrepancy was small. However, given the relatively high value of G², the finding that the BRCA Gist tutorial increased sensitivity without affecting response bias (that is, the independence of d′ and c′) must be interpreted with caution. Ultimately, further investigation will be needed to refine the model and evaluate the tenability of its assumptions. What we have provided here is the initial computational and methodological groundwork for extending SDT into the domain of risk judgment.

Our theoretical rationale for using three ordinal gist-based risk categories is motivated by FTT and by research supporting the psychological reality of ordinal categories. However, the approach that we have outlined is compatible with other theoretical frameworks. Moreover, FTT does not limit its potential clinical application; on the contrary, FTT has been tested empirically in other domains of health and medical decision making (e.g., Reyna & Lloyd, 2006). Although the present study involved genetic risk of breast cancer, our approach is broadly applicable to other risk domains and can provide insight into people’s ability to understand and judge risk. Risk communication, especially communicating meaningful gist, is particularly important in light of the shift toward patient-centered care, in which patients assume more involvement in the decision-making process (Elwyn et al., 2012; Reyna et al., 2009).