Background

Preferences and perceived similarities or differences between choice alternatives can be evaluated using structured vignettes. There are two prominent methods of constructing such models of medical judgements, each with its own literature and set of advocates. These are conjoint analysis, developed in the 1970s to study preference and choice [1], and judgement analysis, also called social judgement theory, developed in the 1950s from Brunswik's lens model [2, 3]. The two have evolved along very different theoretical lines and have developed somewhat different methodologies, although there is considerable overlap. Today there are numerous marketing applications in which the joint effects of multiple product attributes on product choice have been studied. The types of choices include 'ranking', 'rating', and 'discrete choice'.

These methods can be carried forward to the analysis of medical decision making, as medical decisions require judgement under uncertainty. This uncertainty may concern a state, such as the presence of illness, the likelihood of future events, such as those in the natural course of an illness, or the likelihood with which such events may be averted, that is, treatment effects. For many years decision-making research has explored physicians' estimation of probabilities given clinical scenarios [4]. However, there have been concerns about whether physicians' probability estimates are consistent [5]. Moreover, cognitive psychological research shows that physicians do not apply probabilities as suggested by decision-making theory but instead use their own heuristics to decide [6–8].

Studies of medical choice and judgement offer a way to elicit the public's, patients' and caregivers' views on healthcare that circumvents probability statements [9–11]. The technique is gaining widespread use in healthcare and has been applied in different areas, for example to establish patients' preferences in the doctor-patient relationship [12] or to determine optimal treatments for patients [13]. Increasingly, discrete choice analyses are being employed to study how physicians weigh clinical information in the diagnostic work-up. In particular, respondents are asked to rank, rate, or choose between simulated clinical cases that vary in the values of different symptoms, indicating the likelihood that a given case has a certain illness or needs a certain treatment. Comparison with the results of clinical studies allows an analysis of potential discrepancies (e.g. undervaluation of signs and symptoms, overvaluation of test results). Moreover, such comparisons with reference data from clinical studies allow physicians' behaviour to be linked to illness probabilities and therefore allow examination of (implicit) decision thresholds.

A considerable number of studies have been published recently. We provide an overview of existing reports, present an inventory of their objectives and methods, and evaluate them using systematic review methodology.

Methods

We defined a study of medical choice and judgement as an investigation in which preferences were elicited from physicians, nurse practitioners or medical students and that allowed estimation of the relative importance of different characteristics.

Search strategy

We performed electronic searches in Medline, PsycINFO and CINAHL (Ovid® versions). Web of Science (ISI Web of Science®) was used to locate studies that cited four key papers [14–17]. The last update search was performed on 25 March 2005. The exact search strategy may be obtained from the authors.

Inclusion criteria

Eligible articles for this review had to infer cue or attribute weighting from answers to structured vignettes and had to report on caregivers' decision making.

Data extraction strategy

We developed a data extraction form based on the assessment of three articles [17–19]. The form contained twelve items describing each study's salient features of context, design and analysis (for details see Table 1).

Table 1 Salient features of studies included in the systematic review.

Besides study descriptors such as first author and year of publication, we extracted information on the studies' objectives, the clinical problem, the decision-maker, the type of decision or preference (diagnosis, treatment, risk, prognosis, diagnosis and treatment, or other), the number of participants and the authors' aims. The objectives were assigned to five categories: (1) description of preferences in one group of caregivers, (2) comparison of two or more groups, such as different professions or different levels of competence, (3) assessment of the consistency of caregivers' judgements with their actual decisions or their direct ratings of the attributes, (4) assessment of changes in preferences over time, e.g. after attending a course, and (5) comparison of caregivers with guidelines (5a), actual patients' preferences (5b), or the findings of one or more clinical studies (5c).

We also registered the number of vignettes, the number of attributes per vignette and the rationale behind the selection of the attributes. Finally, we documented how participants were asked to respond to the vignettes (rating, ranking, probability estimate, yes/no choice, or discrete choice) and the way, if any, in which authors accounted for correlated data in the analysis. We extracted this last item because observations from these experiments are typically not independent: each respondent evaluates each of the vignettes, which makes data from the same respondent more alike than one would expect under independence, so that standard errors of the attribute weights may be underestimated. We searched for any statistical method that adjusts the standard errors for this intra-group correlation.
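
As an illustration of this kind of adjustment, the sketch below fits attribute weights to a rating response by ordinary least squares and then re-estimates the standard errors with respondent-level clustering. The data set is simulated and the attribute names (fever, crp) are invented for illustration; the only assumption is the typical long format of one row per respondent-vignette pair.

```python
# Minimal sketch on simulated data: one row per respondent x vignette.
# Attribute weights are estimated by OLS; standard errors are computed
# with and without adjustment for clustering by respondent.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "respondent": np.repeat(np.arange(40), 16),  # 40 respondents x 16 vignettes
    "fever": rng.integers(0, 2, 640),            # invented attribute (absent/present)
    "crp": rng.choice([10, 50, 100], 640),       # invented attribute (mg/L)
})
df["rating"] = (2 + 1.5 * df["fever"] + 0.02 * df["crp"]
                + rng.normal(0, 1, 40)[df["respondent"]]  # shared respondent offset
                + rng.normal(0, 1, 640))                  # vignette-level noise

naive = smf.ols("rating ~ fever + crp", data=df).fit()
clustered = smf.ols("rating ~ fever + crp", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["respondent"]})

print(naive.bse)      # standard errors assuming independent observations
print(clustered.bse)  # standard errors adjusted for within-respondent correlation
```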

All studies were assessed in duplicate. Discordant scores based on reading errors were corrected. Discordant scores based on real differences in interpretation were discussed and resolved through consensus.

Results

The searches retrieved 2001 records. Full papers of 81 potentially relevant studies were obtained. In total, 51 articles did not meet the inclusion criteria and were excluded after reading the full reports, leaving 30 reports published between 1983 and 2005 for evaluation (see flow chart in Figure 1). The salient features of the included studies are shown in Table 1.

Figure 1

Study flow.

General aspects

Although the first study was published in 1983, 24 studies (80%) were published after 1995.

Twenty-seven out of thirty studies examined decision behaviour of medical experts [15, 17–42]. In half of the studies more than one type of respondent was surveyed.

Twenty-eight different medical problems were addressed. Twenty-two (73%) studies examined treatment decisions. Eleven studies (37%) asked for a preferred diagnostic decision, sometimes (6 studies) in combination with a treatment decision.

Objectives

Ten studies (33%) aimed at describing decision preferences of specific groups of participants [20, 24, 26, 27, 29, 30, 32, 38, 43, 44] and nine studies (30%) described decision preference differences between groups [18, 25, 28, 31, 34, 36, 37, 40, 41]. Three studies explored the consistency of decisions between groups of experts [15, 23, 45] and two studies examined change of preferences after an intervention [22, 35]. Only six studies (20%) compared decision behaviour against some sort of empirical reference such as a guideline [21, 39, 42] (n = 3), actual patient data [17, 19] (n = 2) or the result of a clinical study [33] (n = 1).

Design

The median number of attributes was 6.5 (interquartile range (IQR) 4–9, range 2–15). In 20 studies (67%) the selection of attributes was based on sources such as the literature [17, 20, 27, 28, 31, 32, 34, 35, 37, 40, 43, 45] (12 studies), expert opinion (7 studies) or guidelines (1 study). In five studies patient files [15, 21, 24, 33, 41] were used to construct the vignettes.

The median number of vignettes was 25 (IQR 16–32), ranging from 3 to 130.

Authors used several response modes for the vignettes; in eight cases they used more than one response mode. In 23 cases authors used a rating procedure [15, 18, 20–25, 27–31, 34–37, 40–45], in which respondents had to rate the relative importance of a given vignette or assign a probability (n = 6) to a diagnosis or outcome [31–33, 35–37]. One study used a ranking design, in which respondents had to arrange the attributes in descending order of importance [24]. In six studies respondents could reply with a yes/no choice [27, 30, 31, 35, 38, 39]. One study used a conventional discrete choice mode, in which respondents, given two or more vignettes, had to select the one with the highest likelihood of postoperative recovery [26].
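
To make the discrete choice mode concrete, the following sketch analyses a hypothetical paired-comparison task. When each choice set contains exactly two vignettes, the conditional logit model reduces to a logistic regression, without an intercept, on the differences between the two vignettes' attribute values; all data, attribute counts and names below are invented for illustration.

```python
# Hypothetical paired discrete choice: the respondent picks one of two vignettes.
# With two alternatives per choice set, P(choose A) = logistic((x_A - x_B)' beta),
# so logistic regression on attribute differences (without a constant) recovers beta.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_sets = 500
beta = np.array([1.0, -0.5, 0.8])        # invented "true" attribute weights
x_a = rng.normal(size=(n_sets, 3))       # attribute values of vignette A
x_b = rng.normal(size=(n_sets, 3))       # attribute values of vignette B
diff = x_a - x_b

p_choose_a = 1.0 / (1.0 + np.exp(-diff @ beta))
y = rng.binomial(1, p_choose_a)          # 1 = vignette A chosen

fit = sm.Logit(y, diff).fit(disp=False)  # no intercept: only utility differences matter
print(fit.params)                        # estimates should approximate beta
```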

Analysis

Twenty studies (67%) did not correct for correlated data; only ten applied some statistical procedure to account for the correlation within the data [15, 22, 26, 32–37, 41].

Discussion

This review has two main findings. First, studies of medical choice and judgement are regularly used in the medical field to explore healthcare providers' decision behaviour or preferences. Second, we found a broad spectrum of different methods, and both design and analysis were suboptimal in some cases.

Cognitive burden/complexity

One quarter of the included studies either contained vignettes with more than nine attributes or compiled sets of over forty vignettes in the same experiment. From a cognitive psychological point of view both figures appear very high and could bias the results, either because respondents are unable to integrate and process large quantities of information presented simultaneously, or because respondents lose attention when sifting through too many vignettes. Empirical evidence that these figures are too high is scarce, however, and there is much opinion and controversy, particularly about the maximum permissible number of vignettes, but little rigorous evidence [46]. The available evidence suggests that more attributes, more choice options and more vignettes decrease response reliability but do not bias mean responses [46]. As a rule of thumb, the number of attributes per vignette should not exceed six to eight [47–49]. A re-analysis of 21 commercial studies suggests a maximum of 20 vignettes [48], and a review of discrete choice experiments evaluating healthcare shows that the number of vignettes seldom exceeds 16 [49]. Furthermore, the majority of studies used either a ranking or a rating response mode. These two modes imply very strong assumptions about human cognitive abilities, making it more likely that measures will be biased and invalid [50]. We therefore recommend the choice-based approach.
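
As a sketch of how these rules of thumb can be operationalised when constructing vignettes, the code below enumerates the full factorial of a small, invented set of clinical attributes and draws a subset compatible with the limits discussed above (six to eight attributes, roughly 16–20 vignettes). A random fraction is used only for simplicity; orthogonal or D-efficient designs would normally be preferred.

```python
# Invented example of constructing a vignette set within the limits discussed
# above: at most six to eight attributes and roughly 16-20 vignettes per respondent.
import itertools
import random

attributes = {                           # hypothetical clinical attributes and levels
    "age": ["<40", "40-65", ">65"],
    "fever": ["no", "yes"],
    "crp": ["normal", "raised"],
    "duration_days": ["<3", ">=3"],
    "comorbidity": ["none", "present"],
}
assert len(attributes) <= 8, "too many attributes per vignette"

full_factorial = [dict(zip(attributes, combo))
                  for combo in itertools.product(*attributes.values())]
print(f"full factorial: {len(full_factorial)} vignettes")  # 3 * 2 * 2 * 2 * 2 = 48

random.seed(0)
n_vignettes = 16                         # keep the respondent burden manageable
design = random.sample(full_factorial, n_vignettes)
for vignette in design[:3]:
    print(vignette)
```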

Validity, usefulness of study objectives

In contrast to applications in marketing research, where the main aim of a study is to identify opinions regarding a new product, we would be particularly interested to learn about the correctness of caregivers' weighting of the value of clinical information in decisions. While there is no normative benchmark for a "correct" product, there usually is one in medical judgement if clinical studies are available. For example, if a study of medical choice and judgement showed that physicians consistently attribute high weights to relatively uninformative laboratory tests while undervaluing the informativeness of cues from clinical examination, this would point to something that needs improvement, perhaps through an educational intervention. The method would also allow assessment of changes in preferences after such educational measures.

Most studies did not compare the attributed weights to some sort of normative benchmark such as the results of a clinical study. We found only one of the 30 studies that actually examined this, and another five that used a further normative reference (guidelines or patient files). In the absence of a normative benchmark these studies leave it to the reader to approve or disapprove of the results. Moreover, assessment of discrepancies between different groups of participants has the problem that these discrepancies could be explained by different clinical circumstances or other factors rather than genuine group-specific differences. On the other hand, there are medical situations in which views about optimal choices are controversial. In these situations, studies that do not compare caregivers' decision behaviour (or preferences) to some norm may still be useful in that they allow the examination of present opinions.

Statistical model

The majority of studies did not account for correlated data in the analysis. Correlated data arise because each respondent assesses several vignettes. Not accounting for this leads to underestimated standard errors for the attribute weights and can mimic a statistically significant association where in fact there is none. Unfortunately, guidelines on the conduct of conjoint analyses have not yet reached consensus about the optimal way to analyse correlated data.
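
A commonly used alternative to the cluster-robust adjustment sketched in the Methods section is a mixed-effects model with a random intercept per respondent, which models the within-respondent correlation explicitly. The sketch below shows this on the same kind of simulated long-format data; the variable names are again invented.

```python
# Simulated illustration of a random-intercept model: one intercept per respondent
# absorbs the correlation among that respondent's vignette ratings.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "respondent": np.repeat(np.arange(40), 16),
    "fever": rng.integers(0, 2, 640),
    "crp": rng.choice([10, 50, 100], 640),
})
df["rating"] = (2 + 1.5 * df["fever"] + 0.02 * df["crp"]
                + rng.normal(0, 1, 40)[df["respondent"]]  # respondent effect
                + rng.normal(0, 1, 640))                  # residual noise

mixed = smf.mixedlm("rating ~ fever + crp", data=df, groups=df["respondent"]).fit()
print(mixed.params)   # fixed-effect attribute weights; respondent variance in mixed.cov_re
```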

Limitations

What are the limitations of this review? First, although we consider the search and appraisal procedures reliable, some classifications were difficult to make because of unclear descriptions in the articles, and we did not contact authors to clarify these uncertainties. Second, as noted above, there are two prominent methods of constructing linear models of medical judgements, each with its own literature and set of advocates: conjoint analysis, developed in the 1970s to study preference and choice [1], and judgement analysis, also called social judgement theory, developed in the 1950s from Brunswik's lens model [2, 3]. In this review we did not distinguish between the two methods because there is substantial overlap in their methodology. Arguably this is a weakness of our study. However, since we were interested in providing an overview of all studies that examined caregivers' medical decisions by inferring cue weightings from answers to structured vignettes, whatever the specific method, we feel that our approach has its own merit.

Future research

Our review indicates that current applications of conjoint and judgement analysis in the medical field remain suboptimal in some instances. We think that researchers should consider the recommendations above to ensure internal validity. Moreover, we believe that studies investigating caregivers' judgements are most valuable if they allow comparisons with some norm and include an assessment of deviations from that norm. Our review found only a few such investigations. From a more methodological point of view, we agree with a recent editorial that research is required to learn whether individuals behave in reality as they state they would in a hypothetical context [51].

Conclusion

We believe that studies of medical choice and judgement offer many attractive and new insights into medical action. Provided that both methods and applications evolve, they offer a unique opportunity to improve the quality of care.