Quality of Life Research

Volume 18, Issue 4, pp 509–518

The precision of health state valuation by members of the general public using the standard gamble


  • K. Stein
    • Peninsula Technology Assessment Group, Peninsula Medical School, University of Exeter
  • Matthew Dyer
    • Peninsula Technology Assessment Group, Peninsula Medical School, University of Exeter
  • Ruairidh Milne
    • University of Southampton
  • Alison Round
    • Peninsula Technology Assessment Group, Peninsula Medical School, University of Exeter
  • Julie Ratcliffe
    • University of Sheffield
  • John Brazier
    • University of Sheffield

DOI: 10.1007/s11136-009-9446-6

Cite this article as:
Stein, K., Dyer, M., Milne, R. et al. Qual Life Res (2009) 18: 509. doi:10.1007/s11136-009-9446-6



Precision is a recognised requirement of patient-reported outcome measures, but no previous studies have examined the precision of methods for obtaining health state values from the general public based on specific health state descriptions or vignettes. The methodological requirements of policy makers internationally are driving growth in the use of methods to obtain utilities from the general public to inform cost per quality-adjusted life-year (QALY) analyses of health technologies being considered for adoption by health systems.


Using an internet panel of members of the UK general public, the precision of five comparisons of the outcomes of treatments, based on health state descriptions, was assessed against the results of clinical trials which showed a statistically and clinically significant improvement. Health states were developed to depict the baseline and post-treatment states from these exemplar clinical trials. Preferences for health states were obtained using bottom-up titrated standard gamble over the internet, and differences between summary health state values corresponding to the treatment and comparator groups within each exemplar study were compared. Results are considered in the context of various estimates for the minimally important difference in utility values.


Participation among members of the internet panel in the five exemplars ranged from 27 to 59. In four of the five exemplars, the utility-based estimates of treatment benefit showed significant differences between groups and were greater than an assumed minimally important difference of 0.1. Mean utility differences between groups were: 0.23 (computerised cognitive behavioural therapy for depression, P < 0.001), 0.11 (hip resurfacing for hip osteoarthritis, P < 0.001), 0.0005 (cognitive behavioural therapy for insomnia, P = 0.98), 0.15 (pulmonary rehabilitation for COPD, P < 0.001) and 0.11 (infliximab for Crohn’s disease, P < 0.001). The confidence intervals around the estimates of utility-based treatment effect in three of the five examples did not exclude the possibility of a difference smaller than a minimally important difference of 0.1. Recent empirical evidence suggests a lower minimally important difference (0.03) may be more appropriate, in which case our results provide further reassurance of preservation of precision in health state description and valuation.


The precision of estimates of treatment effects based on preference data obtained from disease-specific measurements in clinically significant studies of health technologies was acceptable using an internet-based panel of members of the general public and the standard gamble. Definition of the minimally important difference in utility estimates is required to adequately assess precision and should be the subject of further research.




Precision [1, 2], sometimes called sensitivity [3, 4], is the extent to which numerical values generated by an outcome measure relate to the underlying distributions of patients’ experience of health status. Precision is important in measures used in clinical trials to identify important differences between treatments being assessed, although it may be limited by a range of measurement challenges including: choice of items in the scale, response categories and ceiling or floor effects.

The concept of precision is also applicable to preference-based measures of outcome, whether these are provided by patients themselves, proxies or members of the general public. There has been substantial growth in interest in the last of these categories in many parts of the world, following increased use of cost-effectiveness analyses that employ the quality-adjusted life-year (QALY) as an outcome metric. Most international guidelines for the preparation of such analyses stipulate that the quality weights in calculating QALYs should be obtained from members of the general public, i.e. non-patients. This creates a two-stage challenge to analysts: to identify and describe relevant health states from patients and then to obtain the values of non-patients to be applied to these in calculating QALYs.

Measures such as the EuroQol instrument (EQ-5D) or the Health Utilities Index (HUI) provide simple generic descriptive systems for which values have been obtained from large, representative general population samples [5, 6]. However, these measures have not been applied ubiquitously in studies of new or existing health technologies. There is therefore increasing interest in measuring the effects of treatments in terms of preferences which can be used to estimate benefits in terms of QALYs. There are, broadly, three approaches which may be tractable in the often limited time available to analysts carrying out evaluations to inform national policy: mapping between patient-reported outcome measures used in clinical trials and public preferences for health states (e.g. as known through the HUI or EQ-5D); creation of new preference-based instruments from existing patient-reported measures; and using vignettes to describe key health states required in economic evaluations of competing treatments. In each case, the objective is the same: to characterise the differences shown on a non-preference-based measure in terms of public stated preferences for relevant differences in health status as demonstrated in clinical trials.

In this study we explore the precision of a vignette-based approach to characterise the benefits apparent from clinical trials (using non-preference-based patient-reported outcomes) that might inform cost per QALY analyses. In other words, we ask whether important differences in health status shown in clinical trials, when presented to the public as health state descriptions, result in clear and important differences in estimates of utility.

It is therefore important to consider the extent to which a difference, on either a preference- or non-preference-based measure, is clinically important. There has been considerable interest in defining minimally important differences for a range of outcome measures. Although the literature exploring this in terms of gains in utility remains limited, our findings are discussed in this context.

The context for the present study was a larger project which explored using the internet to obtain health state values from members of the public as a means of rapidly meeting the stated requirements of national policy makers [7].


The Value of Health Panel Project, within which the current study was nested, was established as a “standing group” of members of the public who provided their values on written health state descriptions presented via the internet (www.valueofhealth.org).

An interactive website was used to elicit values from respondents using a bottom-up titration variant of the standard gamble technique. Evidence on search procedures in the standard gamble does not indicate a preferred method. The iterative approach (“ping pong”) has been shown to result in lower utilities than titration in several studies [8–10]. However, Hammerschmidt et al. [11] report lower utilities with top-down titration and demonstrated no consistent or significant differences between utilities obtained by top-down and bottom-up variants. Brazier and Dolan [12], in a study of titration with self-completion versus iteration with face-to-face interviews, showed that iteration may be more prone to anchoring bias and that some respondents found it confusing and reported that it encouraged them to take risks. This finding informed our decision to use titration in this study, given that respondents would be carrying out the procedure on their own.
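The general idea of a bottom-up titrated standard gamble can be illustrated with a minimal sketch. It is not the procedure implemented on the project website: the 5-point step, the midpoint rule for the indifference point and the `prefers_gamble` callback are all assumptions made here for illustration. The respondent is offered a gamble with a gradually increasing chance of full health (otherwise death) until the gamble is preferred to remaining in the health state.

```python
from typing import Callable


def bottom_up_titration(prefers_gamble: Callable[[int], bool], step: int = 5) -> float:
    """Estimate a utility in [0, 1] by stepping the chance of success upwards.

    prefers_gamble(p) is a hypothetical callback: it returns True when the
    respondent would accept a gamble with a p% chance of full health (and a
    100 - p% chance of death) rather than remain in the described state.
    The step size and the midpoint rule are illustrative assumptions.
    """
    last_rejected = 0
    for p in range(step, 101, step):  # offer step%, 2*step%, ... up to 100%
        if prefers_gamble(p):
            # Indifference is taken to lie midway between the last rejected
            # offer and the first accepted one.
            return (last_rejected + p) / 2 / 100
        last_rejected = p
    return 1.0  # the gamble was never accepted


# Simulated respondent who accepts the gamble once the chance exceeds 72%.
utility = bottom_up_titration(lambda p: p > 72)
print(utility)
```

With this simulated respondent, offers up to 70% are rejected and 75% is the first accepted, so the sketch places the indifference point midway between the two.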

The Value of Health Panel was recruited from a sample of the UK adult general public, identified from electoral registers in four cities (Exeter, Sheffield, Glasgow, and Aberdeen) with sampling stratified by socioeconomic status. A total of 5,320 people were approached, of whom 2.1% expressed interest in participation. This very low level of response is similar to other internet-based preference studies [13]. Those interested in participation, and able to attend for training, were familiarised with the project and the method of preference elicitation in a series of workshops held in each city. A total of 112 people were trained in carrying out the standard gamble and constituted the final panel. The panel was not representative of the UK general population, having higher proportions of people in middle age, married people and those who had completed higher education, and lower proportions of people from socioeconomically deprived areas and from ethnic-minority populations. Recruitment and retention in the panel have been reported in detail elsewhere [7].

For the evaluation of precision, five sets of health state descriptions were developed and presented to the panel. These were derived from the results of studies of interventions which (a) used patient-based outcome measures suitable for description as health states and (b) demonstrated statistically and, in the clinical trial authors’ opinion, clinically significant treatment effects.

The following exemplars were used to investigate the precision of health state description and valuation: computerised cognitive behavioural therapy (CBT) for depression, metal-on-metal hip resurfacing for osteoarthritis (OA), CBT for insomnia, pulmonary rehabilitation for chronic obstructive pulmonary disease (COPD) and infliximab for Crohn’s disease. Health state descriptions were developed by two clinical researchers (K.S. and M.D.) and presented as a series of bullet point statements in the second person. State descriptions were placed on the project website (www.valueofhealth.org) for periods of 3 weeks with email reminders to participants to encourage valuation, and provision of a small financial lottery as incentive. The methods used in developing health state descriptions to reflect the results of exemplar studies are now outlined, with additional details and the vignettes shown at Appendix 1.

The computerised CBT for depression exemplar was based on the parallel group randomised controlled trial (RCT) by Selmi et al. [14]. This showed a significant difference of 7.5 points between treatment and control groups on the Beck Depression Inventory (BDI) [15]. BDI is commonly categorised into four grades of severity: minimal (1–13 points), mild (14–19 points), moderate (20–28 points) and severe (>28 points). At baseline in Selmi et al., both groups had moderate depression (22 points). After treatment, both groups improved, with the CBT group being classified as having minimal depression (10.3 points) and the control group as having mild depression (18.5 points). Health states were therefore developed for the BDI categories of moderate, mild and minimal depression, targeting the mid-point in the ranges within each BDI category. The vignettes were reviewed by a consultant psychiatrist and a family physician, who confirmed that they depicted the target condition and appropriately described levels of severity.
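The mapping from trial BDI scores to the severity bands quoted above can be sketched as follows. The band boundaries are those stated in the text; the treatment of fractional group means that fall between bands (for example 13.5) and the function name `bdi_category` are assumptions made for illustration.

```python
def bdi_category(score: float) -> str:
    """Classify a BDI score using the severity bands quoted in the text.

    Fractional scores between bands are assigned to the higher band, which is
    an assumption of this sketch and is not specified in the source.
    """
    if score <= 13:
        return "minimal"
    if score <= 19:
        return "mild"
    if score <= 28:
        return "moderate"
    return "severe"


# Group means reported for Selmi et al. in Table 1.
for label, score in [("CCBT baseline", 22.9), ("Control baseline", 21.4),
                     ("CCBT follow-up", 10.3), ("Control follow-up", 18.5)]:
    print(label, bdi_category(score))
```

Applied to the group means in Table 1, the sketch places both groups in the moderate band at baseline, the CBT group in the minimal band and the control group in the mild band after treatment, in line with the description above.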

Vignettes for the exemplar of metal-on-metal hip resurfacing for osteoarthritis (OA) were developed from an uncontrolled study by McMinn et al. [16] using the Charnley Hip Score, which includes items on pain and mobility [17]. Patients describe their condition by indicating which statement best describes themselves. Baseline average score (measured for three variants of the hip prosthesis) in McMinn et al. was 11 of a possible 12. Following treatment there was significant improvement to 6 of 12.

The OA vignettes were reviewed by three people with the condition, identified through a UK charity (the Arthritis Foundation), using a simple visual analogue scale (VAS) for credibility, anchored on “0 = describes condition very poorly” and “100 = describes condition very well”. Average response was 53%, with the postoperative vignette being rated at 33% and baseline 68%.

In the example of CBT for insomnia, vignettes were based on the Pittsburgh Sleep Quality Index (PSQI), as employed in a parallel group RCT by Morgan et al. [18] of CBT in people on long-term hypnotic use. The PSQI includes domains of: subjective sleep quality, sleep latency, sleep duration, habitual sleep efficiency, sleep disturbances, use of sleeping medication and daytime dysfunction [19].

Statements were developed to reflect the adjusted mean PSQI scores at baseline and at 12 months' follow-up from the treatment group in Morgan et al., using the descriptors of severity used in the PSQI. In this example, there was no significant difference between the baseline (PSQI 12.3) and follow-up states (PSQI 12.7) in the control group. The analysis of precision in this case was therefore calculated as the difference between baseline and follow-up in the CBT group.

The COPD vignettes were based on the Chronic Respiratory Disease Questionnaire (CRDQ), as used in a trial of community-based pulmonary rehabilitation [20]. This demonstrated a 26.9-point absolute difference between groups. At baseline, the two groups (usual care and rehabilitation) were not balanced. Therefore four health states were developed to depict baseline and post-treatment. The CRDQ includes items on dyspnoea, fatigue, emotional impact and mastery of breathing [21]. The vignettes were reviewed by two consultant chest physicians. They rated the credibility of the vignettes as representations of the target condition as low, with a mean of 31.5% using a VAS anchored as described above.
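Because the groups were not balanced at baseline, the between-group effect is most naturally read as a difference in change scores. The short calculation below, using the CRDQ group means reported in Table 1, appears consistent with the 26.9-point figure; treating that figure as a difference in changes from baseline is our interpretation rather than something stated explicitly in the trial report.

```python
# CRDQ group means as reported in Table 1.
rehab_baseline, rehab_followup = 53.6, 90.0
control_baseline, control_followup = 63.6, 73.1

# Change from baseline within each group.
rehab_change = rehab_followup - rehab_baseline
control_change = control_followup - control_baseline

# Difference in change scores between groups.
between_group_difference = rehab_change - control_change
print(round(between_group_difference, 1))  # 26.9
```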

The Crohn’s disease exemplar was based on ACCENT I, an RCT of infliximab which used the Inflammatory Bowel Disease Questionnaire (IBDQ) as the outcome measure [22, 23]. The outcome of interest was the proportion of patients achieving remission (IBDQ score of 170, symptoms “hardly any of the time”) from a baseline score of 127. Baseline and remission state descriptions were therefore developed.

Depression, osteoarthritis and insomnia were identified from a review of studies published by the UK National Health Service (NHS) Health Technology Assessment (HTA) programme [24]. COPD was identified as a topic for precision analysis from review of papers published in the British Medical Journal in 2004.

Analyses were carried out in SPSS (version 11). For the three sets of descriptions based on RCTs, the utility estimates for the differences between baseline and follow-up were compared across treatment groups using the paired t-test and Wilcoxon signed-rank test. In the trial of infliximab for Crohn’s disease the outcome was the proportion of patients achieving remission, so precision is estimated using the difference in this proportion between baseline and remission states.
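The paired comparison can be illustrated with a minimal sketch of the paired t statistic. The analyses reported here were run in SPSS; the code below is only an illustration, and the utility values in it are hypothetical, not panel data. Each respondent contributes a utility gain for the treatment group states and one for the control group states, and the test is applied to the within-respondent differences.

```python
import math
import statistics


def paired_t(x, y):
    """Return (mean difference, t statistic, degrees of freedom) for paired data."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference, using the sample standard deviation.
    se = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff, mean_diff / se, n - 1


# Hypothetical utility gains (follow-up minus baseline) for five respondents.
treatment_gain = [0.30, 0.25, 0.35, 0.20, 0.30]
control_gain = [0.05, 0.05, 0.10, 0.05, 0.05]

mean_diff, t_stat, df = paired_t(treatment_gain, control_gain)
print(round(mean_diff, 2), round(t_stat, 1), df)
```

Turning the t statistic into a P value requires the t distribution, which is not in the Python standard library; `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` are one option for the parametric and non-parametric tests, respectively.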


Table 1 shows the results for the analysis of precision. The outcome measures, baseline and follow-up data for the clinical trials which formed the basis for the analysis are shown. Mean and median preferences are reported for the differences in health states corresponding to comparisons relevant to each clinical trial.
Table 1. Results of precision analysis

Health technology | Clinical trial: outcome measure | Clinical trial: baseline | Clinical trial: follow-up | Value of Health Panel: difference in mean utility between groups (95% CI) | Value of Health Panel: difference in median utility between groups
CCBT for depression | Beck Depression Inventory | CCBT: 22.9; Control: 21.4 | CCBT: 10.3; Control: 18.5 | 0.23 (0.16–0.30); P < 0.001 (a) | P < 0.001 (b)
Hip resurfacing for osteoarthritis | Charnley Hip Score (CHS) | | | 0.11 (0.08–0.15); P < 0.001 (a) | P < 0.001 (b)
CBT for insomnia | Pittsburgh Sleep Quality Index | CBT: 12.8; Control: 12.3 | CBT: 9.5; Control: 12.7 | 0.0005 (−0.03 to 0.03); P = 0.98 (a) | P = 0.02 (b)
Pulmonary rehabilitation for COPD | Chronic Respiratory Questionnaire | Rehab: 53.6; Control: 63.6 | Rehab: 90.0; Control: 73.1 | 0.151 (0.08–0.22); P < 0.001 (a) | P < 0.001 (b)
Infliximab for Crohn’s disease | Inflammatory Bowel Disease Questionnaire | Baseline = 127 | Remission = 170 | 0.11 (0.06–0.17); P < 0.001 (a) | P < 0.001 (b)

(a) Paired t-test
(b) Wilcoxon signed-rank test

In four of the five examples, significant differences in outcomes demonstrated in the original trial were preserved in the values obtained from the panel. In these four, the estimated difference in utility attributable to treatment was greater than 0.1, which was taken as being clinically significant, following the basis for the sample size calculation undertaken to estimate values for the EQ5D [25].


This is the first evaluation of the precision of valuations by the general public of hypothetical health state descriptions based on disease-specific quality-of-life measures.

In the case of insomnia, the difference in utility between treatment groups was very small and is probably not clinically significant. There are several points worthy of note about this example. In the original trial, the control group worsened slightly over the course of the trial (difference in PSQI = 0.4) but this difference was not captured in the health state descriptions: it was assumed that the control group did not change over the course of the trial. It is also likely that the process of health state description development resulted in some attenuation of the difference in PSQI scores between baseline and follow-up. In the PSQI, two domains were excluded from health state descriptions. The habitual sleep efficiency domain is derived from responses to other items in the scale and could therefore not be included. Overall sleep quality is assessed in the PSQI using one question (“During the last month how would you rate your sleep quality overall?”). This item was excluded as it represents a subjective summary of sleep quality which the other items included in the description are intended to evoke in panel respondents.

The health state descriptions are very similar, although differing in four of the five statements (Table 2).
Table 2. Insomnia health state descriptions

Baseline | Following CBT
• It usually takes you between 30 and 60 min to fall asleep at night but sometimes you lie awake for less than 30 min | • It usually takes you between 30 and 60 min to fall asleep at night but often you lie awake for less than 30 min
• Usually you sleep for less than 5 h a night | • Usually you sleep for 5–6 h a night
• Less than once a week your sleep is disturbed and you wake up at night | • Less than once a week your sleep is disturbed and you wake up at night
• You have to take medication once or twice a week to help you sleep | • You have to take medication less than once a week to help you sleep
• During the day you sometimes lack energy and enthusiasm to get things done and two or three times a month you have difficulty staying awake when driving, eating or socialising | • During the day you sometimes lack energy and enthusiasm to get things done and less than once a week you have difficulty staying awake when driving, eating or socialising

The clinical trial of CBT for insomnia used a range of outcome measures (SF36, SF6D and Hospital Anxiety and Depression Scale). Statistically significant improvements were shown in the SF36 domains of vitality, physical functioning and mental health between treatment and control groups at follow-up. These factors are not captured by either the PSQI or the health state descriptions, but may be important for assessment of utility. Indeed, the authors of the HTA on CBT used the SF6D to estimate utility (and hence cost utility) associated with CBT and found a difference between groups of 0.12. Whether this utility could have been demonstrated in the panel is unclear, although it seems likely that the consequences of the symptoms described using the PSQI on more generic aspects of quality of life such as are measured using the SF36 (and SF6D) may not have been apparent to respondents. In this respect, the example demonstrates possible limitations on using disease-specific quality-of-life measures as much as the potential for introducing imprecision in the process followed here.

Our evaluation of precision is limited by sample size, which was determined by the number of panel members who contributed to the study and which, not surprisingly, fell over the 18-month duration of the project. It may also be limited by the approach taken to validation of the health state descriptions developed. Validating descriptions with people who had the conditions of interest was initially considered to be an important aspect of the study, and the vignettes depicting hip osteoarthritis were presented to members of a UK charity for this condition. However, there were considerable practical difficulties in identifying patient respondents and, more importantly, it was not clear whether the respondents identified had experience of the health state that was being depicted. For this reason, despite the low ratings of credibility by patient representatives, the vignettes were not adjusted. We subsequently turned our attention to using clinicians to rate the credibility and comprehensiveness of health state descriptions. However, concerns may be raised about clinicians' experience with the particular health state, and their ability to act as a proxy for patients. More fundamentally, where a health state description is based on a patient-reported outcome measure which has good validity and depicts the precise state of interest, an argument can be made against clinician validation of health states. Where clinicians alter a description it may become a less valid representation of the health state of interest. This argument is stronger where descriptions are based on the analysis of individual patient data than here, where summary measures were used and item reduction was based on clinical judgement.

A further limitation of the precision analysis is that, although the authors of the clinical trials that provided the exemplars asserted that the differences shown were clinically significant, it is not clear whether the changes observed in the exemplar trials exceeded the minimally important difference (MID) for each measure. This notion of MID also applies to the level of utility difference which is considered to be clinically significant. The choice of 0.1 was based on the sample size calculation used in the MVH study, although this was arrived at pragmatically rather than by any consideration of the importance of any utility interval [25].

In the context of cost per QALY analyses, the notion of a minimally important difference is complicated by the relationship between outcomes, costs and willingness to pay for an additional QALY [26]. Nevertheless, when considering the precision of utility measures (such as EQ5D, SF6D or HUI), or as here, a process for estimating utility based on descriptions of patient-reported outcomes, it is important to consider whether the process of measurement can detect changes that are important, i.e. minimally important differences (MID).

MID is defined as “the smallest difference in score in the domain of interest which patients perceive as beneficial and which would mandate, in the absence of troublesome side effects and excessive cost, a change in the patient’s management” [27]. The concept has been extended to include consideration of the balance between benefits and harms [28], and other definitions exclude mention of costs [29].

There is no gold standard for estimating the MID for a measure. Broadly, the MID may be calculated using anchor-based or distributional approaches [29]. Only two studies have attempted to calculate MID in terms of health state values, one carried out using the SF6D, a preference-based measure derived from the SF36 [30], and the other based on the HUI2 [31]. For the SF6D, an anchor-based approach was based on the global health question in SF36 (not included in the SF6D). Nine longitudinal studies which had observed a change in status equivalent to the MID for the SF36 were analysed in terms of the SF6D. The mean MID across the various groups studied was around 0.03 [30]. The study based on the HUI2, carried out in a single group of patients with hydrocephalus, reported the same figure [31].

If 0.03 were taken as the MID in utility terms, then conclusions based on the central estimates in the present study would remain unchanged, but the confidence with which the results could be said to demonstrate acceptable precision would improve, because more of the confidence intervals would exclude differences smaller than the MID. This interplay between MID and sample size is therefore important.
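The effect of the choice of MID can be checked directly against the estimates and 95% confidence intervals in Table 1. The sketch below lists, for a given MID, the exemplars whose point estimate exceeds it and those whose lower confidence limit exceeds it. The helper name `exceeds_mid` and the use of strict inequalities are assumptions made for illustration.

```python
# (mean difference, lower 95% limit, upper 95% limit) from Table 1.
results = {
    "CCBT for depression": (0.23, 0.16, 0.30),
    "Hip resurfacing for osteoarthritis": (0.11, 0.08, 0.15),
    "CBT for insomnia": (0.0005, -0.03, 0.03),
    "Pulmonary rehabilitation for COPD": (0.151, 0.08, 0.22),
    "Infliximab for Crohn's disease": (0.11, 0.06, 0.17),
}


def exceeds_mid(mid):
    """Return exemplars whose estimate, and whose lower limit, exceed the MID."""
    by_estimate = [name for name, (est, low, high) in results.items() if est > mid]
    by_lower_limit = [name for name, (est, low, high) in results.items() if low > mid]
    return by_estimate, by_lower_limit


for mid in (0.1, 0.03):
    by_estimate, by_lower_limit = exceeds_mid(mid)
    print(mid, len(by_estimate), len(by_lower_limit))
```

On these figures, four point estimates exceed either threshold, but only the depression exemplar has a lower confidence limit above 0.1, whereas four exemplars have a lower limit above 0.03.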

Although our analysis of precision is limited by the small number of comparisons undertaken, and by analytic imprecision due to the limited number of participants, the findings demonstrate the value of the approach taken and are generally reassuring. In the majority of cases the processes of health state development and valuation did not result in the underlying “signal” of a treatment effect on the disease-specific measure being lost in the “noise” of these processes. It is clear that this occurred in the insomnia example, but probably not in the rest. Assessment of precision in this way is a useful addition to the steps that might be taken to assess the value of using vignette-based approaches to utility estimation. Further work is required to specify MID in relation to utility values and to consider the influence of procedural factors on the precision of preference differences based on health state descriptions, for example through the use of different methods for preference elicitation in precision analyses.


The precision of estimates of treatment effects based on preference data obtained from disease-specific measurements in clinically significant studies of health technologies was acceptable. Definition of minimally important difference in utility estimates is required to adequately assess precision and should be the subject of further research.



NHS R&D Programme; National Institute for Health and Clinical Excellence (NICE); NHS Quality Improvement Scotland (NHSQIS). We are extremely grateful to the following for their help: the members of the internet panel, the patients and clinicians who provided help in the development of health state descriptions, Joanne Perry for her project support, Dan Fall (University of Sheffield) and Stephen Elliott (Llama Digital) for website development.

Competing interests


Authors’ contributions

K.S., R.M., J.B. and A.R. conceived the study and, with J.R., designed the evaluation. M.D. developed some of the health state descriptions and contributed to data collection. All authors contributed to the drafting of this report.

Copyright information

© Springer Science+Business Media B.V. 2009