Key Points

The SF-6D version 2 (SF-6Dv2) was developed to address concerns with the SF-6D, and to use the SF-6Dv2 in economic evaluation, a UK utility value set has been developed.

The SF-6Dv2 is able to successfully distinguish between groups with known differences, such as disease and healthy groups, and outperforms the SF-6D in some but not all cases.

The SF-6Dv2 is also strongly correlated with other health-related quality-of-life instruments.

1 Introduction

The quality-adjusted life-year (QALY) can be used to measure outcomes in the economic evaluation of health interventions. A QALY combines the value of health-related quality-of-life (HRQoL) with the value of length of life into a single index number. Typically, the quality element of a QALY is derived from preference-based measures of health which generate a utility value of HRQoL. One such preference-based instrument is the SF-6D which was developed from the SF-36 [1]. The SF-6D describes health on six dimensions, including physical functioning, role limitations, social functioning, pain, mental health and vitality, on between four and six severity levels, thereby describing 18,000 health states. With the development of country-specific value sets, including in the UK, the SF-6D has become one of the most widely adopted generic preference-based measures of health in economic evaluation [2].

However, the SF-6D has been criticised on the basis of the ordering of severity levels of the physical functioning dimension, as the disparity between ‘a lot’ of limitations in moderate activities and ‘a little’ limitation in bathing and dressing lacks clarity. In addition, there are concerns surrounding the framing of dimensions, which may cause confusion during valuation; the vitality dimension is positively framed, while the other dimensions are negatively framed [2]. The sensitivity of the role dimension has also raised concerns due to claims of a ‘floor effect’, whereby many patients report the lowest severity level [3]. Furthermore, the valuation methods adopted for the SF-6D have raised two concerns: first, values were elicited using the cognitively demanding standard gamble approach; and second, a two-stage valuation process was adopted, which may have caused the impact of risk aversion to be double counted [2]. In response to these concerns, a new version of the SF-6D has been developed (SF-6Dv2) that addresses the problems with the descriptive system [2]. The SF-6Dv2 describes 18,750 possible health states. Moreover, a discrete choice experiment (DCE) with duration was used to value the SF-6Dv2 as opposed to standard gamble, which was used for the SF-6D [4]. Although standard gamble is an appropriate valuation approach, concerns have been raised about how well respondents understand the task.

The development of a new instrument raises questions in regard to its psychometric validity and the implications of its use in economic evaluation to inform Health Technology Assessment (HTA) [5]. An instrument’s validity refers to the extent to which it measures the construct it is intended to measure. One type of construct validity is convergent validity, which refers to the strength of the association between the instrument and other measures of related concepts. A second form of construct validity is known group validity, which refers to the degree to which an instrument is able to differentiate between groups where differences are expected a priori. At present, there is limited evidence on the performance of the SF-6Dv2, including when compared with that of the SF-6D and whether it addresses concerns with the latter measure. Whitehurst et al. provide the starting point for comparative work by analysing how the classification system of the SF-6Dv2 impacts the distribution of responses compared with the SF-6D [5]. The study identifies that the two classification systems lead to variation in the described levels of impairment, particularly for vitality and role limitations. These differences, along with different valuation approaches employed for the two versions, are likely to have an impact on the utility values. It is clear that the SF-6Dv2 has resulted in a wider scoring range (− 0.574 to 1) when compared with the SF-6D (0.301–1) [4]. Given that utility values are applied in the calculation of QALYs in economic evaluation, it is important to assess the impact of using the SF-6Dv2 compared with the SF-6D. Therefore, the aim of this study was to compare the psychometric performance of the SF-6Dv2 with the SF-6D in order to provide evidence of the comparability of and differences in the updated measure and the original measure for HTA. 
The specific objectives of this paper were to assess the relative performance of the SF-6Dv2 and SF-6D and explore the descriptive differences between utility values (from the UK value set) of the two instruments. In addition, the known-groups validity and the convergent validity of the SF-6Dv2 were evaluated. This evidence is likely to be of interest to agencies who recommend the use of the SF-6D [6].

2 Methods

2.1 Data Source

Data from the Multi-Instrument Comparison (MIC) project were used. The MIC study aimed to compare several HRQoL and well-being instruments across multiple disease areas (asthma, cancer, depression, diabetes, hearing problems, arthritis, heart disease, chronic obstructive pulmonary disease [COPD] and stroke). Data were collected online in six countries (Australia, Canada, Germany, Norway, the UK and the US). Additionally, the MIC included a sample of ‘healthy’ individuals with no self-reported health problems. We used data from seven disease areas, excluding the COPD and stroke groups due to small samples that were only drawn from Australia. Excluding these groups provided a total sample of 7932 individuals.

2.2 Measures

2.2.1 SF-6D Version 2 (SF-6Dv2) and SF-6D

Participants in the MIC study completed the SF-36 version 2 (SF-36v2), which is a measure of HRQoL. Both the SF-6Dv2 and SF-6D are derived from the SF-36v2, and the SF-6D classification system is derived from 11 of the 36 items from the SF-36v2 [1]. The SF-6D describes health on six dimensions (physical functioning, role limitations, social functioning, pain, mental health and vitality). Each dimension has between four and six severity levels, thereby describing 18,000 health states. A subset was valued using standard gamble by a UK sample, resulting in utility values ranging from 0.301 to 1.

The SF-6Dv2 is similarly derived from the SF-36v2 but there are multiple sources of variation from the SF-6D [2]. First, different questions from the SF-36v2 were used in the SF-6Dv2 compared with the SF-6D in the role limitations, pain and vitality dimensions (see Online Resource A). There were also differences in the number of severity levels for the physical functioning dimension, which has six levels in the SF-6D and five levels in the SF-6Dv2, and the role limitation dimension, which has four levels in the SF-6D and five levels in the SF-6Dv2. Furthermore, the level descriptions of the SF-6Dv2 were simplified and were made more consistent for clarity [2]. The SF-6Dv2 describes 18,750 possible health states. Online Resource A provides further comparisons of the classification systems of the SF-6Dv2 and SF-6D. DCE with duration was used to obtain utility values for the SF-6Dv2 value set [4]. Utilities of the SF-6Dv2 range from − 0.574 to 1.
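The health-state totals quoted above follow from multiplying the number of severity levels across the six dimensions. The sketch below reproduces that arithmetic. The level counts for physical functioning and role limitations come from the text; the counts for the remaining four dimensions are assumptions chosen to be consistent with the reported totals, and the names `SF6D_LEVELS`, `SF6DV2_LEVELS` and `count_health_states` are illustrative only.

```python
from math import prod

# Severity levels per dimension. Physical functioning and role limitations
# are stated in the text; the other four counts are ASSUMED values that are
# consistent with the reported totals of 18,000 and 18,750 states.
SF6D_LEVELS = {
    "physical_functioning": 6,
    "role_limitations": 4,
    "social_functioning": 5,
    "pain": 6,
    "mental_health": 5,
    "vitality": 5,
}

# The SF-6Dv2 differs in the two dimensions described in the text.
SF6DV2_LEVELS = {**SF6D_LEVELS, "physical_functioning": 5, "role_limitations": 5}


def count_health_states(levels):
    """Return the number of distinct health states a classification describes."""
    return prod(levels.values())


if __name__ == "__main__":
    print(count_health_states(SF6D_LEVELS))    # 18000
    print(count_health_states(SF6DV2_LEVELS))  # 18750
```

Removing one physical functioning level and adding one role limitations level changes the product from 6 × 4 to 5 × 5, which accounts for the increase from 18,000 to 18,750 states.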

2.2.2 Other Measures

Participants in the MIC also completed other generic preference-based measures, including the EQ-5D-5L [7] and the AQoL-8D [8]. The EQ-5D-5L has five dimensions: mobility, self-care, usual activities, pain/discomfort and anxiety/depression. Respondents are able to choose between five levels ranging from ‘no problems’ to ‘extreme problems/unable to’. The EQ-5D-5L can describe 3125 health states that are each associated with a utility score from the cross-walk [9] to the UK EQ-5D-3L value set, which ranges from − 0.594 to 1. The AQoL-8D is a 35-item, 8-dimension, multiattribute utility instrument. The dimensions include independent living, pain, senses, mental health, happiness, coping, relationships and self-worth. The AQoL-8D is an extension of two earlier instruments, the ‘AQoL’ (AQoL-4D) and AQoL-6D. Health states defined by the AQoL-8D are associated with a global utility score, which ranges from 0.105 to 1.

Participants with a health condition also completed an appropriate and validated condition-specific measure. The Depression, Anxiety and Stress Scale (DASS-21), for example, is a validated measure for those with depression that was developed by Lovibond and Lovibond, who also provide clinical cut-offs to define severity. DASS-21 scores are multiplied by 2 to obtain the total score, which therefore ranges from 0 to 42. We use the accepted clinical cut-offs for the depression domain total score, i.e. < 14, ≥ 14 to < 21 and ≥ 21, to define mild, moderate and severe groups, respectively [10].
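As an illustration of how the depression cut-offs translate into groups, the following minimal sketch doubles a raw DASS-21 depression subscale score and applies the three bands given above. The function name is hypothetical, and the raw range of 0 to 21 is an assumption inferred from the stated total range of 0 to 42.

```python
def dass21_depression_severity(raw_subscale_score):
    """Classify depression severity from a raw DASS-21 depression score.

    ASSUMPTION: the raw subscale score lies between 0 and 21, so that the
    doubled total lies between 0 and 42 as described in the text.
    """
    total = 2 * raw_subscale_score  # scores are multiplied by 2
    if not 0 <= total <= 42:
        raise ValueError("doubled DASS-21 depression score must be in 0..42")
    if total < 14:
        return "mild"
    if total < 21:
        return "moderate"
    return "severe"


if __name__ == "__main__":
    for raw in (6, 7, 10, 11):
        print(raw, 2 * raw, dass21_depression_severity(raw))
```

Because the doubled total is always even for integer raw scores, the moderate band covers totals of 14 to 20 in practice.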

Clinical cut-offs were available only for the DASS-21 measure of depression and were therefore used for that measure. Clinical cut-offs are not available for the remaining condition-specific measures, although, by nature, the scores of the measures inherently reflect condition severity. In the absence of clinical cut-offs, severity cut-off points were guided by the distribution of responses, the correlation of scores with alternative condition-specific measures and indicators, and mean scores among these correlated indicators. For example, for asthma, we assessed the correlation of the asthma-specific measure with specific questions surrounding breathing difficulties, and examined the mean and median scores of the measure used to define severity by the extent and severity of breathing difficulties. Furthermore, the MIC data also provide responses to a question about disease severity for each condition (rated on a 5-point scale from very mild to very severe) in the German population only. Bearing in mind that only a single population responded to this question, we used these responses to ensure that the severity cut-offs were broadly correlated with self-reported severity for each condition.

The Asthma Quality of Life Questionnaire (AQLQ-Sydney) [11] is an asthma-specific measure of HRQoL completed by MIC adult respondents with asthma. The asthma global score is calculated as the mean of all 20 items, which range from 0 to 5, and is provided in the MIC. We define mild, moderate and severe groups as those with global scores of < 1.5, ≥ 1.5 to ≤ 2.5, and > 2.5 respectively.

The European Organisation for Research and Treatment of Cancer Quality of Life Questionnaire–Cancer 30 (EORTC QLQ-C30) is a questionnaire developed by Aaronson et al. to assess the quality of life of cancer patients [12]. The calculation of a global mean score from all items is not recommended [13]. As opposed to using the full measure, severity is therefore defined by the response to a single question in the EORTC QLQ-C30 reflecting overall quality of life, albeit not cancer-specific: ‘How would you rate your quality-of-life health during the past week?’ Responses range from 1 (very poor) to 7 (excellent). Responses of > 5, ≥ 4 to ≤ 5, and < 4 are defined as mild, moderate, and severe, respectively.

Diabetes-39 is an instrument used to assess the quality of life of individuals with diabetes [14]. The MIC provides the additive global score, which is the sum of all 39 items, each with seven levels of severity, where higher scores indicate greater severity. The score therefore ranges from 39 to 273. We define whole scale scores of < 115, ≥ 115 to ≤ 160, and > 160, as mild, moderate, and severe diabetes, respectively.

The Abbreviated Profile of Hearing Aid Benefit (APHAB) [15] is used to assess hearing aid benefit and produces scores for aided and unaided performance for the hearing loss sample. We focus on unaided scores given the larger sample size; unaided responses reflect a patient’s difficulty when not using amplification. Severity of hearing loss is defined by the global unaided score, which is equal to the mean of all item scores in the ease of communication, background noise and reverberation subscale scores, and ranges from 1 to 99. Scores of < 40, ≥ 40 to ≤ 70, and > 70 define mild, moderate, and severe hearing loss, respectively.

The Arthritis Impact Measurement Scale 2-Short Form (AIMS2-SF) is a measure of health specifically for individuals with osteoarthritis [16]. The physical, symptom, affect and social subscale scores are summed to give the additive total score, which ranges from 38 to 115. Based on the distribution of the total score and correlations with the pain and discomfort dimension of the EQ-5D-5L, we define total scores of > 95, ≥ 82 to ≤ 95, and < 82 as mild, moderate, and severe arthritis, respectively.

The MacNEW heart disease health-related quality-of-life questionnaire is used to assess how daily activities and functioning are affected by coronary heart disease and its treatment [17]. The global score of the measure was calculated as the mean of all 27 items that each comprise seven levels of severity. Scores of > 5, ≥ 4 to ≤ 5, and < 4 denote mild, moderate, and severe heart disease, respectively.

2.3 Analysis

2.3.1 Descriptive Statistics of Utility Values

Statistical analyses were used to compare the measures in terms of the utility values they generated. Means, standard deviations and minimum and maximum values were estimated for each utility measure and presented for the sample overall, and by condition. The distribution of utility values of the SF-6Dv2 and SF-6D was examined and compared using density plots (kernel density) for the full dataset, as well as by the different condition groups.

Bland–Altman plots were employed to identify whether there was evidence of differences in values across the entire severity scale. These plots describe agreement between two quantitative measures and are presented as a scatter plot in which the x-axis represents mean SF-6Dv2 and SF-6D utility scores, while the y-axis shows the difference between SF-6Dv2 and SF-6D values [18].
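The quantities behind a Bland–Altman plot can be computed directly from paired utility values. The sketch below is illustrative rather than the analysis code used in the study: it returns the per-person means (x-axis), the per-person differences (y-axis), the mean difference and the limits of agreement at the mean difference ± 1.96 standard deviations. The function name and the sign convention (SF-6Dv2 minus SF-6D) are assumptions.

```python
from statistics import mean, stdev


def bland_altman(sf6dv2, sf6d):
    """Return the points and summary lines of a Bland-Altman plot.

    ASSUMPTION: differences are taken as SF-6Dv2 minus SF-6D; the text does
    not state the sign convention explicitly.
    """
    if len(sf6dv2) != len(sf6d) or len(sf6dv2) < 2:
        raise ValueError("need two paired samples with at least 2 observations")
    means = [(a + b) / 2 for a, b in zip(sf6dv2, sf6d)]  # x-axis values
    diffs = [a - b for a, b in zip(sf6dv2, sf6d)]        # y-axis values
    bias = mean(diffs)                                   # mean difference
    sd = stdev(diffs)                                    # sample SD of differences
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "lower_limit": bias - 1.96 * sd,
        "upper_limit": bias + 1.96 * sd,
    }


if __name__ == "__main__":
    # Made-up paired utilities purely for illustration.
    result = bland_altman([0.80, 0.60, 0.40], [0.70, 0.60, 0.50])
    print(result["bias"], result["lower_limit"], result["upper_limit"])
```

Points falling outside the two limits, or a trend in the differences along the x-axis, indicate that agreement varies across the severity scale.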

2.3.2 Known-Group Validity: Discrimination Across Different Groups

Known-group validity was examined by assessing the ability of the SF-6D and SF-6Dv2 to discriminate between respondents with different characteristics. This included the ability to distinguish whether a respondent had an existing health condition, as opposed to being in the ‘healthy’ group, and by severity within condition groups. We also examined groups based on age, as health is expected to deteriorate with age. Based on the distribution of age, four age categories (18–44, 45–54, 55–64 and 65 + years) were used. In addition, since data on smoking and education are also provided in the MIC data, we also examine groups based on these observable characteristics, as a relationship between these characteristics and health has been found elsewhere [19].

To evaluate the relative performance of the two SF-6D versions, mean differences and effect sizes (Cohen’s d) were estimated in order to compare groups. The effect size is estimated by dividing the difference in mean scores between two adjacent subgroups by the pooled standard deviation of scores. Effect sizes of ≥ 0.2 to < 0.5, ≥ 0.5 to < 0.8, and ≥ 0.8 denote small, medium, and large effect sizes, respectively [20]. To confirm differences, the statistical significance of differences between the known groups was tested using a t-test when comparing two groups or an overall F-test from an analysis of variance (ANOVA) when comparing more than two groups. In order to confirm the findings of the known-group validity analyses, we employed further tests suggested by Janssen et al., including the ratio of the area under the receiver operating characteristic curve (AUROC) and F-statistics from ANOVA of the SF-6Dv2 and SF-6D for the groups being compared [21].
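The effect size calculation described above can be sketched as follows. This is a minimal illustration, assuming the conventional pooled standard deviation weighted by degrees of freedom (the text does not specify the pooling formula). The function names and the "negligible" label for values below 0.2 are also assumptions.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation.

    ASSUMPTION: the pooled SD weights each sample variance by its degrees
    of freedom; each group needs at least 2 observations.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def effect_size_label(d):
    """Label an effect size using the thresholds quoted in the text."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"  # ASSUMED label; the text does not name this band


if __name__ == "__main__":
    # Made-up utilities for a 'healthy' group and a condition group.
    d = cohens_d([0.90, 0.85, 0.95, 0.80], [0.60, 0.70, 0.50, 0.65])
    print(round(d, 3), effect_size_label(d))
```

Because the pooled standard deviation is the denominator, a measure with larger mean differences can still yield a smaller effect size if its scores are more dispersed.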

2.3.3 Convergent/Divergent Validity

Ideally, we would like to assess the convergent validity of the SF-6Dv2 against a gold-standard validated measure of health that captures all potential health outcomes. Such a gold-standard external validator does not exist for HRQoL. We therefore examined the relationship between the SF-6Dv2 and the original SF-6D, as well as how it was related to EQ-5D-5L and AQoL-8D using correlations. Spearman rank correlations were used to assess the association between the dimensions of the measures since the data are ordinal. Pearson correlation coefficients were used to compare (continuous) utilities across measures. Correlations were assessed as ≥ 0.5, strong; < 0.5 to ≥ 0.3, moderate; and < 0.3, weak [20].
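The two correlation coefficients and the strength bands above can be illustrated with a short, dependency-free sketch. It is not the study's analysis code: the function names are hypothetical, Spearman's coefficient is computed as the Pearson correlation of average ranks (so ties in ordinal levels share a rank), and the strength label is applied to the absolute value of the coefficient, an assumption made so that negative coefficients are classified by magnitude.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_strength(r):
    """Label a coefficient using the thresholds quoted in the text."""
    size = abs(r)  # ASSUMPTION: strength is judged on magnitude
    if size >= 0.5:
        return "strong"
    if size >= 0.3:
        return "moderate"
    return "weak"


if __name__ == "__main__":
    # Made-up ordinal dimension levels for two instruments.
    levels_a = [1, 2, 2, 3, 4, 5]
    levels_b = [1, 1, 2, 3, 3, 5]
    rho = spearman(levels_a, levels_b)
    print(round(rho, 3), correlation_strength(rho))
```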

Higher correlations were expected between similar dimensions across the measures since they attempt to capture the same aspects of health; for instance, the physical functioning dimension of the SF-6Dv2 should be highly correlated with this same dimension measured by the SF-6D. In a similar manner, high correlations are expected between SF-6Dv2 physical functioning and EQ-5D mobility; SF-6Dv2 pain and EQ-5D pain/discomfort; SF-6Dv2 mental health and EQ-5D anxiety/depression; SF-6Dv2 mental health and AQoL mental health; and SF-6Dv2 pain and AQoL pain. Furthermore, the existing literature has highlighted common dimensions and has identified high correlations between these dimensions and the SF-6D dimensions [22, 23].

3 Results

3.1 Descriptive Statistics of Utility Values

In the MIC data, there were 436 unique utility values for the SF-6D, which is 2.4% of all 18,000 states and 11% of all 4000 unique possible values that the SF-6D may take [4]. Comparatively, 961 unique values were reported for the SF-6Dv2, which is 5.1% of all 18,750 states described and 6.4% of all 15,000 possible values. There were 160 unique values that were < 0 in the SF-6Dv2 (i.e. 17% of unique values defined in the MIC data).

With the exception of the ‘healthy’ and hearing problem subsamples, the mean SF-6Dv2 utility values were lower than the SF-6D utility values (Table 1). The largest difference in the SF-6D and SF-6Dv2 scores by disease group was observed in the depression group, while the smallest difference was in the hearing problems group, where the difference between the SF-6D and SF-6Dv2 was statistically insignificant in contrast to all other groups and in the full sample. The full utility value range was observed for all groups for the SF-6Dv2 (− 0.574 to 1) and the SF-6D (0.301–1), with the exception of arthritis (Table 1).

Table 1 SF-6Dv2 and SF-6D utility values by ‘healthy’ and disease groups

The SF-6Dv2 and SF-6D utility scores were negatively skewed, with a mean of 0.67 and 0.71 and a median of 0.77 and 0.7, respectively (Fig. 1). The distribution of scores among the disease group subsamples was similar to that of the full sample for some, but not all, disease groups. In particular, the depression and diabetes groups had a ‘flatter’ distribution of SF-6Dv2 utility scores in contrast to the SF-6D distribution, which was negatively skewed (Fig. 1). The Bland–Altman plot showed evidence of agreement between the utility values of the measures, particularly for higher utility scores (i.e. individuals reporting ‘better’ or full health) (Fig. 2). However, there was evidence of disagreement between values with low average scores, i.e. those with ‘low’ health and with an average utility score of approximately < 0.4.

Fig. 1 SF-6Dv2 and SF-6D scores of the full sample and by disease groups

Fig. 2 SF-6Dv2 and SF-6D Bland–Altman plot. The Bland–Altman plot shows agreement for higher values of SF-6Dv2 and SF-6D, but poor agreement for people with low index values of approximately 0.4 or less in both instruments. The red line indicates the mean difference and the black lines represent the mean difference ± 1.96 multiplied by the standard deviation of the differences

3.2 Known-Group Validity: Discrimination Across Different Groups

The mean SF-6Dv2 and SF-6D utility values for each known group are presented in Table 2 (and presented graphically in Online Resource B). The results from the analyses indicated that both the SF-6Dv2 and SF-6D had the ability to distinguish whether a respondent had an existing health condition, as opposed to being in the ‘healthy’ group (Table 2). The difference in means within groups was larger in the SF-6Dv2 than the SF-6D and all differences were statistically significant. The mean difference was greatest in the depression group (relative to the ‘healthy’ group) and smallest in the hearing problems group. Effect sizes overall were generally large, with the exception of hearing problems, where the effect size was 0.403 for the SF-6Dv2 and 0.453 for the SF-6D, and asthma, where the effect size was 0.75 for the SF-6Dv2 (Table 2). Owing to larger standard deviations across groups, SF-6Dv2 effect sizes were generally smaller than those of the SF-6D, with the exception of diabetes and arthritis.

Table 2 Known group validity

Both the SF-6Dv2 and SF-6D distinguished well between levels of severity within disease groups (Table 2). The significant differences in means were consistently larger in the SF-6Dv2 than the SF-6D across all disease groups. In the majority of diseases, the difference in mean SF-6Dv2 scores was greatest in the moderate versus severe groups as opposed to the mild versus moderate groups (Table 2). The exception to this was the hearing problem group. This was not always the case for the SF-6D utility values and, for some groups, the two versions did not agree on the relative order of the differences in means between the severity levels. For example, within the cancer group, the difference in the mean value of the SF-6Dv2 for the moderate versus severe groups was 0.304 and greater than the mild versus moderate groups, equal to 0.163; however, in the SF-6D, the largest difference in means was observed in the mild versus moderate groups (Table 2). The standard deviation of the SF-6Dv2 increases with severity across all disease groups; this pattern was not observed in the SF-6D for all diseases. With the exception of hearing problems, effect sizes were medium to large for both versions. In all but the hearing group, larger effect sizes were observed for the SF-6D than the SF-6Dv2 when comparing the mild versus moderate severity groups. The effect sizes of the SF-6Dv2 are greater than the SF-6D when comparing the moderate versus severe groups (with the exception of the hearing and asthma groups). The results also demonstrate that both the SF-6D and SF-6Dv2 were able to distinguish between other characteristics, including age, education and smoking status, with the SF-6Dv2 showing larger differences than the SF-6D, although effect sizes were small (Table 2). These findings are confirmed by the analysis of the AUROC and F-statistic ratios (available in Online Resource B).

3.3 Convergent Validity

Table 3 provides the results from the convergent validity analyses. The SF-6Dv2 and SF-6D utility values were strongly correlated (rho = 0.84) and equivalent dimensions had very strong correlations (0.74–0.93), with the exception of vitality, which had a correlation of 0.51 (Table 3). The SF-6Dv2 utility values were also strongly correlated with the EQ-5D utility score (rho = 0.811). Strong correlations were observed in the domains that we expected to be highly correlated, including between the SF-6Dv2 pain and EQ-5D pain and discomfort dimensions (0.764), the SF-6Dv2 mental health and EQ-5D anxiety and depression domains (0.628), and the SF-6Dv2 physical functioning and EQ-5D mobility (0.514). Further strong correlations were observed between the SF-6Dv2 physical functioning and EQ-5D usual activities and pain/discomfort domains. Weak correlations were identified between all SF-6Dv2 dimensions and EQ-5D self-care, while the SF-6Dv2 mental health dimension did not strongly correlate with any EQ-5D dimension, other than anxiety/depression (Table 3). SF-6Dv2 utilities were similarly strongly correlated with AQoL utilities (0.762) and strong correlations were observed across the majority of domains (56%). As expected, strong correlations were observed for equivalent dimensions on mental health and pain (− 0.697 and − 0.820, respectively). The SF-6Dv2 social functioning, mental health and vitality domains were strongly correlated with the majority of the AQoL domains. The AQoL senses dimension was weakly correlated with all SF-6Dv2 domains, while the SF-6Dv2 physical functioning domain was the dimension with the fewest strong correlations with AQoL dimensions (Table 3).

Table 3 Convergent validity

4 Discussion

This paper utilised the MIC data to compare the psychometric performance of the SF-6Dv2 with that of the SF-6D in terms of known group validity and convergent validity in a mixed sample of healthy individuals and individuals with specific conditions (asthma, cancer, depression, diabetes, hearing problems, arthritis and heart disease). Overall, we identified large differences in the absolute utility values and observed an increased range and variance of the SF-6Dv2 relative to the SF-6D. SF-6Dv2 utility values were generally lower than those from the SF-6D across disease groups, but were higher for the ‘healthy’ sample. Utility values in the health conditions ranged from 0.757 to 0.446 for SF-6Dv2 and 0.75 to 0.601 for SF-6D, with hearing having the highest utility values while depression had the lowest values for both measures. The wider range of values in the SF-6Dv2 reflects the wider range of values in the value set based on DCE with duration, which ranges between − 0.574 and 1, compared with the standard gamble value set for the SF-6D, which ranges between 0.301 and 1. This wider range generates large differences between those with and without a condition for the SF-6Dv2, which is likely to have an impact on QALY estimates. Although there were larger differences between utility values in the two versions for the ‘conditions’ groups, ranking across them was mainly consistent (hearing, asthma, heart disease, cancer, diabetes, arthritis and depression), although in the SF-6D ranking, diabetes precedes cancer; the difference is small and is therefore most likely insignificant. The standard deviations were larger for the SF-6Dv2, which led to marginally smaller effect sizes for most conditions, compared with the SF-6D. Larger standard deviations will have an impact on uncertainty, which will be evident in the sensitivity analysis of economic evaluations.

Similarly, although both measures were able to discriminate across severity groups, the SF-6Dv2 had larger differences compared with the SF-6D. For example, the difference between utility values for moderate and severe condition-specific levels of severity ranged between 0.289 and 0.362 for arthritis, cancer, depression and heart disease, whereas none of the differences were more than 0.11 for SF-6D. In some cases, there was a larger difference between utility values for the mild and moderate levels of severity for the SF-6D, e.g. for cancer, diabetes, arthritis and heart disease, although these differences were not large. This did not occur for the SF-6Dv2, apart from in hearing where, although both versions were able to discriminate between severity levels, the differences were not as large as for the other conditions. This reflects the generic nature of the SF-36v2, which does not include hearing as a dimension. These differences may reflect both the changes in choice of items for the classifier and the changes to the valuation approach. The items selected for SF-6Dv2 aimed to be consistent, especially with regard to role limitations (which included a more severe level of problems), vitality (which was previously a positively framed item) and physical functioning (where a severity level was removed). Whereas changes to the role limitations and vitality dimensions may have resulted in lower values, the change made to the physical functioning dimension is less likely to explain the observed differences.

The convergent validity analysis showed strong associations between the SF-6Dv2 and the SF-6D utilities and across equivalent dimensions. The exception was vitality, which is not unexpected since a negatively framed item was used in the SF-6Dv2 compared with a positively framed item in the SF-6D. In addition, the SF-6Dv2 was strongly associated with the EQ-5D utility score and the AQoL-8D utility score. Strong correlations were identified between dimensions where expected, for instance between the SF-6Dv2 pain domain and the pain domains of both the EQ-5D and AQoL, and the SF-6Dv2 mental health and EQ-5D anxiety/depression and AQoL mental health domains.

Newer versions of measures that aim to address previous limitations in the descriptive systems and valuation methods will have an impact on the utility values derived from preference-based measures, e.g. the changes in EQ-5D from the 3L to the 5L [24]. The differences in the SF-6D and SF-6Dv2 stem from, first, improvements made to the classification system of the SF-6D, which have resulted in variation in the described levels of impairment, although the evidence suggests that these changes have made little difference to the psychometric performance, and, second, the valuation approach employed, which was adapted to address concerns with the SF-6D. Given the small differences in effect sizes and F-statistics, the greatest contributing factor to the differences in utility values between the SF-6Dv2 and the SF-6D is likely to be the change in the valuation method to DCE with duration, which has been shown to generate lower utility values. The DCE with duration results make the SF-6Dv2 more consistent with the EQ-5D and other preference-based measures of health in terms of the range and values. This would seem to suggest that the variant of standard gamble with the chained procedure created an artificial floor effect, which is consistent with the finding of Tsuchiya et al. [25], who revalued the SF-6D states using time trade-off and identified that values were closer to the EQ-5D.

The study benefits from assessment of the SF-6D and SF-6Dv2 in a large international sample across multiple conditions and a number of other commonly used generic preference-based measures. The availability of condition-specific measures to judge the performance was also useful as it provides information on how well the two versions discriminate across severity groups. However, most of the measures do not have clinically accepted cut-offs. The approach to define groups based on distribution and an external measure of severity available for one of the countries aimed to provide a more objective choice for the cut-offs but these cut-offs were sample-dependent. Furthermore, although known group and convergent validity were assessed, the data were cross-sectional; therefore, the responsiveness and test–retest reliability of the SF-6Dv2 could not be assessed. Future work should examine the validity of the measure across different time points, settings and populations. The final impact of the changes from SF-6D to SF-6Dv2 would also need further testing in the context of trials as the impact may vary depending on the condition and impact of the intervention [26]. This study provides comparative evidence on the two versions of the SF-6D, which is important for decision makers looking to use either measure.

5 Conclusion

This study shows that the development of the new preference-based measure of health, the SF-6Dv2, and the production of its UK value set provided multiple sources of potential deviation from the original SF-6D utility values, which result in larger absolute differences between patient and healthy samples and between different patient severity groups. Due to these differences, the selection of the SF-6Dv2 over the SF-6D is likely to influence the QALY results and this has potential implications for the economic evaluation of health interventions. Future work should assess the impact in the context of economic evaluations.