Background

Health-related quality of life (HRQL) has become important in fields such as health policy, clinical practice and health outcome evaluation [1–6]. The use of outcome models combining HRQL with morbidity or mortality is increasing. This has led to the development of a series of measures combining HRQL and traditional outcome indicators, such as the quality-adjusted life year (QALY) [7–9]. HRQL preference/utility-based measures are also increasingly used to obtain QALYs [10, 11].

Since there are many preference/utility-based instruments, there is a general demand for head-to-head comparison studies between different scores [12]. Many of the studies carried out have focussed on the SF6D, a new tool representing the link between psychometric and preference/utility-based measures. Most investigations comparing SF6D with other widely used measures (e.g., EQ5D or HUI) have obtained contradictory findings. In fact, it has been argued that, since these measures have different scoring functions and define a different number of health levels, it is not clear whether their results can be compared. SF6D is more sensitive than EQ5D and HUI3 in healthy people, and in detecting small health changes, especially at the top of the scale [11–14]. SF6D also shows a more continuous and normal value distribution than EQ5D and HUI3, which often have very skewed distributions or distributions clustered around a few values. Indeed, EQ5D tends to suffer from clustering and a ceiling effect in patients with middle-severity health status [15]. On the other hand, SF6D produces higher values than EQ5D or HUI3 at the lower end of the scale. Thus, SF6D is likely to overestimate the values of very impaired health status [16–22]. Since a recent article [23] reported evidence that SF6D can describe poor health status, it is not yet clear whether SF6D has greater sensitivity and accuracy in the description of health, or advantages compared with other widely used, shorter measures, such as EQ5D [20, 24].

EQ5D and HUI3 have only been compared in a limited number of studies. Findings suggest that they give similar results, though HUI3 discriminates better between lower levels of impairment [11, 20, 25, 26].

Although a certain degree of correlation has been found between these measures, which seem to assess a similar essential construct, it is still not clear whether different measures give similar results for HRQL. Previous studies underlined that it is difficult to establish whether one tool is better than another because all have strengths and weaknesses. However, the need for a standard measure is increasingly felt because QALYs calculated with different tools have been seen to give different results [11, 16, 19]. Further research is also necessary because many of the studies carried out so far suffered from limitations, such as small sample sizes [16, 17, 21], or limited generalisability because they focussed on a single population of patients with a very specific disease [18–20, 22, 26]. Many of these studies [13, 17, 18] only compared two tools, while a broader examination of several different measures could be useful. Finally, few studies have investigated the influence of socio-demographic or morbidity factors on the performance of different measures [27]. This aspect could be important for two reasons: it could affect the comparability of tools, and it could highlight which tools are most sensitive for distinguishing health differences between populations or groups within populations.

The aim of this study was to compare three distinct HRQL measures, SF6D, HUI3 and EQ5D, in a population not affected by specific diseases, looking in particular at:

  1. formal agreement and correlation between measures;

  2. similarities and differences in the way measures are independently related to a range of self-reported socio-demographic and morbidity measures.

Methods

Study sample

Cross-sectional data were collected on a sample of patients attending general practices in two Italian cities (Turin and Siena) from May 2003 to April 2004. Turin is a large industrial city (population 1,000,000), whereas Siena is a town (population 50,000) with little industry. They were chosen because a further aim of our study, not described in this paper, was to describe the HRQL of GP patients in different environments for local health policy purposes. All patients attending the general practices during the study period were eligible. The GPs were recruited with the collaboration of the Primary Health Care Unit of the Local Health Authority 2, ASL TO 1 of Turin and the Siena division of the Italian Federation of General Practitioners (Federazione Italiana Medici di Medicina Generale). Seven GPs in Turin and nine in Siena agreed to participate. In the end, 467 patients were enrolled in Turin and 544 in Siena, making a total of 1,011 patients. There were no stated patient exclusion criteria, but children (<16 years old) and severely ill (hospitalised or bedridden) patients were not recruited because of the type of patients attending the GPs.

Study procedures

The GPs were recruited, and the licenses to use the questionnaires and the algorithms for calculating the scores were acquired. HRQL data were collected using the Italian version of the three questionnaires: SF36, HUI3 and EQ5D. They were administered sequentially to the patients at the general practices. The patients were informed about the study by the GPs and were asked to participate. Written informed consent was obtained from all who accepted. Refusals were not recorded. The following information was also obtained from each patient: date of birth, education, marital status (married, not married, divorced/widowed), smoking status (non-smoker, current smoker, ex-smoker), height and weight to calculate BMI, and any history of hypertension, cardiovascular diseases, diabetes or musculoskeletal disorders (morbidity indicators were chosen according to the advice of the GPs who indicated their patients’ main complaints).

SF6D score was estimated from the SF36 questionnaire using the established algorithm. EQ5D score was calculated using United Kingdom (UK) preference weights [28] because Italian ones were not available. Patients were divided into 10-year age groups, except for the youngest and oldest (<30, 30–39, 40–49, 50–59, 60–69, 70+). Years of education were classified according to the Italian school system. BMI was divided according to US Centers for Disease Control & Prevention guidelines (BMI ≥ 25 = overweight) [29].
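The grouping rules in the paragraph above can be sketched in a few lines of Python. This is only an illustration: the function names, the metric units for height and weight, and the ASCII band labels are assumptions made here, not details reported in the study.

```python
def age_group(age_years: float) -> str:
    """Return the 10-year age band described in the text.

    The youngest (<30) and oldest (70+) bands are open-ended.
    """
    if age_years < 30:
        return "<30"
    if age_years >= 70:
        return "70+"
    lower = int(age_years // 10) * 10
    return f"{lower}-{lower + 9}"


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index; kilograms and metres are assumed units."""
    return weight_kg / height_m ** 2


def is_overweight(bmi_value: float) -> bool:
    """Apply the cut-off quoted in the text (BMI >= 25 = overweight)."""
    return bmi_value >= 25


# Hypothetical patient: 64 years old, 80 kg, 1.75 m.
print(age_group(64), is_overweight(bmi(80, 1.75)))
```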

HUI3

HUI3 is a preference/utility-based measure that describes eight attributes, “vision”, “hearing”, “speech”, “ambulation”, “dexterity”, “emotion”, “cognition” and “pain”, and is able to identify 972,000 health states. The HUI system is based on a “within the skin” approach to health status assessment that concentrates on physical and emotional aspects, ignoring social ones. The score ranges from 0 (death) to 1 (full health), but can also take negative values that indicate states worse than death [30–34].

EQ5D

Euroqol5D is a preference/utility-based measure. The score ranges from 0 (death) to 1 (full health), but may take negative values. It describes five dimensions: “mobility”, “self-care”, “usual activities”, “pain/discomfort” and “anxiety/depression”, and identifies 243 health states [28, 35–37].

SF36

SF36 is a psychometric measure that produces a profile with eight dimensions: “physical functioning”, “role limitations due to physical problems”, “role limitations due to emotional problems”, “pain”, “general health”, “vitality”, “social functioning” and “mental health” [38]. Each of the eight dimensions can take values from 0 (worst health) to 100 (best health) [39].

SF6D

SF6D is a preference/utility-based measure created to obtain a score from the SF36. The eight dimensions of SF36 were reduced to six SF6D dimensions: “physical functioning”, “role limitation”, “social functioning”, “pain”, “mental health” and “vitality”. The SF6D identifies 18,000 health states. The score, which ranges from 0 (death) to 1 (full health), can be calculated from the SF36 if the ten items used to identify the six dimensions of SF6D are completed [16, 40, 41].
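The numbers of health states quoted for the three instruments follow from multiplying the number of response levels across dimensions. The sketch below reproduces that arithmetic. The three levels per EQ5D dimension are stated later in this paper; the per-attribute level counts for HUI3 and SF6D are taken from the standard published classification systems and are an assumption here, since the text does not list them.

```python
from math import prod

# Response levels per dimension (HUI3 and SF6D counts are assumptions
# drawn from the standard classification systems, not from this paper).
LEVELS = {
    "EQ5D": {"mobility": 3, "self-care": 3, "usual activities": 3,
             "pain/discomfort": 3, "anxiety/depression": 3},
    "HUI3": {"vision": 6, "hearing": 6, "speech": 5, "ambulation": 6,
             "dexterity": 6, "emotion": 5, "cognition": 6, "pain": 5},
    "SF6D": {"physical functioning": 6, "role limitation": 4,
             "social functioning": 5, "pain": 6,
             "mental health": 5, "vitality": 5},
}


def n_health_states(instrument: str) -> int:
    """Number of distinct health states an instrument can describe."""
    return prod(LEVELS[instrument].values())


for name in LEVELS:
    print(name, n_health_states(name))
```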

Data analysis

The analysis was carried out using Stata 8.1. The questionnaire score distributions were analysed. Correlation was investigated using the Spearman coefficient. Agreement between scores was investigated using the Bland and Altman [42] method.
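As an illustration of the agreement analysis, the sketch below computes Bland and Altman 95% limits of agreement for two paired sets of scores. It is not the Stata code used in the study: the function name, the 1.96 multiplier and the example scores are assumptions for illustration only.

```python
from statistics import mean, stdev


def limits_of_agreement(scores_a, scores_b, z=1.96):
    """Return (lower limit, mean difference, upper limit).

    Differences are taken as a - b for each patient; the limits are
    mean difference -/+ z times the sample SD of the differences.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    d_bar = mean(diffs)
    s_d = stdev(diffs)
    return d_bar - z * s_d, d_bar, d_bar + z * s_d


# Hypothetical paired scores for four patients (not study data).
sf6d = [0.70, 0.80, 0.60, 0.90]
eq5d = [0.80, 0.85, 0.75, 0.90]
print(limits_of_agreement(sf6d, eq5d))
```

Wide limits relative to the 0 to 1 score scale indicate that two instruments cannot be used interchangeably, even when their scores are well correlated.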

Finally, how the questionnaire scores were influenced by socio-demographic and morbidity indicators was studied by univariate and multivariate analysis, using the nonparametric quantile (Least Absolute Value) regression. Missing values were excluded because they represented only a small percentage of the data (6% for BMI and 3% for chronic diseases) and because the groups of respondents and non-respondents were similar in socio-demographic and morbidity characteristics. Significant (p < 0.05) differences were only found for marital status (the respondent group contained a higher proportion of married persons).
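To give an intuition for median (Least Absolute Value) regression, consider the simplest univariate case: a single 0/1 indicator such as sex. Minimising the sum of absolute residuals for y = a + b·x then reduces to taking the median of each group, so b is the difference between the two group medians. The sketch below covers only this special case, with hypothetical data and names; it is not the estimation routine used in the study and does not handle several covariates.

```python
from statistics import median


def lav_fit_binary(y, x):
    """Least-absolute-value fit of y = a + b*x for a 0/1 covariate x.

    With one binary covariate the absolute-error loss splits into two
    independent sums, each minimised by the group median.
    """
    y0 = [yi for yi, xi in zip(y, x) if xi == 0]
    y1 = [yi for yi, xi in zip(y, x) if xi == 1]
    if not y0 or not y1:
        raise ValueError("both groups must be non-empty")
    a = median(y0)
    b = median(y1) - a
    return a, b


def abs_loss(y, x, a, b):
    """Sum of absolute residuals for a candidate (a, b)."""
    return sum(abs(yi - a - b * xi) for yi, xi in zip(y, x))


# Hypothetical scores: x = 0 for males, 1 for females (not study data).
scores = [0.80, 0.70, 0.90, 0.60, 0.65, 0.75]
female = [0, 0, 0, 1, 1, 1]
a, b = lav_fit_binary(scores, female)
print(a, b)
```

A negative b here would be read in the same way as the regression coefficients reported in the Results: the median score of the group coded 1 lies b points below that of the reference group.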

Multivariate analysis was carried out fitting a forward step-up model. Only variables found to be associated with questionnaire scores by the univariate analysis were entered in the model. It was built up starting from the socio-demographic indicators age and sex, which were considered important factors. Then education and marital status were entered followed by indicators related to lifestyle, smoking status and BMI. Finally, the role of chronic diseases was considered. Wald tests were carried out to compare the new models with the previous ones.

Results

Basic characteristics of population surveyed

The mean age of patients was about 49 years (SD = 18). There was a higher percentage of females (59%) than males (41%). Sixty-one percent of the patients were married. About half of the sample had never smoked. Current and ex-smokers were more numerous among males (25 and 33%, respectively) than females (22 and 14%, respectively). Mean BMI was 24 (SD = 3.9). Thirty-three percent of the sample reported being overweight. Males were more often overweight (44%) than females (25%). Twenty-two percent of patients suffered from hypertension, 28% from musculoskeletal diseases, 8% from cardiovascular diseases and 5% from diabetes (Table 1).

Table 1 Characteristics of the study population (N = 1011)

Questionnaire scores

SF6D scores showed an almost normal distribution (Fig. 1), whereas HUI3 and EQ5D scores were skewed to the left (Figs. 2, 3). SF6D produced a narrower range of values: while EQ5D and HUI3 showed negative scores, the minimum SF6D score was 0.301 (SF6D mean 0.70, SD 0.11, median 0.71, 25th percentile 0.62, 75th percentile 0.79, maximum 0.94; EQ5D mean 0.80, SD 0.20, median 0.80, 25th percentile 0.72, 75th percentile 1, minimum −0.594, maximum 1; HUI3 mean 0.76, SD 0.24, median 0.85, 25th percentile 0.70, 75th percentile 0.91, minimum −0.371, maximum 1). EQ5D showed a ceiling effect (31% of people scored the highest value, compared to 6.5% for HUI3 and 0% for SF6D). The distribution of EQ5D clustered around a few values with a big gap between 0.883 and 1.
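The ceiling effect reported above is simply the share of respondents at the maximum possible score, and clustering can be gauged by counting distinct score values. A minimal sketch follows; the function names, the tolerance and the example scores are assumptions for illustration.

```python
def ceiling_share(scores, top=1.0, tol=1e-9):
    """Proportion of respondents scoring the maximum value `top`."""
    if not scores:
        raise ValueError("scores must not be empty")
    at_top = sum(1 for s in scores if abs(s - top) <= tol)
    return at_top / len(scores)


def n_distinct_values(scores, decimals=3):
    """Number of distinct score values after rounding; a small count
    relative to the sample size suggests clustering."""
    return len({round(s, decimals) for s in scores})


# Hypothetical scores for eight respondents (not study data).
example = [1.0, 1.0, 0.883, 0.883, 0.796, 0.760, 1.0, 0.691]
print(ceiling_share(example), n_distinct_values(example))
```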

Fig. 1 SF6D frequency distribution

Fig. 2 HUI3 frequency distribution

Fig. 3 EQ5D frequency distribution

Correlation and agreement

The relationship between the questionnaire scores is described in Figs. 4–6. SF6D and EQ5D showed the highest level of association. This was confirmed by the Spearman coefficient, which was 0.59 for the association between SF6D and EQ5D, 0.58 for SF6D and HUI3 and 0.57 for EQ5D and HUI3.
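For readers unfamiliar with the Spearman coefficient, it is the Pearson correlation computed on ranks, with tied values given their average rank. Ties matter for scores such as EQ5D, where many respondents share the same value. The sketch below is a plain-Python illustration with hypothetical data, not the Stata routine used in the study.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation of two paired score lists."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical paired scores (not study data).
print(spearman([0.70, 0.80, 0.60, 0.90], [0.80, 1.00, 0.75, 1.00]))
```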

Fig. 4 Relationship between EQ5D and HUI3

Fig. 5 Relationship between SF6D and HUI3

Fig. 6 Relationship between EQ5D and SF6D

The best agreement was found between SF6D and EQ5D (95% limits of agreement from −0.414 to 0.230). The worst agreement was achieved by HUI3 and EQ5D (95% limits of agreement from −0.463 to 0.387).
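If the limits were computed in the usual Bland and Altman form, as the mean difference plus or minus 1.96 standard deviations of the differences, the reported limits imply the following mean differences and standard deviations. The 1.96 multiplier and the direction of the differences (SF6D minus EQ5D, HUI3 minus EQ5D) are assumptions; the text reports only the limits themselves.

```latex
\begin{aligned}
\text{limits} &= \bar{d} \pm 1.96\, s_d \\[4pt]
\text{SF6D vs EQ5D:}\quad
\bar{d} &= \frac{-0.414 + 0.230}{2} = -0.092, &
s_d &= \frac{0.230 - (-0.414)}{2 \times 1.96} \approx 0.164 \\[4pt]
\text{HUI3 vs EQ5D:}\quad
\bar{d} &= \frac{-0.463 + 0.387}{2} = -0.038, &
s_d &= \frac{0.387 - (-0.463)}{2 \times 1.96} \approx 0.217
\end{aligned}
```

Under these assumptions the implied mean differences are close to the differences between the reported mean scores (0.70 − 0.80 and 0.76 − 0.80), and the wider limits for HUI3 and EQ5D reflect a larger spread of individual differences rather than a larger average difference.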

Multivariate analysis

Seven models were constructed for each questionnaire. Table 2 shows the results of the first model, which included age and sex, and the final model, where all the other variables were entered. In the case of SF6D, after adjusting for all the variables, females showed an average score 0.068 points lower than males (regression coefficient = −0.068, p < 0.001). Ex-smokers showed a slightly lower score than non-smokers (regression coefficient = −0.023, p = 0.027). Overweight people had an average score 0.018 points lower than people of normal weight (regression coefficient = −0.018, p = 0.020). Hypertension, cardiovascular diseases and musculoskeletal diseases showed an effect on the scores (respectively: regression coefficient = −0.036, p = 0.001; regression coefficient = −0.035, p = 0.029; regression coefficient = −0.054, p < 0.001). Age, education and marital status did not influence scores.

Table 2 Results of the multivariate analysis investigating the influence of socio-demographic and morbidity indicators on questionnaires scores

In the case of EQ5D, females showed an average score 0.085 points lower than males (regression coefficient = −0.085, p < 0.001). People in the age group 60–69 showed an average score 0.058 points (p = 0.028) lower than people in the youngest group. Married people had a higher average score than unmarried people (regression coefficient = 0.032, p = 0.056). Overweight people had an average score 0.027 points lower than people of normal weight (regression coefficient = −0.027, p = 0.028). Hypertension and musculoskeletal diseases showed an effect on the scores (respectively: regression coefficient = −0.027, p = 0.065; regression coefficient = −0.071, p < 0.001). EQ5D scores were also influenced by smoking status: current smokers had a score 0.035 points lower than non-smokers (regression coefficient = −0.035, p = 0.010).

In the case of HUI3, there were no significant gender differences in score values (females, regression coefficient = −0.012, p = 0.303). The score increased with higher education levels. Hypertension, cardiovascular diseases and musculoskeletal diseases showed an effect on the scores (respectively: regression coefficient = −0.037, p = 0.014; regression coefficient = −0.055, p = 0.013; regression coefficient = −0.081, p < 0.001).

Discussion and conclusions

This study compared three widely used HRQL measures and tried to clarify if they measure HRQL in a similar way. The three measures showed different score ranges and distributions, but EQ5D and HUI3 were more similar to each other in distribution, mean, median, maximum and minimum score than to SF6D. SF6D scores appeared to have a more normal distribution and covered a narrower range of values. EQ5D showed a ceiling effect with 31% of people scoring the highest value. As highlighted by studies with similar results, EQ5D seems to fail to describe mild-severity health levels [16, 27].

In fact, with only three levels for each dimension, EQ5D does not grade between fair and good health. However, SF6D produced higher values for health conditions at the lower end of the scale. The lowest SF6D score obtained in this study was 0.301, while EQ5D and HUI3 produced negative values. Other studies with similar results [11, 13, 21] interpreted this performance as a poor ability of SF6D to distinguish severely impaired status. Overall, these findings seem to confirm those of other studies: SF6D produced higher values at the lower end of the scale and EQ5D at the upper end.

Differences between the questionnaires were also highlighted by the low agreement between scores. The 95% limits of agreement were quite large, ranging approximately from −0.5 to 0.3, which, on a score scale from 0 to 1, is an important discrepancy. The best agreement was achieved by SF6D and EQ5D and the worst by EQ5D and HUI3. These results were probably due both to construct and statistical issues. Regarding the construct issue, HUI3 describes eight dimensions of health, six of which are related to particular aspects (such as vision or hearing), all focussed on the physical area of health. Only two scales are related to emotional and mental health, and none to social aspects of health. In fact, HUI3 is based on a “within the skin” approach to health status assessment that concentrates on physical and emotional areas and eliminates the social one because it is “outside the skin” [31–34]. In contrast, SF6D and EQ5D describe not only physical and emotional dimensions of health, but also the social one. These different constructs could explain the worse agreement between HUI3 and the other measures. Regarding the statistical issue, the particularly poor agreement between HUI3 and EQ5D could be due to the skewed distributions of their scores. They both showed a ceiling effect, but, in the case of EQ5D, there is also major clustering of scores around a few values, which decreases the heterogeneity of the sample and hence the level of agreement, whose statistics rely upon variance. Moreover, extreme score values can produce outliers in the distribution of differences and therefore widen the limits of agreement.

However, the scores showed a good correlation. A Spearman coefficient of 0.6 is described as very high [43]. In our study, the Spearman coefficient was around 0.6 for all pairs of questionnaires, indicating a good level of correlation, especially between SF6D and EQ5D. This level of correlation was similar to, though somewhat lower than, those found in other studies [18, 20].

These findings highlighted that the results of HRQL measures may be influenced by their frameworks and the different methods used to calculate the scores. For example, SF6D could overestimate the health status of persons with severe illness and could be more suitable for surveys on the general population or people with a fair to good health status. EQ5D, by contrast, seems to overestimate middle-severity health status and could therefore be less suitable for describing the health status of the general population and more useful for patients with disabling disease. Moreover, the three questionnaires are not interchangeable, and their results cannot be compared because they show poor agreement, especially between HUI3 and EQ5D. This aspect could be a major issue in comparisons among populations because health status measured with different questionnaires is unlikely to be comparable.

Considering the above, it comes as a surprise to discover that the three measures had similar performance, especially SF6D and EQ5D, in relation to socio-demographic and clinical variables. In fact, this study highlighted that SF6D and EQ5D scores are influenced by the same pattern of factors. Multivariate analysis showed scores of the two questionnaires were influenced, in particular, by gender, with females showing poorer health than males, and by chronic diseases, especially musculoskeletal. HRQL seems to be influenced by the impact that diseases have on daily life rather than by the severity or possible complications of a disease. In fact, musculoskeletal diseases, usually painful and debilitating, influence HRQL more than hypertension or cardiovascular diseases. The two questionnaires show a gender difference in health, though both are also influenced by factors related to lifestyle, such as smoking and BMI.

HUI3, on the contrary, had slightly different performance. It seemed to be influenced by educational level and especially hypertension, cardiovascular and musculoskeletal diseases. However, HUI3 did not reveal health differences between males and females and did not seem to be influenced by factors related to lifestyle. This different performance could be related, as mentioned above, to the approach of the questionnaire, which focusses on physical and emotional aspects and excludes social ones.

These results give rise to some considerations. First, the scores of all questionnaires were influenced by musculoskeletal diseases, which are conditions characterised by physical pain, difficulty of movements, immobility and, often, partly because of the associated pain, deterioration in daily activities, and which, therefore, could have an effect also on vitality and social life. Therefore, HRQL seems to be influenced more by painful, and consequently daily-activities-limiting, conditions than by diseases that may be more serious, but are often asymptomatic. This could be a limit as well as a way to highlight different aspects of health. In fact, HRQL measures may underestimate health status for painful, but not fatal diseases, while overestimating health status in the case of serious, but asymptomatic diseases. HRQL measures may therefore help detect health needs that would otherwise remain concealed, but should probably be used to integrate other health measures, such as mortality, which are more objective, but more crude, or they could be “adjusted” for morbidity conditions assessed by more objective methods (such as morbidity indexes, which describe the severity of a disease).

Secondly, with the exception of EQ5D, none of the questionnaires seemed to be influenced by age after adjusting for the other variables. This suggests that most of the decrease in HRQL in old age is due to factors other than age itself, such as diseases or other conditions like loneliness, which are more frequent in the elderly.

Thirdly, SF6D and EQ5D show a similar capacity for discrimination, while HUI3 seems to be less able to distinguish different categories of people. In particular, EQ5D seems to be the only one to detect some health differences between age groups and between smokers and non-smokers, whilst SF6D identifies differences between non-smokers and ex-smokers. These results should be considered when choosing a measure. Although the questionnaires have diverse frameworks and their crude scores may be different and difficult to compare, they appeared to be influenced by socio-demographic and morbidity variables in a similar way, especially EQ5D and SF6D. This shifts emphasis from the structural and construct similarities of different instruments to the behaviour that they reveal when applied in the field. The study endeavoured to examine the performance of the three measures when used to describe patients’ condition and their determinants instead of merely comparing ranges or distributions of scores. The results obtained could help in the choice of instrument, also considering that this study did not focus on a group of patients with a specific disease, so that the results are more generalisable.

The present study has some limitations: (1) all the information about morbidity indicators was self-reported by patients, so some patients could be misclassified; (2) participation was voluntary, so there could be selection bias; (3) refusals were not recorded, so it was impossible to assess whether people who refused to answer the questionnaires differed from people who agreed. For the aim of the study, these possible sources of error should not be of great concern, because the biases would involve all three instruments in the same way, and the comparison would not be altered. They could, however, affect the generalisability of the findings. Another problem could be the order in which the questionnaires were administered. Since they were always administered in the same order, the last one could have suffered from loss of accuracy. However, EQ5D, the last one administered to patients, is the shortest and easiest, so its completion was as good as for the other two.

In conclusion, our results show that EQ5D and HUI3 were closer to one another in many ways (score distribution, mean, median, minimum and maximum), but SF6D and EQ5D scores were more similar in the way they were influenced by socio-demographic and morbidity indicators. In some cases, such as for smoking and age, EQ5D had better discrimination capacity. It is difficult to determine which is the best instrument, but, apart from a descriptive capacity similar to or better than that of the other instruments, EQ5D seems to have the advantage of being easier to answer.