Health-related quality of life (HRQoL) has been extensively used worldwide as a multidimensional concept that could be used to assess an individual’s health status based on physical, mental, and social functioning [1,2,3]. HRQoL can be evaluated using generic preference-based measures (GPBMs), which are commonly used in economic evaluations of healthcare interventions [4, 5]. A GPBM consists of a health state descriptive system and a corresponding country-specific health utility value set elicited from a representative sample of the general population. The health utility lies on a standard scale, where the upper boundary 1 refers to full health, 0 refers to death, and values lower than 0 refer to the health states that are deemed as worse than death. It provides a standardized weight to interpret the severity of the health state [6]. Given the acceptable cognitive burden for the respondents, the GPBMs are increasingly used in population health surveys [7,8,9]. A population health survey provides integral information on the overall situation and longitudinal trend of the health status of the residents, as well as the empirical evidence for supporting healthcare decision-making [10, 11].

The EQ-5D and the Short Form Six-Dimension (SF-6D) are the two most frequently used GPBMs worldwide [12, 13]. The EQ-5D was developed by the EuroQol Group, and currently has two versions, i.e., the EQ-5D-3L and the EQ-5D-5L. Both versions have the same dimensions to describe health states, while having different response levels (three levels in EQ-5D-3L and five levels in EQ-5D-5L) for each dimension [14, 15]. In comparison with the EQ-5D-3L, the EQ-5D-5L defines a wider range of health state descriptions, thus reducing ceiling effects and enhancing discriminant properties [15,16,17]. The original version of the SF-6D (SF-6Dv1) was developed based on the 36-item Short-Form Health Survey (SF-36) in 2002 and comprises six dimensions [18]. These dimensions are combined with four to six levels of severity, yielding up to 18,000 health states [18]. Another version of the SF-6D was developed based on the 12-item Short-Form Health Survey (SF-12) in 2004 [19]. It has the same six dimensions but different levels in each (three to five levels), defining 7500 health states [19]. More detailed information and empirical evidence of the difference between these two versions can be found elsewhere [9, 18, 19]. The newest version of the SF-6D, the SF-6Dv2, was recently developed by revising the ambiguity between the dimension levels and unifying the inconsistency of positive and negative wording in the SF-6Dv1 [20, 21].

Several studies have compared the measurement properties of the EQ-5D and SF-6D in various diseases, such as diabetes, cardiovascular disease, cancer, chronic obstructive pulmonary disease, and end-stage renal disease [22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41]. All of these studies compared the SF-6Dv1 with the EQ-5D-3L or EQ-5D-5L. A common finding in most studies was that the EQ-5D and SF-6D appeared to be generally reliable, valid, and responsive (or sensitive) in measuring HRQoL among disease populations. Although the test–retest reliability of the SF-6D might be higher than that of the EQ-5D [33], the results comparing discriminant validity (or known-group validity) and responsiveness were not consistent across studies [22, 23, 25,26,27, 29,30,31,32,33, 38, 40, 41].

Nevertheless, there have been only a few comparisons between the EQ-5D and SF-6D among the general population or in population health surveys [11, 42,43,44,45,46]. Most of these studies involved the EQ-5D-3L and SF-6Dv1, except for one study that compared the EQ-5D-5L with the SF-6D (derived from the SF-12) in the Thai general population [42]. Although generally good convergent validity between the EQ-5D and SF-6D was observed [42,43,44], discriminant validity varied across studies. For example, Zhao et al. [43] found that the SF-6Dv1 had a higher level of discriminant validity than the EQ-5D-3L, while Bharmal et al. [45] showed that the EQ-5D-3L had better discriminative power than the SF-6Dv1. Responsiveness was compared in only one study, which found that the EQ-5D-5L was more responsive than the SF-6D (derived from the SF-12) among respondents with worse health status [42]. No studies have compared the reliability of the EQ-5D and SF-6D in the general population, and none of the studies above involved the SF-6Dv2. Therefore, evidence comparing the measurement properties of the SF-6Dv2 and EQ-5D in the general population, especially in population health surveys, is still lacking worldwide.

Although the SF-6Dv2 has been used in various countries [47, 48], and the Chinese version of the SF-6Dv2 and its corresponding utility value set have been developed recently [49, 50], its measurement properties remain to be evaluated and compared with those of the EQ-5D-5L. Therefore, the aim of this study was to assess and compare the measurement properties of the SF-6Dv2 and EQ-5D-5L in a large-sample health survey among the Chinese population.


Data source

Data used in the study were obtained from the 2020 Tianjin Health Service Survey, which was conducted by the Tianjin Health Commission between July and August 2020 [51]. Tianjin is one of the four municipalities of China, with 16 districts and a permanent population of more than 15 million [52]. A multi-stage, stratified cluster random sampling strategy was used. First, five subdistricts (or townships) in each of the 16 districts were randomly selected. Second, two communities (or villages) were randomly selected within each of the 80 subdistricts (or townships). Third, 60 households were randomly selected within each of the 160 communities (or villages), giving a total of 9600 households. All residents registered under each household were invited to participate in the survey.
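The stage sizes described above can be checked with a short sketch. It is written in Python purely for illustration; the constant and function names are hypothetical and are not part of the survey protocol.

```python
# Stage sizes as stated in the text; all names are illustrative only.
DISTRICTS = 16
SUBDISTRICTS_PER_DISTRICT = 5
COMMUNITIES_PER_SUBDISTRICT = 2
HOUSEHOLDS_PER_COMMUNITY = 60


def stage_totals():
    """Return the cumulative number of sampled units at each stage."""
    subdistricts = DISTRICTS * SUBDISTRICTS_PER_DISTRICT
    communities = subdistricts * COMMUNITIES_PER_SUBDISTRICT
    households = communities * HOUSEHOLDS_PER_COMMUNITY
    return {
        "subdistricts": subdistricts,
        "communities": communities,
        "households": households,
    }


totals = stage_totals()
print(totals)
```

The three totals correspond to the 80 subdistricts, 160 communities, and 9600 households reported in the text.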

To comply with the COVID-19 administrative policy in China, data for the 2020 Tianjin Health Service Survey were collected through three approaches: face-to-face paper-based interviews at the resident’s home, face-to-face paper-based interviews at designated public places (governmental subdistrict office or community health service center), and self-report at the resident’s home. The face-to-face interview proceeded as follows. First, the respondent who was most familiar with the family’s situation answered the basic questions, including the annual household medication expenditure and the distance from home to the closest healthcare institution. Second, all respondents provided demographic characteristics (e.g., gender and age) and socioeconomic status (e.g., education level, marital and employment status). Third, respondents aged ≥ 15 years completed both the EQ-5D-5L and SF-6Dv2, then answered health indicator questions, including the presence of chronic diseases, presence of health examinations, and presence of illnesses in the last two weeks. Fourth, parents of children aged < 5 years were asked about the number of health examinations within the past twelve months and the presence of vaccination certificates. Fifth, female respondents aged 15–64 years were asked about the number of their children and the place of delivery. Last, all respondents were asked about their knowledge of and satisfaction with the hierarchical diagnosis and treatment model developed in China. Informed consent was obtained from all respondents included in the survey. Detailed information on sampling and data collection can be found elsewhere [51].

For this study, data collected in the second and third parts of the survey were used. Respondents aged < 18 years were excluded from this study since both the EQ-5D-5L and SF-6Dv2 are recommended to be used among adult respondents [20, 53]. Respondents were also required to meet the following inclusion criteria: (1) had no missing data for the EQ-5D-5L and SF-6Dv2 measures; and (2) had no missing data for the variables used in this study, including demographic characteristics, socioeconomic status, and health indicators.



The EQ-5D-5L descriptive system comprises five dimensions, namely, mobility, self-care, usual activities, pain/discomfort, and anxiety/depression, each with five levels of severity (no, slight, moderate, severe, and extreme problems). A visual analog scale (hereafter EQ VAS) using a scale ranging from 0 (worst imaginable health state) to 100 (best imaginable health state) is also included in the EQ-5D-5L [15]. The EQ-5D-5L defines 3125 (= 5^5) different health states according to all the possible combinations of dimension levels. The Chinese EQ-5D-5L utility value set was developed using the time trade-off (TTO) approach, with the range of − 0.391 (state 55555) to 1 (state 11111) [54].


The SF-6Dv2 is derived from 10 items of the SF-36. The health state classification system of SF-6Dv2 comprises six dimensions, including physical functioning, role limitation, social functioning, pain, mental health, and vitality. The pain dimension has six response levels, while all others have five levels, resulting in 18,750 (= 5*5*5*6*5*5) different health states [20]. The Chinese SF-6Dv2 value set was developed using the TTO approach, with the range of − 0.277 (state 555655) to 1 (state 111111) [49].
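As an illustration of how the two descriptive systems generate their health states, the following sketch counts and enumerates the states from the number of levels per dimension. It is a minimal Python example with hypothetical names; it does not compute utility values, because the value set coefficients are not reproduced here.

```python
from itertools import product
from math import prod

# Number of response levels per dimension, as described in the text.
EQ5D5L_LEVELS = [5, 5, 5, 5, 5]
# Order: physical functioning, role limitation, social functioning,
# pain (six levels), mental health, vitality.
SF6DV2_LEVELS = [5, 5, 5, 6, 5, 5]


def count_states(levels):
    """Number of distinct health states defined by a descriptive system."""
    return prod(levels)


def enumerate_states(levels):
    """Yield every health state as a digit string, e.g. '11111'."""
    for combo in product(*(range(1, n + 1) for n in levels)):
        yield "".join(str(level) for level in combo)


eq_states = list(enumerate_states(EQ5D5L_LEVELS))
sf_states = list(enumerate_states(SF6DV2_LEVELS))
print(count_states(EQ5D5L_LEVELS), eq_states[0], eq_states[-1])
print(count_states(SF6DV2_LEVELS), sf_states[0], sf_states[-1])
```

The first and last enumerated states are the best and worst states of each measure, which are the anchors of the utility ranges quoted above.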

Statistical analysis

Descriptive statistics

The characteristics of respondents were described using means and standard deviations (SD) for continuous variables and frequencies and proportions for categorical variables. The distribution of response levels on each dimension of EQ-5D-5L and SF-6Dv2 was reported using histograms. Descriptive statistics (mean, SD) for the EQ-5D-5L and SF-6Dv2 utility values, and the EQ VAS scores were also computed. The EQ VAS scores were adopted as an indicator of self-reported health status, which was classified into four sub-groups: < 65 (bad), 65–79 (fair), 80–89 (good), and 90–100 (excellent) in this study [27, 41, 55].
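The EQ VAS grouping described above can be written as a small classification function. This is a sketch in Python with a hypothetical function name; the cut-offs are those stated in the text, and the handling of scores outside 0 to 100 is an assumption.

```python
def classify_eq_vas(score):
    """Map an EQ VAS score (0-100) to the four self-reported health groups."""
    if not 0 <= score <= 100:
        # Assumption: out-of-range scores are treated as invalid input.
        raise ValueError("EQ VAS score must lie between 0 and 100")
    if score < 65:
        return "bad"
    if score < 80:
        return "fair"
    if score < 90:
        return "good"
    return "excellent"


for value in (64, 65, 79, 80, 89, 90, 100):
    print(value, classify_eq_vas(value))
```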


The agreement between the EQ-5D-5L and SF-6Dv2 utility values was examined using the intraclass correlation coefficient (ICC), which was computed with the two-way mixed-effects model based on absolute agreement [56]. An ICC above 0.7 suggests an acceptable agreement [57]. In addition, given that the distributions of utility values were highly skewed, the paired comparisons between the EQ-5D-5L and SF-6Dv2 utility values were examined using the Wilcoxon signed-rank test [34].
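To make the agreement statistic concrete, the sketch below computes a two-way, absolute-agreement ICC for two paired measures from the usual ANOVA mean squares. It is an illustration in Python, not the analysis code used in the study (which was run in STATA). The text does not state whether the single-measure or average-measure form was reported, so the single-measure form is assumed here, and the function name is hypothetical.

```python
def icc_absolute_agreement(x, y):
    """Single-measure, two-way, absolute-agreement ICC for paired values.

    x and y hold the two utility values of the same respondents, in the
    same order. Assumption: single-measure form with k = 2 measures.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    n, k = len(x), 2
    rows = list(zip(x, y))
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [(a + b) / k for a, b in rows]
    col_means = [sum(x) / n, sum(y) / n]

    # Sums of squares for respondents (rows), measures (columns), and error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = sum(
        (value - row_means[i] - col_means[j] + grand) ** 2
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
    )

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Toy data: the second measure is shifted by a constant, so the ranking is
# identical but absolute agreement is imperfect.
print(icc_absolute_agreement([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

Because the coefficient penalizes a systematic shift between the two measures, two instruments can rank respondents identically and still show an ICC well below 1.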

Measurement properties of the EQ-5D-5L and SF-6Dv2

The measurement properties evaluated in this study included the ceiling and floor effects, convergent validity, discriminant validity, agreement, and sensitivity of the EQ-5D-5L and SF-6Dv2.

Ceiling and floor effects

Ceiling and floor effects for each measure were assessed by examining the percentage of respondents in the best and worst health states, respectively. These effects are considered present if more than 15% of the respondents fall at either extreme end of the scale [58].
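The ceiling and floor criterion can be sketched as follows. The example is in Python, the function names are hypothetical, and the ten health states in the sample are made-up toy data rather than survey responses.

```python
CEILING_FLOOR_THRESHOLD = 15.0  # percent, as stated in the text


def extreme_state_shares(states, best, worst):
    """Return (ceiling %, floor %) for a list of health state codes."""
    total = len(states)
    if total == 0:
        raise ValueError("at least one health state is required")
    ceiling = 100.0 * sum(state == best for state in states) / total
    floor = 100.0 * sum(state == worst for state in states) / total
    return ceiling, floor


def has_effect(percent, threshold=CEILING_FLOOR_THRESHOLD):
    """An effect is flagged when the share exceeds the threshold."""
    return percent > threshold


# Toy EQ-5D-5L data (illustrative only).
sample = ["11111"] * 7 + ["11121"] * 2 + ["55555"]
ceiling, floor = extreme_state_shares(sample, best="11111", worst="55555")
print(ceiling, has_effect(ceiling))
print(floor, has_effect(floor))
```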

Convergent validity

Convergent validity refers to the extent to which an outcome of interest (such as the pain/discomfort dimension in EQ-5D-5L) shows an expected association with another similar outcome (such as the pain dimension in SF-6Dv2) measured at the same time point [30, 59]. Convergent validity was assessed by examining the correlation between EQ-5D-5L and SF-6Dv2 dimensions using Spearman’s rank correlation coefficient (r). An absolute coefficient value greater than 0.5 stands for a strong correlation, values between 0.35 and 0.49 for moderate, values between 0.2 and 0.34 for weak, and values smaller than 0.2 for poor correlation [28, 60].
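A minimal sketch of the rank correlation and its interpretation is given below. It is written in Python with hypothetical function names and computes Spearman's coefficient as the Pearson correlation of average ranks, which handles the many ties expected in ordinal dimension levels. The stated bands leave small gaps (for example, between 0.49 and 0.5), so treating each lower bound as inclusive is an assumption of this sketch.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of average ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def correlation_strength(r):
    """Label a coefficient using the bands in the text (bounds inclusive)."""
    size = abs(r)
    if size >= 0.5:
        return "strong"
    if size >= 0.35:
        return "moderate"
    if size >= 0.2:
        return "weak"
    return "poor"


print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))
print(correlation_strength(0.69), correlation_strength(0.30))
```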

Discriminant validity

The mean utility value of each measure was calculated and compared to evaluate the capacity to discriminate between each of the respondents’ characteristic groups. Differences were tested using t-tests for dichotomous variables (e.g., gender) and one-way analyses of variance for polytomous variables (e.g., age group and body mass index [BMI] group). Effect sizes (ES) were also used to quantify the discriminative capacity of the EQ-5D-5L and SF-6Dv2, calculated as the difference between the mean utility values of two sub-groups divided by the pooled standard deviation [61]. For polytomous variables, the effect sizes between the extreme sub-groups (e.g., between the sub-group aged 18–29 years and the sub-group aged ≥ 70 years) were calculated [11]. A larger effect size indicates a better discriminative ability of the measure [11, 34, 36, 42, 62]. As an extended test of validity, known-group validity was used to assess the extent to which an outcome measure of interest helps distinguish between subgroups that are theoretically expected to differ [30]. Based on the published literature [34, 42, 44], we hypothesized that older, female, and obese respondents, as well as respondents with poorer self-reported health status or with chronic diseases such as hypertension and diabetes, would have lower utility values.
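The effect size defined above can be sketched as follows. The text specifies only "the pooled standard deviation", so the degrees-of-freedom weighted pooling used here, and the reporting of the absolute value, are assumptions; the function names are hypothetical and the two groups are toy data.

```python
from statistics import mean, variance


def pooled_sd(group_a, group_b):
    """Pooled standard deviation weighted by degrees of freedom (assumed form)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return pooled_var ** 0.5


def effect_size(group_a, group_b):
    """Absolute difference in mean utility divided by the pooled SD."""
    return abs(mean(group_a) - mean(group_b)) / pooled_sd(group_a, group_b)


# Toy utility values for two sub-groups (illustrative only).
younger = [0.95, 1.00, 0.90, 1.00]
older = [0.80, 0.85, 0.70, 0.90]
print(round(effect_size(younger, older), 3))
```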


The sensitivity of the EQ-5D-5L and SF-6Dv2 for detecting differences in both external and self-reported health indicators was tested using the relative efficiency (RE) statistic. RE was calculated as the ratio of the squared t-statistic from the t-test of the comparator measure (SF-6Dv2) to that of the reference measure (EQ-5D-5L) [42, 43, 46]. An RE value of 1.0 indicates that the SF-6Dv2 has the same efficiency as the EQ-5D-5L at detecting differences in these external health indicators. A value higher than 1 indicates that the SF-6Dv2 is more sensitive than the EQ-5D-5L, while a value lower than 1 means the opposite [63]. The receiver operating characteristic (ROC) curve was also used to evaluate the sensitivity of these two measures. The ROC curve provides a useful method to assess the performance of measures against external dichotomous variables of health status [64]. The area under the ROC curve (AUC) was computed to compare the discriminative power of the EQ-5D-5L and SF-6Dv2 [65]. The measure that generates the larger AUC is regarded as more sensitive or effective at detecting differences; a measure with perfect discriminative ability would have an AUC of 1.0, whereas an AUC of 0.5 indicates no discriminative capacity [63]. For the current analyses, the presence of chronic diseases (i.e., hypertension and diabetes), illnesses in 2 weeks, and hospitalizations in 12 months represented the external health indicators. The self-reported health status was dichotomized as (1) excellent versus good, fair, or bad, (2) excellent or good versus fair or bad, and (3) excellent, good, or fair versus bad.
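The two sensitivity statistics can be illustrated with the sketch below. It is a Python example with hypothetical function names, not the STATA code used in the study. Two assumptions are made: the t-statistic is the pooled-variance (Student) form, and the AUC is computed as the probability that a respondent with the condition has a lower utility value than one without it, counting ties as one half. The pairwise loop is quadratic in sample size and is meant for illustration only.

```python
from statistics import mean, variance


def t_statistic(group_a, group_b):
    """Two-sample t-statistic with pooled variance (assumed form)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5


def relative_efficiency(t_comparator, t_reference):
    """Squared t of the comparator (SF-6Dv2) over that of the reference (EQ-5D-5L)."""
    return t_comparator ** 2 / t_reference ** 2


def percent_more_efficient(re_value):
    """Express an RE value as a percentage gain over the reference measure."""
    return (re_value - 1.0) * 100.0


def auc(with_condition, without_condition):
    """Share of pairs in which the respondent with the condition scores lower."""
    wins = 0.0
    for u in with_condition:
        for v in without_condition:
            if u < v:
                wins += 1.0
            elif u == v:
                wins += 0.5
    return wins / (len(with_condition) * len(without_condition))


# Toy utility values (illustrative only).
ill = [0.60, 0.75, 0.80, 1.00]
healthy = [0.85, 0.90, 1.00, 1.00]
print(t_statistic(healthy, ill))
print(percent_more_efficient(relative_efficiency(3.0, 2.0)))
print(auc(ill, healthy))
```

Under this reading, an RE of 1.29 corresponds to the comparator being 29% more efficient than the reference measure.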

The statistical analyses were performed using STATA 15.0 (StataCorp LLC, College Station, TX, USA). All reported statistical tests were performed two-sided with a significance level of 0.05.


Descriptive statistics

Of 24,151 respondents who participated in the survey, 4974 respondents were excluded from the current analyses because they were under 18 years (N = 3754), had not completed the EQ-5D-5L or SF-6Dv2 (N = 329), or had missing values among questions included in this study (N = 891). Finally, a total of 19,177 respondents were included (Fig. 1). As shown in Table 1, 49.3% (N = 9453) of respondents were male, and the mean (SD) age was 55.2 (16.2) years, with a range from 18 to 102 years. Hypertension and diabetes were reported by 35.5% (N = 6806) and 13.5% (N = 2586) of respondents, respectively.

Fig. 1 The flowchart of the sample inclusion for the comparison study

Table 1 Characteristics of respondents and EQ-5D-5L and SF-6Dv2 utility values (N = 19,177)

The distributions of responses to the EQ-5D-5L and SF-6Dv2 are presented in Fig. 2. The large majority of respondents reported no problems (level 1) on each of the five EQ-5D-5L dimensions, with the highest proportion in self-care (92.8%), followed by anxiety/depression (90.4%), usual activities (89.6%), mobility (86.5%), and pain/discomfort (77.9%). Similarly, a large proportion of respondents were classified in level 1 on the SF-6Dv2 dimensions, with the highest proportion in mental health (77.4%), followed by social functioning (75.0%), role limitation (71.3%), pain (70.7%), vitality (63.1%), and physical functioning (46.7%).

Fig. 2 The distribution across levels of the EQ-5D-5L and SF-6Dv2 dimensions (N = 19,177). Note: Except for the SF-6Dv2 pain dimension, which has six response levels, all dimensions have five levels, with higher levels representing more severe health states

Of the total 19,177 respondents, the mean (SD) utility value of EQ-5D-5L was 0.939 (0.168), while that of SF-6Dv2 was 0.872 (0.184). The mean (SD) score of EQ VAS was 84.4 (14.0) (Table 1).


The ICC between the EQ-5D-5L and SF-6Dv2 utility values of the total sample was 0.780 (p < 0.05). In addition, the SF-6Dv2 utility values were significantly lower than those of the EQ-5D-5L (p < 0.001).

Measurement properties of the EQ-5D-5L and SF-6Dv2

Ceiling and floor effects

The proportion of respondents reporting the best state of the EQ-5D-5L was 72.8% (N = 13,961), indicating a strong ceiling effect, while only 0.2% (N = 35) of respondents reported the worst state. Similarly, 36.1% (N = 6921) of respondents reported the best state of the SF-6Dv2, indicating a ceiling effect for the SF-6Dv2, while only 0.1% (N = 16) of respondents reported the worst state.

Convergent validity

The dimensions of the EQ-5D-5L and SF-6Dv2 were positively associated, with Spearman’s rank correlation coefficients ranging from 0.30 to 0.69 (p < 0.001), that is, from weak to strong. As expected, the EQ-5D-5L pain/discomfort dimension was strongly correlated with the SF-6Dv2 pain dimension (r = 0.69), and the EQ-5D-5L anxiety/depression dimension was strongly correlated with the SF-6Dv2 mental health dimension (r = 0.52) (Additional file 1).

Discriminant validity

As reported in Table 2, both the EQ-5D-5L and SF-6Dv2 utility values were significantly different (p < 0.001) across groups defined by demographic characteristics, socioeconomic status, and health-related indicators, with effect sizes ranging from 0.061 to 2.256 for the EQ-5D-5L and from 0.126 to 2.675 for the SF-6Dv2. The effect sizes of the SF-6Dv2 were generally larger than those of the EQ-5D-5L. Moreover, the hypotheses for known-group validity were fulfilled in all tested groups (Table 2).

Table 2 Discriminative capacity and univariate analyses for EQ-5D-5L and SF-6Dv2 utility values within different groups (N = 19,177)


As shown in Table 3, the SF-6Dv2 was found to be 29.0–179.2% more efficient than the EQ-5D-5L at detecting differences in external health indicator groups, including hypertension, diabetes, other chronic diseases, illnesses in 2 weeks, and hospitalizations in 12 months. The SF-6Dv2 also had a 50.7–102.8% higher efficiency at revealing differences between self-reported health status groups dichotomized by “excellent” or “good” (Table 4). However, when the groups were dichotomized by “bad”, the EQ-5D-5L was found to be 8.2% more efficient at detecting the differences in self-reported health status (Table 4). The AUC values of both the SF-6Dv2 and EQ-5D-5L were above 0.5 with statistically significant differences (p < 0.001) (Tables 3, 4). The SF-6Dv2 generated higher AUC scores than the EQ-5D-5L, suggesting that it may be the more sensitive measure.

Table 3 Sensitivity of EQ-5D-5L and SF-6Dv2 to detect differences in dichotomous health indicators (N = 19,177)
Table 4 Sensitivity of EQ-5D-5L and SF-6Dv2 to detect differences in dichotomous self-reported health status (N = 19,177)


Both the EQ-5D and SF-6D have been widely applied in populations with specific diseases [22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41], while evidence on the comparison of their measurement properties in the general population is still lacking. To the best of our knowledge, this study provides the first evidence comparing the measurement properties of the EQ-5D-5L and SF-6Dv2 in a large sample of the Chinese population.

While no floor effects were observed for either the EQ-5D-5L or SF-6Dv2 (0.2% vs. 0.1%), large ceiling effects (72.8% vs. 36.1%) were found for both measures. Previous studies conducted in the general population reported ceiling effects of approximately 43.3–73.6% for the EQ-5D-3L [11, 43,44,45,46], 49.1–54.0% for the EQ-5D-5L [42, 67], and 1.0–18.3% for the SF-6Dv1 [11, 43,44,45,46]. The ceiling effects found in this study were relatively higher than those in previous studies. One possible reason is that the Chinese population is less willing than Western populations to report health problems because of cultural tradition [68], which is consistent with previous studies in which Chinese populations showed higher ceiling effects than Western populations [43, 44]. Another potential reason is that the respondents included in this study were in relatively good health. Only 8.6% of them had experienced illnesses in the 2 weeks before the survey, which was much lower than the proportion in a study conducted among the general population in Chengdu, China [43]. Moreover, the EQ-5D-5L showed a higher ceiling effect than the SF-6Dv2 in this study, which is consistent with previous studies where the EQ-5D-5L and SF-6D were compared in both general and disease populations [23, 27, 42]. This can be partly explained by the difference in the recall period, as the SF-6D frames its questions in terms of health “over the last 4 weeks”, while “today” is used in the EQ-5D. A longer recall period may give respondents more scope to include minor impairments affecting their HRQoL that might not be detected over a relatively short period [69].

The ICC value between the EQ-5D-5L and SF-6Dv2 utility values indicated moderate agreement (ICC = 0.780). This result is higher than those found in two previous studies. In one of the two studies, the ICC between the EQ-5D-5L and the SF-6D (derived from the SF-12) was 0.510 [42]. In the other study, the ICC between the EQ-5D-3L and SF-6Dv1 was 0.536 [44]. These findings suggest that the SF-6Dv2 and EQ-5D-5L are similar in detecting trends in health utility values, but may differ in the absolute level of HRQoL measured. This could be partly explained by the different dimensions covered and the different utility ranges of the two measures (− 0.391 to 1 for EQ-5D-5L vs. − 0.277 to 1 for SF-6Dv2) [49, 54]. Therefore, the utility values of the SF-6Dv2 and EQ-5D-5L may not be interchangeable.

The correlation between the EQ-5D-5L and SF-6Dv2 dimensions (r = 0.30–0.69) was also acceptable, and better than the values in a previous study in which the EQ-5D-3L and SF-6Dv1 were compared (r = 0.20–0.51) [43]. Both the EQ-5D-5L and SF-6Dv2 showed the expected utility differences between sociodemographic and health-related groups. However, these differences tended to be more apparent for the SF-6Dv2, which had larger effect sizes (ES = 0.061–2.256 for EQ-5D-5L and 0.126–2.675 for SF-6Dv2). One possible reason is that the SF-6Dv2 has one more dimension, resulting in a larger descriptive system than the EQ-5D-5L (18,750 vs. 3125 health states). However, this result differs from those of two previous studies. One study compared the EQ-5D-5L with the SF-6D (derived from the SF-12) in the Thai general population (ES = 0.31–1.62 for EQ-5D-5L and 0.08–0.67 for SF-6D) [42]. The other study compared the EQ-5D-3L with the SF-6Dv1 in the Spanish general population (ES = 0.17–1.33 for EQ-5D-3L and 0.14–1.33 for SF-6Dv1) [11]. An explanation for these contrasting findings might be that the SF-6Dv2 has revised dimension levels and can describe more health states than the SF-6Dv1 or the SF-6D derived from the SF-12. Consequently, the known-group validity of the SF-6Dv2 might be improved, which is supported by previous evidence [47].

Although both the SF-6Dv2 and EQ-5D-5L were shown to be sensitive and efficient in this study, some merits of each measure are worth emphasizing. The SF-6Dv2 was more sensitive than the EQ-5D-5L in distinguishing between groups defined by external health indicators. However, for the self-reported health status groups dichotomized by EQ VAS score, the relative sensitivity of the EQ-5D-5L and SF-6Dv2 varied with the choice of cut-off point. The EQ-5D-5L was more sensitive when the groups were split at the most impaired level of self-reported health status (“bad”). These findings are inconsistent with two previous studies, which compared the SF-6D with the EQ-5D-3L and EQ-5D-5L, respectively [42, 46]. The AUC of the SF-6Dv2 (0.663–0.870) was always higher than that of the EQ-5D-5L (0.605–0.833) in all tested groups. This finding is similar to that of a study conducted in the US general population [46], but contrary to that of a study carried out in the Spanish general population [11], both of which compared the EQ-5D-3L with the SF-6Dv1. Thus, which of the two measures is more sensitive remains unclear. Further studies are required to provide more evidence regarding this issue.

This study has several limitations. First, the respondents were recruited in one city and their average age was relatively high, which may limit how well the sample represents the general population in China. Second, both face-to-face interviews and self-reports were used to complete the questionnaire, which may affect the validity of the results to some extent. Third, because the health survey focused mainly on the accessibility of and satisfaction with health services, the number of external indicators of health status was limited. Fourth, this study was based on cross-sectional rather than longitudinal data. Therefore, it was not possible to evaluate and compare test–retest reliability and longitudinal responsiveness. Further investigations using longitudinal data are required to compare the test–retest reliability and responsiveness of the SF-6Dv2 and EQ-5D-5L.


The SF-6Dv2 and EQ-5D-5L were shown to be comparably valid and sensitive when used in a Chinese population health survey. Given that the ICC value between the SF-6Dv2 and EQ-5D-5L is moderate and the utility values obtained from the two measures are systematically different, the SF-6Dv2 and EQ-5D-5L do not appear to be interchangeable. Further research with a representative sample of the general population in China is needed to compare additional measurement properties of these two measures, such as test–retest reliability and longitudinal responsiveness.