Introduction

The generic preference-based measures EQ-5D and SF-6D are two well-known and widely used instruments for measuring health-related quality of life (HRQOL). Their responses can be converted into health utility scores (HUS) for use in clinical trials, economic evaluations and population health surveys [1,2,3,4]. For example, EQ-5D has been the HRQOL instrument in the China National Health Service Survey since 2008 [2]. Both EQ-5D and SF-6D are recommended for utility measurement in health economic evaluation by the China Pharmaceutical Economics Evaluation Guidelines (2020 edition) [3].

The EQ-5D was developed by EuroQol in 1996 [5]. It currently has two versions, EQ-5D-3L and EQ-5D-5L, both of which include five dimensions. The original version, EQ-5D-3L, categorizes each of the five dimensions into three severity levels (no problems, moderate problems, and extreme problems) and can therefore define 243 (3^5) unique health states. Because the EQ-5D-3L was found to be insensitive to mild or even moderate differences in HRQOL and greatly limited by the ceiling effect [6], EQ-5D-5L was developed in 2009 with two more levels in each dimension (slight problems and severe problems) [7]. EQ-5D health states can be converted into HUS, anchored at 1 (full health) and 0 (death), using country- or population-specific value sets. As expected, EQ-5D-5L has demonstrated better measurement properties than EQ-5D-3L [7,8,9]. The SF-6D, which is based on the SF-36, was developed by Brazier et al. in 2002 [10]. It has two versions (SF-6DV1 and SF-6DV2) corresponding to the two versions of the SF-36 [11]. SF-6DV1 has the disadvantages of unclear severity ordering within dimensions and limited sensitivity [12]. SF-6DV2 addresses them by simplifying the level descriptions and providing clearer wording, and thus has better reliability and validity [11, 13,14,15]. Similarly, each SF-6D health state can be translated into a HUS using an SF-6D value set.

Although both EQ-5D and SF-6D measure the same concept of HRQOL and provide HUS, their measurement performance has differed across populations, such as general populations in Asia [16,17,18,19,20,21,22]. For example, a study in the Chinese general population suggested that SF-6DV2 is more sensitive in distinguishing participants with and without chronic diseases [22], while a study in the Thai general population found that EQ-5D-5L was more sensitive in distinguishing participants who differed in gender, age, education level, household income, and number of diseases [17]. These studies generally did not report the order in which the two instruments were administered, which is an important factor influencing the comparison results. In addition, no study has compared their performance in university staff and students.

Recently, an increasing number of studies have begun to measure the HRQOL of university staff and students. These populations are under great pressure from occupational overload and from employment-related academic performance, respectively [23,24,25,26]. This pressure can adversely affect their physical and mental health and, consequently, their HRQOL [23, 25]. Indeed, a study has shown that university staff have lower HRQOL than the general population in China [24]. However, as no measurement tools have been developed specifically for assessing the HRQOL of university staff or students, the EQ-5D or SF-6D is commonly employed for this purpose [27,28,29,30]. Hence, it is important to select the most appropriate measurement tool for the population in question. This study thus aimed to compare the measurement properties of EQ-5D-5L and SF-6DV2 in university staff and students in China, with the order of the two instruments randomly assigned.

Methods

Study design and population

This is a web-based health survey targeted at highly educated populations, i.e., university staff and students currently working or studying at a public university. The questionnaire was distributed through the largest online survey platform in China, Wen Juan Xing (Changsha Ranxing Information Technology Co., Ltd., Hunan, China). Wen Juan Xing, equivalent to Qualtrics, Survey Monkey or Cloud Research, provides online questionnaire design and survey functions for its customers. The study used snowball sampling, starting from a convenience sample composed of colleagues, friends and acquaintances. The questionnaire was then circulated among the study population via WeChat working groups, personal invitations and unofficial announcements by existing respondents. Participation was completely voluntary and no incentives were provided in any form. The study was approved by the IRB committee of the Air Force Medical Center in Beijing (KongTe: NO 2021-169-PJ01).

Data collection

The online questionnaire collected variables on health determinants, including demographic (age, gender, height, weight), lifestyle or behavioral (smoking, drinking) and socioeconomic (education, marital status) factors. Additionally, conditions (diseases, symptoms, discomforts) that can directly influence individual HRQOL were systematically collected. In the online questionnaire, either EQ-5D-5L or SF-6DV2 was randomly assigned to appear first, followed by the other. This was done to avoid an order effect when measuring the same property with different instruments.
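The random ordering of the two instruments can be illustrated with a short sketch. It is not the survey platform's actual mechanism, which is not described here; the function name `assign_order` and the fixed seed are assumptions made purely for illustration.

```python
import random

INSTRUMENTS = ("EQ-5D-5L", "SF-6DV2")


def assign_order(rng: random.Random) -> tuple:
    """Return the two instruments in a random order, each order equally likely."""
    first = rng.choice(INSTRUMENTS)
    second = INSTRUMENTS[1] if first == INSTRUMENTS[0] else INSTRUMENTS[0]
    return first, second


# Hypothetical seed, used only so the illustration is reproducible.
rng = random.Random(2021)
orders = [assign_order(rng) for _ in range(474)]
n_eq_first = sum(1 for first, _ in orders if first == "EQ-5D-5L")
print(f"EQ-5D-5L first: {n_eq_first}; SF-6DV2 first: {len(orders) - n_eq_first}")
```

With simple randomization of this kind the two orders are only approximately balanced; the split actually achieved in the survey is not reported in this section.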

EQ-5D-5L

EQ-5D-5L asks about an individual’s HRQOL on the day of the survey using two parts: a health-state descriptive system and a visual analog scale (EQ-VAS). The descriptive system includes five dimensions: mobility (MO), self-care (SC), usual activities (UA), pain/discomfort (PD), and anxiety/depression (AD). It describes 3125 health states in total, each expressed as a five-digit number that combines the levels of the five dimensions [31]. For example, EQ-5D-5L state “52341” means extreme problems in mobility, slight problems in self-care, moderate problems in usual activities, severe pain/discomfort, and no anxiety/depression. In this study, we used the Chinese EQ-5D-5L value set developed by Luo et al. to calculate the EQ-5D-5L HUS [32] (Table 1). The HUS for EQ-5D-5L state “52341” is 0.248. EQ-VAS is a 20 cm vertical visual scale ranging from 0 (worst imaginable health) to 100 (best imaginable health) that reflects the respondent’s self-rated overall health status [33].
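As an illustration of how a five-digit state maps onto the descriptive system, the following sketch decodes a state string and counts the possible states. The helper name `decode_eq5d5l` is hypothetical, and the value-set coefficients of Luo et al. are not reproduced, so no utility is computed.

```python
EQ5D5L_DIMENSIONS = ("MO", "SC", "UA", "PD", "AD")
LEVEL_LABELS = {
    1: "no problems",
    2: "slight problems",
    3: "moderate problems",
    4: "severe problems",
    5: "extreme problems",
}


def decode_eq5d5l(state: str) -> dict:
    """Map a five-digit EQ-5D-5L state string to {dimension: level}."""
    if len(state) != 5 or any(ch not in "12345" for ch in state):
        raise ValueError(f"not a valid EQ-5D-5L state: {state!r}")
    return dict(zip(EQ5D5L_DIMENSIONS, (int(ch) for ch in state)))


# Five dimensions with five levels each give 5**5 possible states.
n_eq5d5l_states = 5 ** len(EQ5D5L_DIMENSIONS)
print(n_eq5d5l_states)  # 3125

profile = decode_eq5d5l("52341")
for dimension, level in profile.items():
    print(dimension, level, LEVEL_LABELS[level])
```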

Table 1 Characteristics of EQ-5D-5L and SF-6DV2

SF-6DV2

SF-6DV2 assesses the HRQOL of individuals over the last 4 weeks in six dimensions, i.e., physical functioning (PF), role limitation (RL), social functioning (SF), pain (PN), mental health (MH), and vitality (VT), each of which has 5–6 functioning levels. SF-6DV2 can describe 18,750 health states, each of which is indicated by a six-digit number combining the levels of the six dimensions. For example, “312654” means your health limits you a little in moderate activities, you have no problems with your work or other regular daily activities as a result of your physical health or any emotional problems, your health limits your social activities a little of the time, you have extreme pain, you feel tense or downhearted and low all of the time, and you have a lot of energy a little of the time. The SF-6DV2 value set for China developed by Wu et al. was used in the study [15] (Table 1). According to it, the HUS for SF-6DV2 state “312654” is 0.204.
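The state count can be checked with a short sketch. The per-dimension level counts below (six levels for pain, five for each other dimension) are an assumption inferred from the "5–6 functioning levels" statement, the example state and the total of 18,750; the helper name `decode_sf6dv2` is hypothetical.

```python
from math import prod

# Assumed number of levels per dimension, in state-digit order.
SF6DV2_LEVELS = {"PF": 5, "RL": 5, "SF": 5, "PN": 6, "MH": 5, "VT": 5}


def decode_sf6dv2(state: str) -> dict:
    """Map a six-digit SF-6DV2 state string to {dimension: level}."""
    if len(state) != len(SF6DV2_LEVELS) or not state.isdecimal():
        raise ValueError(f"not a valid SF-6DV2 state: {state!r}")
    profile = {}
    for (dimension, max_level), ch in zip(SF6DV2_LEVELS.items(), state):
        level = int(ch)
        if not 1 <= level <= max_level:
            raise ValueError(f"level {level} out of range for {dimension}")
        profile[dimension] = level
    return profile


# The number of states is the product of the level counts.
n_sf6dv2_states = prod(SF6DV2_LEVELS.values())
print(n_sf6dv2_states)  # 18750
print(decode_sf6dv2("312654"))
```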

Statistical analysis

Descriptive statistics were used to describe respondent characteristics, the response distributions across the EQ-5D-5L and SF-6DV2 dimensions, their HUSs, the EQ-VAS score, and the overall ceiling effects (the proportion of respondents reporting no problems in all dimensions). Continuous variables were expressed as mean and standard deviation (SD), and categorical variables as frequency and percentage.
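The overall ceiling effect as defined above can be computed directly from the health-state strings. The sketch below is a minimal illustration with made-up states; the function name `ceiling_effect` is an assumption.

```python
def ceiling_effect(states):
    """Percentage of respondents reporting level 1 (no problems) on every dimension."""
    if not states:
        raise ValueError("no responses supplied")
    at_ceiling = sum(1 for state in states if set(state) == {"1"})
    return 100.0 * at_ceiling / len(states)


# Hypothetical EQ-5D-5L responses, for illustration only.
example_states = ["11111", "11121", "11111", "21231"]
print(ceiling_effect(example_states))  # 50.0
```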

The agreement between EQ-5D-5L and SF-6DV2 HUSs was tested with the intra-class correlation coefficient (ICC), computed with a two-way mixed-effects model based on absolute agreement. The ICC ranges from 0 to 1; values < 0.5, 0.5–0.75, and > 0.75 indicate poor, moderate, and good agreement, respectively [34,35,36]. Bland-Altman plots were also constructed to visually examine the utility differences between the two instruments. Agreement is deemed perfect if the between-instrument differences have a mean of 0 and scatter randomly within 1.96 SD of the mean [37, 38].
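The two agreement statistics can be sketched as follows. The ICC shown is the single-measurement, absolute-agreement form for a two-way model (often written ICC(A,1)); whether the single- or average-measurement form was used in the analysis is not stated, so this choice is an assumption. The function names and the utility values are hypothetical.

```python
from statistics import mean, stdev


def icc_a1(x, y):
    """Two-way, absolute-agreement, single-measurement ICC for two instruments."""
    x, y = list(x), list(y)
    n, k = len(x), 2
    grand = mean(x + y)
    row_means = [(a + b) / 2 for a, b in zip(x, y)]
    col_means = [mean(x), mean(y)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for v in x + y)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-subject mean square
    msc = ss_cols / (k - 1)               # between-instrument mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def interpret_icc(icc):
    """Apply the cut-offs used in the text: <0.5 poor, 0.5-0.75 moderate, >0.75 good."""
    if icc < 0.5:
        return "poor"
    return "moderate" if icc <= 0.75 else "good"


def bland_altman_limits(x, y):
    """Return (mean difference, lower limit, upper limit) using mean +/- 1.96 SD."""
    diffs = [a - b for a, b in zip(x, y)]
    bias, sd = mean(diffs), stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical utilities for five respondents, for illustration only.
eq5d = [0.95, 1.00, 0.88, 0.70, 0.92]
sf6d = [0.80, 0.85, 0.75, 0.72, 0.78]
print(interpret_icc(icc_a1(eq5d, sf6d)))
print(bland_altman_limits(eq5d, sf6d))
```

Because the absolute-agreement ICC penalizes a systematic offset between instruments, a constant difference lowers it even when the two scores rank respondents identically.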

Convergent validity between the similar dimensions of EQ-5D-5L and SF-6DV2 (i.e., MO and SC vs. PF, UA and RL vs. SF, PD vs. PN, AD vs. MH) (Appendix 1) and between their HUSs was evaluated using Spearman’s correlation coefficient (r): > 0.5 (strong correlation), 0.35–0.5 (moderate correlation), 0.20–0.35 (weak correlation), and < 0.20 (poor correlation) [39].
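A minimal sketch of the correlation step is given below: Spearman's coefficient is computed as the Pearson correlation of average ranks (so that tied dimension levels are handled), and the result is classified with the cut-offs above. Classifying on the absolute value of r is an assumption, and the function names and responses are hypothetical.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)  # undefined if either variable is constant


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def correlation_strength(r):
    """Cut-offs from the text, applied to |r| (an assumption)."""
    a = abs(r)
    if a > 0.5:
        return "strong"
    if a >= 0.35:
        return "moderate"
    return "weak" if a >= 0.20 else "poor"


# Hypothetical PD and PN levels for six respondents, for illustration only.
pd_levels = [1, 2, 1, 3, 2, 1]
pn_levels = [1, 3, 2, 4, 2, 1]
r = spearman(pd_levels, pn_levels)
print(round(r, 3), correlation_strength(r))
```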

Known-groups validity of the EQ-5D-5L and SF-6DV2 HUSs was assessed by testing their ability to distinguish subgroups with known differences in health status. To this end, the sample was classified separately according to each self-reported clinical condition, i.e., disease, symptom or discomfort in the past 12 months, injury in the past 12 months, and the number of these three conditions. Respondents with a condition, or with more conditions, were assumed to have worse health status than their respective counterparts. The p-value of the F test in one-way ANOVA was used as the indicator. Sensitivity was compared using relative efficiency (RE), calculated as the ratio of F-statistic values [40]. A higher RE indicates a better ability to detect statistically significant differences between subgroups. In this study, the F-statistic of the EQ-5D-5L HUS was used as the reference to calculate the RE of the SF-6DV2 HUS and the EQ-VAS score. As a result, RE < 1 means that the EQ-5D-5L HUS is more efficient.
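The relative efficiency calculation can be sketched as below, assuming a one-way ANOVA F-statistic for each score and the EQ-5D-5L HUS as the reference. The function names and utility values are hypothetical, and the p-value is omitted because the standard library has no F distribution.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F-statistic for a list of groups of scores."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean([v for g in groups for v in g])
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def relative_efficiency(f_comparator, f_reference):
    """RE > 1: the comparator is more efficient than the reference score."""
    return f_comparator / f_reference


# Hypothetical utilities for respondents with and without a condition.
eq5d_groups = [[0.95, 1.00, 0.92, 0.97], [0.85, 0.90, 0.80, 0.93]]
sf6d_groups = [[0.85, 0.88, 0.80, 0.90], [0.70, 0.75, 0.65, 0.78]]

f_eq5d = anova_f(eq5d_groups)  # reference
f_sf6d = anova_f(sf6d_groups)  # comparator
print(round(relative_efficiency(f_sf6d, f_eq5d), 2))
```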

Data were analyzed using SPSS 26.0 and STATA 17.0 software. All the analyses were two-sided and tested with a significance level of p < 0.05.

Results

Characteristics of the two samples

There were 474 respondents, of whom 291 were university staff. The student sample comprised 99 undergraduates and 84 postgraduates (Table 2). The mean (SD) ages of staff and students were 39 (9.6) years and 25.0 (8.5) years, respectively. The staff sample had slightly more males (55.3%), while the student sample had more females (54.6%). The proportion of smokers was below 15% in both samples. The mean (SD) BMIs were 21.6 (3.4) and 23.6 (4.11) for staff and students, respectively. A bigger proportion of students (54.6%) maintained normal BMI than staff (54.6%). Compared to the students, staff reported higher prevalences of diseases and symptoms/discomfort but a lower prevalence of injuries in the past year. These findings are expected, as staff were older while students were more active and risk-taking.

Table 2 Characteristics of University staff and students (N = 474)

HRQOL profile

As shown in Table 3, EQ-5D-5L was affected by a high ceiling effect, which was 43.3% among staff and 51.4% among students. More than 92% of respondents in both samples reported “no problems” on “Mobility”, “Self-care” and “Usual activities”, as they were generally considered healthy and able to carry out daily tasks. With regard to PD and AD, 41.6% of staff and 30.1% of students reported problems on these two dimensions. Similar response distributions were observed in the two samples, although staff systematically reported more problems than students. Accordingly, the mean HUS of staff was 0.92 (0.11), lower than the 0.95 (0.08) for students, and the mean EQ-VAS score of staff was 77.5 (14.8), also lower than the 84.5 (14.4) for students.

Table 3 Distributions of responses to each of the EQ-5D-5L dimension in the two samples

The response distributions of SF-6DV2 exhibited a different pattern from those of EQ-5D-5L (Table 4). Both university staff and students were found to have more problems by SF-6DV2 than by EQ-5D-5L. Less than half of the staff sample reported “no problems” on each of the six dimensions. The worst was the VT dimension, where only 30 (10.3%) staff members did not feel tired in the past four weeks. Similar to the staff, students also had the most problems in the VT dimension. As with EQ-5D-5L, students showed a better HRQOL profile than staff. With more problems detected by SF-6DV2, the mean SF-6DV2 HUS was 0.76 (0.14) in staff, which was significantly lower, by 0.16, than that derived from EQ-5D-5L. Likewise, the mean HUS of students was significantly lower when measured by SF-6DV2 (0.82) than by EQ-5D-5L (0.95). The ceiling effect of SF-6DV2 was lower than that of EQ-5D-5L: 7.6% vs. 43.3% in staff and 20.2% vs. 51.4% in students.

Table 4 Distributions of responses to each of the SF-6DV2 dimension in the two samples

For the students, the EQ-VAS and EQ-5D-5L HUS scores were severely skewed, whereas the SF-6DV2 HUS was more evenly distributed. All three scores were left-skewed, with skewness of -2.027, -3.035 and -0.359 for EQ-VAS, EQ-5D-5L HUS and SF-6DV2 HUS, respectively. The EQ-5D-5L HUS was concentrated between 0.8 and 1.0, and the EQ-VAS score was mainly concentrated between 80 and 100 (Fig. 1).

Fig. 1 Distribution of EQ-VAS score, EQ-5D-5L, and SF-6DV2 utility scores in the two samples

Agreement between the EQ-5D-5L and SF-6DV2 utility scores

The HUSs of the two instruments were in moderate agreement, with ICCs of 0.567 and 0.553 for the staff and the students, respectively. The agreement displayed by the Bland-Altman plots appeared to confirm this. In both samples, the Bland-Altman analysis showed that over 95% of points were within the limits of agreement (university staff: 99.95%; university students: 99.97%). The HUSs from EQ-5D-5L were generally higher than those from SF-6DV2. However, in respondents with low HUSs (< 0.6), EQ-5D-5L produced lower HUSs than SF-6DV2. This pattern appeared in both samples (Fig. 2).

Fig. 2 Bland-Altman plot of the EQ-5D-5L and SF-6DV2 utility scores in the two samples

Construct validity

According to Table 5, several similar dimensions of EQ-5D-5L and SF-6DV2 failed to show the good convergent validity that was theoretically expected. Specifically, SF-6DV2 PF correlated weakly with the EQ-5D-5L MO and SC dimensions in both samples. However, the EQ-5D-5L PD and AD dimensions showed strong correlations with the similar SF-6DV2 dimensions, PN and MH, respectively. The correlation coefficients were 0.748 and 0.563 among staff, and 0.623 and 0.645 among students. Discriminant validity was suggested by the finding that the SF-6DV2 MH dimension was not significantly correlated with the purely physical constructs of EQ-5D-5L (MO, SC or UA). Notably, the SF-6DV2 RL and VT dimensions tended to have stronger and more significant correlations with the PD and AD dimensions of EQ-5D-5L than with MO, SC or UA. This is consistent with the previous report that SF-6DV2 is more socially oriented whereas EQ-5D-5L is more physically oriented.

Table 5 Correlations of the dimensions of EQ-5D-5L and SF-6DV2 in the two samples

Correlations between the EQ-VAS score and the EQ-5D-5L and SF-6DV2 HUSs are shown in Table 6. For university staff, the coefficients were 0.592 (EQ-VAS and EQ-5D-5L HUS), 0.570 (EQ-VAS and SF-6DV2 HUS), and 0.692 (EQ-5D-5L and SF-6DV2 HUSs), all indicating a strong correlation. For the students, the coefficients were 0.421 (EQ-VAS and EQ-5D-5L HUS), 0.442 (EQ-VAS and SF-6DV2 HUS), and 0.703 (EQ-5D-5L and SF-6DV2 HUSs). Among these pairs, the EQ-5D-5L and SF-6DV2 HUSs had the strongest correlation in both samples.

Table 6 Correlation of EQ-VAS score, EQ-5D-5L and SF-6DV2 utility scores in the two samples

Known-groups validity and sensitivity of the utility scores

The results of known-groups validity and sensitivity for the EQ-VAS score and the EQ-5D-5L and SF-6DV2 HUSs are shown in Table 7. Among university staff, the EQ-VAS score and the EQ-5D-5L and SF-6DV2 HUSs all found significant differences for three known-groups (with and without disease, with and without symptom or discomfort, and number of any of the three conditions). SF-6DV2 HUS was more efficient than EQ-5D-5L HUS and EQ-VAS score in detecting these three conditions (RE > 1 for both). On the other hand, EQ-5D-5L HUS and EQ-VAS score were more sensitive than SF-6DV2 HUS in identifying staff with and without injury. EQ-5D-5L HUS was also more discriminative than EQ-VAS score for two known-groups (with and without symptom or discomfort, and with and without injury). In contrast, EQ-VAS score was more discriminative than EQ-5D-5L HUS for the other two known-groups (with and without disease, and number of any of the three conditions).

Table 7 Known-groups validity and sensitivity of EQ-VAS score, EQ-5D-5L and SF-6DV2 utility scores in the two samples

Among the students, both EQ-5D-5L and SF-6DV2 HUSs detected significant differences in all the known-groups. EQ-5D-5L HUS was more efficient than SF-6DV2 HUS and EQ-VAS score in detecting differences in two known-groups (with and without symptom or discomfort, and number of any of the three conditions) (RE < 1), while SF-6DV2 HUS was more efficient than EQ-5D-5L HUS in detecting disease and injury (RE > 1 for both).

Discussion

The measurement performance of generic preference-based measures (GPBMs) varies considerably across populations, and GPBM instruments are generally not interchangeable. This has necessitated research on the psychometric performance of even widely used GPBMs in specific populations and decision-making settings. This study investigated the psychometric properties of EQ-5D-5L and SF-6DV2 in two samples from the higher-education sector. The results showed that the EQ-5D-5L and SF-6DV2 HUSs had acceptable convergent validity and known-groups validity. Nevertheless, at the dimension level, only EQ-5D-5L (PD and AD) and SF-6DV2 (PN and MH) showed the expected good convergent validity. Although the HUSs of the two questionnaires were in moderate agreement, they were not interchangeable. The SF-6DV2 seems preferable in the study populations, as it displayed a lower ceiling effect and better distributional properties than the EQ-5D-5L.

Similar to previous findings [17, 40], the EQ-5D-5L systematically yielded higher HUS than SF-6DV2 in both staff and students. The HUS differences of 0.17 and 0.13 for staff and students, respectively, were statistically significant. This may suggest that EQ-5D-5L overestimated health status, given that its ceiling effect was 5.7 and 2.54 times that of SF-6DV2 in staff and students, respectively. It further illustrates an important issue: the choice of HRQOL measurement tool could substantially affect decision-making about resource allocation in the context of higher education. Two reasons could account for the differences. First, the EQ-5D-5L utility is determined by self-rated health status on the day of the survey, while the SF-6DV2 covers health status over the past four weeks. Thus, the SF-6DV2 may theoretically capture more health-related problems than the EQ-5D-5L [22]. For example, a respondent could be free of pain/discomfort on a single day but may have experienced it at some time in the past four weeks. Second, the SF-6DV2 HUS includes a unique contribution from the Vitality dimension, which would reflect additional HRQOL impairment.
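The ceiling-effect ratios quoted in this paragraph follow directly from the percentages reported in the Results (43.3% vs. 7.6% in staff; 51.4% vs. 20.2% in students). The short check below reproduces that arithmetic; the variable names are illustrative.

```python
# Ceiling effects (%) as reported in the Results section.
ceiling = {
    "staff": {"EQ-5D-5L": 43.3, "SF-6DV2": 7.6},
    "students": {"EQ-5D-5L": 51.4, "SF-6DV2": 20.2},
}

# Ratio of the EQ-5D-5L ceiling effect to the SF-6DV2 ceiling effect.
ratios = {
    group: values["EQ-5D-5L"] / values["SF-6DV2"]
    for group, values in ceiling.items()
}

for group, ratio in ratios.items():
    print(f"{group}: {ratio:.2f}")
# staff: 5.70
# students: 2.54
```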

The overall agreement of the HUSs between the two instruments was moderate, with ICCs of 0.567 and 0.553. These values are lower than the ICC found in a sample (n = 19,177) drawn from the general population (ICC = 0.75) [15]. Visual inspection of agreement using the Bland-Altman plots (Fig. 2) demonstrated not only the systematically higher utility of EQ-5D-5L relative to SF-6DV2, but also some consistency. These findings are in line with prior results [41, 42]. The plots showed that HUS differences increased in the lower range of HUS and, interestingly, that the EQ-5D-5L produced lower HUS than the SF-6DV2 when the HUS was below a certain threshold. This could be attributed to the difference in the utility scoring functions of the two instruments. That is, the difference between the coefficients of the two functions generally increases with health severity, and the EQ-5D-5L scoring function tends to generate lower HUS for health states with severe or very severe problems (Table 1).

Regarding construct validity, strong correlations were observed between the similar dimensions (PD and PN; AD and MH) of EQ-5D-5L and SF-6DV2. However, the correlations were not as strong for other theoretically similar dimensions (MO/SC with PF in both samples, UA and SF among university staff). This finding is in line with the results of a general population study in China [20] and may be attributed to the different connotations of the similar dimensions. For instance, EQ-5D-5L MO and SC both involve simple activities (walking, bathing or dressing), whereas SF-6DV2 PF includes both high-intensity and moderate-intensity activities (running, lifting a table, etc.). In addition, the EQ-5D-5L puts emphasis on physical functions, while SF-6DV2 is more socially related [5, 43, 44]. In reality, usual activities can be performed without social contact, so the UA dimension of EQ-5D-5L may not be strongly correlated with the SF dimension of SF-6DV2 in our case.

In addition, we found that correlations seemed to occur more frequently in the staff sample than in the student sample. There are two possible reasons. First, university staff have worse health than students, so the two instruments tend to converge in identifying the same HRQOL problems. Second, university students are likely to have a greater variety of daily activities than staff, and therefore PF correlates slightly less with MO and SC in university students than in university staff. The staff-student difference is presumably related to both occupation and age. So far, evidence directly comparing the HRQOL of university staff and students is scarce. However, studies have shown that younger age is associated with better health status [45, 46]. On the other hand, research from an occupational health perspective has indicated suboptimal HRQOL among college teachers [23].

We also found strong correlations among the three overall health indicators, showing convergent validity of the two instruments in our study population. Furthermore, the correlation between the two HUSs was stronger than their correlations with the EQ-VAS score. This may be because both HUSs reflect the health preferences of the general Chinese population. This differs from the result in the Thai general population, where the correlation between EQ-5D-5L HUS and EQ-VAS score was stronger than their correlations with SF-6DV2 HUS [17]. That study used the SF-6D value set from the UK and the EQ-5D value set from Thailand, which may explain the difference in results.

In terms of known-groups validity, the HUSs and the EQ-VAS score discriminated between the majority of groups with known differences in health status, supporting the discriminant validity of the questionnaires. The exception was the injury condition, for which both HUSs and the EQ-VAS score appeared weak in discriminating staff with and without injury. This finding may be attributed to two factors. First, disease and symptom/discomfort co-existed among the staff without injury in the last 12 months (75 university staff). Second, the injuries of university staff may have been minor. Meanwhile, the measures differed in their sensitivity in distinguishing differences in HRQOL between the known-groups. The EQ-5D-5L and SF-6DV2 HUSs were generally better than the EQ-VAS score. One potential reason is that the two HUSs are based on information on five or six health aspects, while the EQ-VAS score reflects the global health of an individual and is insensitive to health impairment in a particular dimension. This is similar to the finding in depressed patients that the SF-6D HUS had better sensitivity than the EQ-VAS score [47]. With regard to the sensitivity of the two HUSs, we found that neither was superior to the other. Previous studies also reported inconsistent findings in general populations in Asia [17, 48]. Apart from differences in study design, population and method (e.g., the order of the two instruments), the finding could also be due to the scoring functions: the EQ-5D-5L function tends to yield lower HUS for severe health states, offsetting the advantage of the SF-6D descriptive system (more dimensions).

The strength of our study is the randomized ordering of the two instruments, which avoids the order effect [49]. Our study also has two limitations. First, it is a cross-sectional study, so test-retest reliability and responsiveness could not be assessed. Second, HRQOL data were collected by having participants complete the paper versions of the instruments online. This practice might have affected the quality of the data. Nevertheless, the participants were highly educated and familiar with internet use, which should support the validity and reliability of the HRQOL data to a large extent.

Conclusion

In conclusion, it appears that both the EQ-5D-5L and SF-6DV2 HUSs have acceptable measurement properties, including convergent validity, known-groups validity and sensitivity, in Chinese university staff and students. However, at the dimension level, only EQ-5D-5L (PD and AD) and SF-6DV2 (PN and MH) demonstrated the anticipated good convergent validity. Future studies are warranted to evaluate other measurement properties of the two instruments in these populations, such as test-retest reliability and responsiveness.