Introduction

Overweight and obesity have become a major global public health issue. Rates of overweight and obesity have increased rapidly in the past four decades [1]. According to WHO statistics for 2016, more than 1.9 billion people aged ≥ 18 years worldwide were overweight, of whom over 650 million were obese [2]. According to the Report on Chinese Residents’ Chronic Diseases and Nutrition 2020, more than half of Chinese adults had either overweight or obesity [1]. Overweight and obesity contributed to 11.1% of deaths associated with noncommunicable diseases (NCDs) worldwide in 2019, a rapid increase from 5.7% in 1990 [1]. These conditions also incur substantial national health expenditure for the management of NCDs and have been shown to negatively impact health-related quality of life (HRQoL) [3].

HRQoL is a multidimensional concept, used extensively worldwide, that assesses an individual’s health status based on physical, mental, and social functioning [4]. The European Medicines Agency [5] and the US Food and Drug Administration [6] have emphasized the importance of measuring HRQoL, which is considered an important piece of evidence to inform drug coverage or reimbursement decisions in many countries [7, 8]. HRQoL measures can be categorized as either non–preference-based or preference-based measures [9, 10]. Preference-based HRQoL measures can be used to elicit health state utility values (HSUVs), which take into account the general population’s preferences for different health states and lie on the 0 to 1 (death to full health) scale used to calculate quality-adjusted life-years (QALYs) [11].

Currently, the EQ-5D and the Short Form Six-Dimension (SF-6D) are the two most widely used generic preference-based measures (GPBMs) [12] and are recommended as the standard measures for health technology assessment in many countries [13,14,15]. The measurement properties of the EQ-5D and SF-6D have been evaluated in the general population as well as in patients with various types of diseases [16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]. These studies concluded that the EQ-5D and SF-6D were generally reliable, valid, and sensitive in measuring HSUVs in various disease populations. However, most of these studies have not compared test-retest reliability, an important psychometric property of GPBMs. More importantly, evidence on the measurement properties of GPBMs in overweight and obese populations is still lacking worldwide. To the best of our knowledge, no studies have evaluated and compared the measurement properties of the EQ-5D-5L and SF-6Dv2 among overweight and obese populations.

This study aimed to assess and compare the measurement properties of the EQ-5D-5L and SF-6Dv2 in Chinese overweight and obese populations.

Methods

Data source

The data used for this analysis were obtained from a nationwide online survey (January to February 2022) investigating the health status of people living with overweight or obesity in China. Respondents were recruited through a professional online panel company. Inclusion criteria were that respondents (1) were 18 years or older; (2) were overweight (24 ≤ BMI < 28) or obese (BMI ≥ 28) according to the criteria of overweight and obesity for the Chinese population [33]; (3) were literate and able to read text from a computer or mobile screen, and had no disease limiting cognitive function such as dementia; and (4) gave informed consent. A quota sampling method was used to recruit a representative sample of the overweight and obese populations in terms of BMI, age, gender, and area of residence (North, Northeast, East, Central, South, Southwest, Northwest) [34].
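The BMI-based part of the inclusion criteria can be written as a small screening rule. The sketch below is illustrative only: the function names `classify_bmi` and `is_eligible` and the label "not eligible" are hypothetical, and only the cut-offs (24 and 28) and the age limit (18 years) come from the criteria stated above.

```python
def classify_bmi(bmi: float) -> str:
    """Classify BMI with the Chinese cut-offs quoted in the inclusion criteria."""
    if bmi >= 28:
        return "obese"
    if bmi >= 24:
        return "overweight"
    # Below 24 the respondent falls outside the target population.
    return "not eligible"


def is_eligible(age: int, bmi: float, literate: bool, consented: bool) -> bool:
    """Combine the four inclusion criteria (cognitive screening is folded into `literate`)."""
    return age >= 18 and classify_bmi(bmi) != "not eligible" and literate and consented


print(classify_bmi(27.4))                  # overweight
print(is_eligible(51, 27.4, True, True))   # True
```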

All eligible respondents (target N = 1,000) were invited to complete a self-reported online survey by computer or mobile phone. The survey collected sociodemographic information (ethnicity, education level, marital status, employment status, personal monthly income, and health insurance coverage); health-related information (self-reported health status on a five-level scale [very good, good, fair, bad, very bad], presence of chronic diseases, smoking and alcohol consumption status, fruit and vegetable intake, high-fat and high-sugar food intake, and weekly exercise time); and self-reported responses to the EQ-5D-5L and SF-6Dv2. The order of the EQ-5D-5L and SF-6Dv2 was randomized.

A subset of respondents (target N = 150) was recruited to assess the test-retest reliability of both instruments. After the first survey (test), the interviewers randomly asked respondents for consent to be interviewed online again (retest) and collected their contact information. The interval between the test and retest was set at two weeks [35, 36]. In the retest interview, respondents completed the same process as in the first interview. During the retest interview, respondents were asked “Have there been any changes in your health status compared with the last interview?” and answered on a 5-level Likert scale (“no change”, “slightly change”, “some change”, “much change”, or “extremely change”). Respondents who reported “no change” or “slightly change” were regarded as having relatively stable health over the two tests and were included in the data analysis [37, 38].

Measures

The EQ-5D-5L descriptive system measures health along five dimensions: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. Each dimension is assessed by a single question on a five-point ordinal scale from no problems to extreme problems [39]. The other part of the EQ-5D-5L is a visual analog scale (hereafter EQ VAS), a vertical line with endpoints of “worst imaginable health” at 0 and “best imaginable health” at 100. The EQ-5D-5L defines 3,125 unique health states, with 11111 being the best health state (full health) and 55555 the worst health state. The time trade-off (TTO) approach was used to develop the Chinese EQ-5D-5L utility value set, with utility values ranging from −0.391 (55555) to 1 (11111) [40].

The SF-6Dv2 is a revised version of the SF-6Dv1 and is derived from 10 items of the SF-36v2. The SF-6Dv2 health state classification system measures health on six dimensions: physical functioning, role limitation, social functioning, pain, mental health, and vitality. The pain dimension has six response levels, while all others have five levels. Overall, the SF-6Dv2 descriptive system can define 18,750 (= 5*5*5*6*5*5) unique health states [41]. The Chinese SF-6Dv2 value set was developed using the TTO approach, with utility values ranging from −0.277 (555655) to 1 (111111) [42].
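The number of health states in each descriptive system follows directly from the number of response levels per dimension. The short sketch below reproduces that arithmetic; the dictionary and function names are illustrative, and the dimension order is assumed to match the order in which the dimensions are listed above.

```python
from math import prod

# Response levels per dimension, in the order listed in the text.
EQ5D5L_LEVELS = {
    "mobility": 5, "self-care": 5, "usual activities": 5,
    "pain/discomfort": 5, "anxiety/depression": 5,
}
SF6DV2_LEVELS = {
    "physical functioning": 5, "role limitation": 5, "social functioning": 5,
    "pain": 6, "mental health": 5, "vitality": 5,
}


def n_states(levels: dict) -> int:
    """Number of unique health states = product of the levels per dimension."""
    return prod(levels.values())


def best_state(levels: dict) -> str:
    """Best state: level 1 on every dimension."""
    return "1" * len(levels)


def worst_state(levels: dict) -> str:
    """Worst state: the highest level on every dimension."""
    return "".join(str(n) for n in levels.values())


print(n_states(EQ5D5L_LEVELS), worst_state(EQ5D5L_LEVELS))  # 3125 55555
print(n_states(SF6DV2_LEVELS), worst_state(SF6DV2_LEVELS))  # 18750 555655
```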

Both validated Chinese versions of EQ-5D-5L and SF-6Dv2 were used in this study [32, 37].

Statistical analysis

Descriptive statistics

Descriptive statistics were used to describe the characteristics of respondents, and utility values of the two instruments. The differences between test and retest respondents’ characteristics were tested using the ANOVA for continuous variables and chi-squared test for categorical variables and presented within tables. The distribution of response levels on each dimension of the EQ-5D-5L and SF-6Dv2 was reported using histograms.

Agreement

The intraclass correlation coefficient (ICC) was used to investigate the agreement between EQ-5D-5L and SF-6Dv2. The ICC was computed with the two-way mixed-effects model based on absolute agreement [43]. An ICC above 0.7 suggests an acceptable agreement [44]. Besides, because the utility value distributions were highly skewed, the Wilcoxon signed-rank test was used to compare the utility values of the EQ-5D-5L and SF-6Dv2 [45].
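For readers unfamiliar with the absolute-agreement ICC, the sketch below computes it from the two-way ANOVA mean squares. It assumes the single-measurement form, ICC(A,1) = (MSR − MSE) / (MSR + (k − 1)MSE + k(MSC − MSE)/n); the text does not state whether single or average measures were used, and the analysis itself was run in STATA, so this is an illustration rather than the authors' code. The function name and the example data are hypothetical.

```python
def icc_absolute_agreement(ratings):
    """ICC(A,1) for an n-by-k table: n respondents (rows), k measures (columns)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares (one observation per cell).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-respondent mean square
    msc = ss_cols / (k - 1)               # between-measure mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical utilities: column 0 = one instrument, column 1 = the other.
# A constant offset of 0.1 lowers absolute agreement even though the
# ranking of respondents is identical.
pairs = [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
print(round(icc_absolute_agreement(pairs), 3))  # 0.889
```

Because absolute agreement penalizes systematic differences between the two columns, a measure that is consistently lower than the other yields an ICC below 1 even when the two are perfectly correlated.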

Measurement properties of the EQ-5D-5L and SF-6Dv2

We focused on the aspects of ceiling and floor effects, convergent validity, known-group validity, test-retest reliability, and sensitivity that are important for assessing the performance of measurement properties of the preference-based measures.

Ceiling and floor effects. We evaluated ceiling and floor effects for the EQ-5D-5L and SF-6Dv2 by examining the percentage of respondents who reported the best and worst possible health states, respectively. Ceiling or floor effects were considered to be present if more than 15% of the respondents achieved either extreme end of the scale [46].
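The ceiling and floor check reduces to counting respondents in the best and worst health states and comparing the shares with the 15% threshold. A minimal sketch, assuming health states are stored as digit strings such as "11111"; the function names and the sample data are hypothetical.

```python
def extreme_state_shares(states, worst):
    """Return (ceiling share, floor share) for a list of health-state strings."""
    best = "1" * len(worst)  # best state: level 1 on every dimension
    n = len(states)
    ceiling = sum(s == best for s in states) / n
    floor = sum(s == worst for s in states) / n
    return ceiling, floor


def effect_present(share, threshold=0.15):
    """An effect is flagged when more than 15% of respondents sit at one extreme."""
    return share > threshold


# Hypothetical sample of 1,000 EQ-5D-5L states, 306 of them in full health.
sample = ["11111"] * 306 + ["12121"] * 694
ceiling, floor = extreme_state_shares(sample, worst="55555")
print(ceiling, effect_present(ceiling))  # 0.306 True
print(floor, effect_present(floor))      # 0.0 False
```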

Convergent validity. Convergent validity was assessed by calculating Spearman’s rank coefficient (r) between the EQ-5D-5L and SF-6Dv2 dimensions. An absolute coefficient value greater than 0.5 stands for a strong correlation, values between 0.35 and 0.49 for moderate, values between 0.2 and 0.34 for weak, and values smaller than 0.2 for poor correlation [17, 32, 47].
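The following sketch shows one way to compute Spearman's rank coefficient (Pearson correlation of average ranks, which handles tied dimension levels) and to label its strength. The stated bands leave small gaps (for example between 0.49 and 0.5), so the cut-offs below are treated as contiguous; that choice, the function names, and the sample responses are assumptions made for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank coefficient as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


def correlation_strength(r):
    """Label |r| using the bands in the text, treated as contiguous cut-offs."""
    a = abs(r)
    if a >= 0.5:
        return "strong"
    if a >= 0.35:
        return "moderate"
    if a >= 0.2:
        return "weak"
    return "poor"


# Hypothetical dimension levels for six respondents on two pain items.
eq_pain = [1, 2, 2, 3, 4, 5]
sf_pain = [1, 2, 3, 3, 5, 6]
r = spearman(eq_pain, sf_pain)
print(correlation_strength(r))  # strong
```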

Known-group validity. Known-group validity was used to assess the extent to which an outcome measure of interest distinguishes between sub-groups that are theoretically expected to differ [20, 32]. Based on the published literature [32, 45, 48], it was hypothesized that obese respondents, as well as respondents with poorer self-reported health status and more chronic diseases, would have lower utility values. One-way analysis of variance (ANOVA) and the Scheffe post hoc test were used to analyze possible differences in utility values of the EQ-5D-5L and SF-6Dv2 across different sub-groups. In addition, effect sizes (ES) were used to quantify the discriminative capacity of the EQ-5D-5L and SF-6Dv2; ES was calculated as the difference between the mean utility of two sub-groups divided by the pooled standard deviation. For polytomous variables, the ES between the extreme sub-groups (e.g., the ES between the sub-group with no chronic disease and the sub-group with ≥ 4 chronic diseases) was calculated [32, 48]. Generally, an ES value of 0.20 is defined as small, 0.50 as medium, and 0.80 as large.
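The effect size described here can be sketched as follows. The pooled standard deviation is assumed to take the usual degrees-of-freedom-weighted form, sqrt(((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)), which the text does not spell out; the function names, the "negligible" label for values below 0.20, and the sample utilities are illustrative.

```python
from math import sqrt
from statistics import mean, variance


def effect_size(group_a, group_b):
    """Mean difference divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def effect_size_label(es):
    """Conventional labels: 0.20 small, 0.50 medium, 0.80 large."""
    a = abs(es)
    if a >= 0.8:
        return "large"
    if a >= 0.5:
        return "medium"
    if a >= 0.2:
        return "small"
    return "negligible"  # label below 0.20 is an assumption, not from the text


# Hypothetical utilities for two extreme sub-groups.
no_chronic_disease = [0.9, 1.0, 0.8]
four_or_more = [0.6, 0.7, 0.5]
es = effect_size(no_chronic_disease, four_or_more)
print(round(es, 2), effect_size_label(es))  # 3.0 large
```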

Test-retest reliability. The test-retest reliability of the EQ-5D-5L and SF-6Dv2 was evaluated from the test and retest data using the intraclass correlation coefficient (ICC), which was computed with the two-way mixed-effects model based on absolute agreement. An ICC value above 0.7 was considered to indicate satisfactory reliability [49].

Sensitivity. The relative efficiency (RE) statistic was used to assess the sensitivity of the EQ-5D-5L and SF-6Dv2 for detecting differences in both external and self-reported health indicators. RE was calculated as the ratio of the squared t-statistic from the t-test of the comparator measure (SF-6Dv2) to that of the reference measure (EQ-5D-5L) [50, 51]. An RE value of 1.0 indicates that the SF-6Dv2 has the same efficiency as the EQ-5D-5L at detecting differences. A value higher than 1 indicates that the SF-6Dv2 is more sensitive than the EQ-5D-5L, while a value lower than 1 means the opposite [52]. The sensitivity of the two measures was also assessed using the receiver operating characteristic (ROC) curve [53]. To compare the discriminative power of the EQ-5D-5L and SF-6Dv2, the area under the ROC curve (AUC) was calculated [54]. The measure with the larger AUC is considered more sensitive or effective at detecting differences; a measure with perfect discriminative ability would have an AUC of 1.0, whereas a measure with no discriminative capacity would have an AUC of 0.5 [52]. The presence of representative chronic diseases among overweight and obese populations, namely hyperlipidemia, hypertension, and diabetes, was used as the external health indicator in the current study [55, 56]. The respondents’ self-reported health status was dichotomized in three ways: (1) excellent versus good, fair, or bad, (2) excellent or good versus fair or bad, and (3) excellent, good, or fair versus bad.
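Both sensitivity statistics can be illustrated in a few lines. The sketch below assumes a pooled-variance (Student) two-sample t-test, since the variant of the t-test is not stated, and computes the AUC as the share of (healthier, less healthy) respondent pairs in which the healthier respondent has the higher utility, counting ties as one half. Function names and sample utilities are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(group_a, group_b):
    """Two-sample t-statistic with pooled variance (assumed variant)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / na + 1 / nb))


def relative_efficiency(sf_a, sf_b, eq_a, eq_b):
    """RE = t^2 of the comparator (SF-6Dv2) over t^2 of the reference (EQ-5D-5L)."""
    return t_statistic(sf_a, sf_b) ** 2 / t_statistic(eq_a, eq_b) ** 2


def auc(healthier, less_healthy):
    """Probability that a healthier respondent has the higher utility (ties = 0.5)."""
    wins = sum(
        1.0 if h > u else 0.5 if h == u else 0.0
        for h in healthier
        for u in less_healthy
    )
    return wins / (len(healthier) * len(less_healthy))


# Hypothetical utilities for respondents without / with a chronic disease.
eq_without, eq_with = [0.9, 1.0, 0.8], [0.6, 0.7, 0.5]
sf_without, sf_with = [0.9, 1.0, 0.8], [0.3, 0.4, 0.2]
print(round(relative_efficiency(sf_without, sf_with, eq_without, eq_with), 2))  # 4.0
print(auc(eq_without, eq_with))  # 1.0
```

Under this definition, an RE of 1.261 would correspond to the comparator being 26.1% more efficient than the reference, since the percentage difference is (RE − 1) × 100.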

STATA 15.0 was used for the statistical analyses (StataCorp LLC, College Station, TX, USA). All statistical tests reported were two-sided with a significance level of 0.05.

Results

Patient characteristics

A total of 9,085 potential respondents were contacted in the first round of the survey (according to geographical region, gender, and age quotas), of whom 8,259 agreed to participate (response rate 90.9%). Among them, 7,088 respondents were excluded because they did not meet the BMI quota requirements (not overweight/obese [5,911] or the quota was full [1,177]), and 171 respondents voluntarily withdrew while filling in the questionnaire. Finally, a total of 1,000 respondents with valid data were included in this study.

As shown in Table 1, 52.0% (N = 520) of respondents were male, the mean (SD) age was 51.7 (15.3) years, with a range from 18 to 80 years, and 29.3% (N = 293) of respondents were more than 65 years old. The mean (SD) BMI of respondents was 27.4 (2.8); 67.7% (N = 677) were overweight (24 ≤ BMI < 28) and 32.3% (N = 323) were obese (BMI ≥ 28). Hyperlipidemia, hypertension, and diabetes were reported by 32.7% (N = 327), 29.2% (N = 292), and 8.9% (N = 89) of respondents, respectively.

Table 1 Characteristics of respondents

The distribution of the responses to the EQ-5D-5L and SF-6Dv2 is presented in Fig. 1. For the EQ-5D-5L, 30.6% of respondents reported full health, which indicated a substantial ceiling effect, while for the SF-6Dv2 no ceiling effect was observed, with 2.1% of respondents reporting no problems on all dimensions. No respondent reported the worst health state on either measure.

Fig. 1a Distribution across levels of the EQ-5D-5L dimensions

Fig. 1b Distribution across levels of the SF-6Dv2 dimensions

The mean (SD) EQ-5D-5L utility value in the total sample was 0.851 (0.195), ranging from −0.184 to 1, and the mean (SD) SF-6Dv2 utility value was 0.734 (0.164), ranging from −0.179 to 1. For overweight respondents (24 ≤ BMI < 28), the mean EQ-5D-5L utility was 0.880 and the mean SF-6Dv2 utility was 0.754; for obese respondents (BMI ≥ 28), the mean EQ-5D-5L utility was 0.789 and the mean SF-6Dv2 utility was 0.694.

Agreement

The ICC between the EQ-5D-5L and SF-6Dv2 utility values of the total sample was 0.639 (p < 0.001). Besides, the SF-6Dv2 utility values were significantly lower than those of the EQ-5D-5L (p < 0.001).

Measurement properties of the EQ‑5D‑5L and SF‑6Dv2

Ceiling and floor effects. A ceiling effect was found for the EQ-5D-5L, with 30.6% (N = 306) of respondents reporting the best health state, while no floor effect was observed. No ceiling or floor effects were observed for the SF-6Dv2.

Convergent validity. Most of the dimensions of the EQ-5D-5L and SF-6Dv2 were positively associated, with Spearman’s rank correlation coefficients ranging from 0.186 to 0.739 (p < 0.001). As expected, the EQ-5D-5L pain/discomfort dimension was strongly correlated with the SF-6Dv2 pain dimension (r = 0.739), and the EQ-5D-5L anxiety/depression dimension was strongly correlated with the SF-6Dv2 mental health dimension (r = 0.686). The correlations between the SF-6Dv2 vitality dimension and all dimensions of the EQ-5D-5L were weak (Table 2).

Table 2 Correlations between EQ-5D-5L and SF-6Dv2 (N = 1,000)

Known-group validity. As reported in Table 3, both the EQ-5D-5L and SF-6Dv2 utility values were significantly different (p < 0.001) across groups defined by BMI, health status, and number of chronic diseases, with ES ranging from 0.517 to 1.885 for the EQ-5D-5L, and 0.383–2.329 for the SF-6Dv2. The hypotheses for known-group validity were fulfilled in all tested groups, that is, the obese respondents, as well as respondents with poorer self-reported health status and more chronic diseases, had lower utility values.

Table 3 Discriminative capacity and univariate analyses for EQ-5D-5L and SF-6Dv2 utility among different sub-groups (N = 1,000)

Test-retest reliability. Among 227 respondents who were invited to attend the retest interview, 220 accepted the invitation (response rate 96.9%). A total of 150 respondents who reported “no change” or “slightly change” in their health status compared with the last interview provided valid test-retest data. As shown in Table 1, the majority of these respondents were male (56.7%), with a mean (SD) age of 50.6 (15.1) years. Except for marital status, no significant difference was observed in basic characteristics between the 150 respondents and the total sample. Both instruments showed good test-retest reliability. For the EQ-5D-5L, the overall ICC was 0.939 (95% CI 0.917, 0.955); the ICC was 0.933 (95% CI 0.903, 0.954) for overweight respondents and 0.941 (95% CI 0.890, 0.969) for obese respondents. For the SF-6Dv2, the overall ICC was 0.972 (95% CI 0.962, 0.980); the ICC was 0.980 (95% CI 0.971, 0.986) for overweight respondents and 0.954 (95% CI 0.916, 0.975) for obese respondents.

Sensitivity. As shown in Table 4, the SF-6Dv2 was 3.7% to 170.1% more efficient than the EQ-5D-5L at revealing differences between self-reported health status groups dichotomized by “excellent”, “good” or “bad”. The SF-6Dv2 was also found to be 26.1% and 44.7% more efficient than the EQ-5D-5L at detecting differences between groups defined by the external health indicators hyperlipidemia and hypertension, respectively. However, when the groups were dichotomized by “diabetes” and “non-diabetes”, the EQ-5D-5L was found to be 16.6% more efficient at detecting differences (Table 5). The AUC values of both the SF-6Dv2 and EQ-5D-5L were above 0.5 and statistically significant (p < 0.001) (Tables 4 and 5). The SF-6Dv2 generated higher AUC scores than the EQ-5D-5L, indicating possibly superior sensitivity.

Table 4 Sensitivity of EQ-5D-5L and SF-6Dv2 to detect differences in different self-reported health status groups (N = 1,000)
Table 5 Sensitivity of EQ-5D-5L and SF-6Dv2 to detect differences in different chronic diseases groups (N = 1,000)

Discussion

To the best of our knowledge, this study provides the first evidence comparing the measurement properties of the EQ-5D-5L and SF-6Dv2 in a large sample of the Chinese overweight and obese populations. It could help medical or public health professionals and regulators understand and select the appropriate measure when making decisions about clinical interventions and policies for overweight and obesity.

The EQ-5D-5L showed a higher ceiling effect than the SF-6Dv2 in this study (30.6% vs. 2.1%), which is consistent with previous studies in which the EQ-5D-5L and SF-6D were compared in both general and disease populations [18, 32, 57, 58]. This can be partly explained by the difference in the recall period, as the SF-6D frames its questions in terms of health “over the last 4 weeks”, while “today” is used in the EQ-5D. A longer recall period may give respondents more scope to include minor impairments affecting their HRQoL that might not be detected over a relatively short period [59]. Another explanation might lie in the dimensions and items measured by each instrument [32, 37].

Both the EQ-5D-5L and SF-6Dv2 were found to have acceptable reliability and internal consistency. The SF-6Dv2 (ICC = 0.972) performed better than the EQ-5D-5L (ICC = 0.939) in terms of test-retest reliability, implying that the SF-6Dv2 is able to produce reproducible results when the instrument is used repeatedly within a short period of time. This finding appears to be consistent with one previous study [60]. Regarding convergent validity, as expected, only the EQ-5D-5L pain/discomfort and anxiety/depression dimensions were strongly correlated with the SF-6Dv2 pain and mental health dimensions, respectively. The correlations between the SF-6Dv2 vitality dimension and all dimensions of the EQ-5D-5L were weak. A possible reason is that four of the five EQ-5D-5L items assess physical health, whereas the SF-6D consists of a balanced number of physical and mental items. Our findings are consistent with previous studies [28, 61], implying that the EQ-5D-5L is more appropriate for patients with physical problems than for those with mental or psychological problems.

Known-group validity analysis indicated that both the EQ-5D-5L and SF-6Dv2 were able to discriminate, as expected, between populations with different levels of self-reported health status and different numbers of chronic diseases. These differences tended to be more apparent for the SF-6Dv2, which had larger effect sizes (ES = 1.717–1.885 for the EQ-5D-5L and 2.076–2.329 for the SF-6Dv2). One possible reason is that the SF-6Dv2 has one more dimension, resulting in a larger descriptive system than the EQ-5D-5L (18,750 vs. 3,125 health states). This result is consistent with one previous study, which found that the SF-6D in general showed better sensitivity and construct validity than the EQ-5D-5L in seven diseases [62]. Moreover, although the hypotheses for known-group validity were fulfilled in all tested groups, this study found that neither instrument was sensitive enough (ES < 0.8) to differentiate between overweight and obese respondents. This may be because GPBMs can be insensitive to disease-specific differences [63]. More evidence is warranted to assess the use of GPBMs among overweight and obese populations.

RE and ROC analyses showed that the SF-6Dv2 was more efficient at detecting differences between self-reported health status groups and between the hyperlipidemia and hypertension groups, while the EQ-5D-5L was more efficient than the SF-6Dv2 only at detecting differences between the diabetes and non-diabetes groups. The AUC of the SF-6Dv2 (0.775–0.881) was higher than that of the EQ-5D-5L (0.754–0.862) in all tested groups. Possible reasons for this include the differences in the recall period and in the number of dimensions between the two instruments. These findings are consistent with previous studies that compared the SF-6Dv1 or SF-6Dv2 with the EQ-5D-5L in the general population and in patients with other types of diseases, and concluded that both instruments are sensitive to differences between groups [30, 32, 64].

Several limitations of this study should be acknowledged. First, we focused only on adults and did not include adolescents, among whom the prevalence of overweight and obesity is high; this may affect how well the sample represents the overweight and obese populations in China. Second, an online survey was used in this study, which may affect the quality of the collected data. This concern was addressed by monitoring the IP addresses and response times of respondents to ensure the authenticity and validity of the collected data. Third, although we conducted the test-retest analysis using longitudinal data, the follow-up duration was too short to evaluate and compare the responsiveness of the EQ-5D-5L and SF-6Dv2. Further research is warranted to compare responsiveness. In addition, in order to reach a satisfactory sample size, respondents who reported “no change” or “slightly change” were regarded as having relatively stable health over the two tests and were included in the data analysis. This may have affected the test-retest reliability analysis.

Conclusions

Both the EQ-5D-5L and SF-6Dv2 are psychometrically sound instruments with satisfactory validity, reliability, and sensitivity in measuring the HRQoL of Chinese overweight and obese populations. However, the two measures cannot generally be used interchangeably, given that the ICC between the SF-6Dv2 and EQ-5D-5L is moderate and the utility values obtained from the two measures are systematically different.