A total of 2216 potentially relevant articles were identified after removing duplicates. Title and abstract screening excluded 1661 and 465 records, respectively, and full text screening excluded an additional 63. Online search and reference screening found 3 reviews that had not been detected by database searches. Consequently, 30 reviews were included [17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46]. Figure 1 summarizes the selection process. A list of included and excluded reviews is provided in Appendix Tables 4 and 5.
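The screening counts above can be reconciled with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers reported in this paragraph; the variable names are illustrative, and the intermediate totals (records assessed at full text, reviews retained from database searches) are implied by the text rather than stated in it.

```python
# Counts taken from the paragraph above; variable names are illustrative.
identified = 2216          # records after duplicate removal
excluded_title = 1661      # excluded at title screening
excluded_abstract = 465    # excluded at abstract screening
excluded_full_text = 63    # excluded at full-text screening
added_other_sources = 3    # found by online search and reference screening

# Implied intermediate totals (not stated explicitly in the text).
full_text_assessed = identified - excluded_title - excluded_abstract
from_databases = full_text_assessed - excluded_full_text
included = from_databases + added_other_sources

print(full_text_assessed, from_databases, included)  # 90 27 30
```

The final figure matches the 30 included reviews reported above.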
Characteristics of the included reviews
The number of studies included in the reviews varied widely, from five [38] to 122 [39]. Most reviews included a mix of randomized clinical trials (RCTs), cross-sectional, cohort and longitudinal studies, or a mix of other experimental and/or observational designs; the exceptions were Devine et al. [38], which focused on longitudinal studies, and Holloway et al. [45], which focused on RCTs. One review, by Bansback et al. [43], included only economic evaluations. Table 1 summarizes the main characteristics of the included reviews.
Table 1 Characteristics of the included reviews
Quality of included reviews
Two reviews [24, 40] received an assessment of excellent quality and 14 of good quality [17,18,19,20,21, 23, 25,26,27,28, 30, 32, 33, 36]. The remaining 14 reviews received a poor quality assessment [22, 29, 31, 34, 35, 37,38,39, 41,42,43,44,45,46]. The main reason for poor quality was that reviews did not assess the quality of the included papers themselves and, consequently, did not consider scientific quality appropriately in drawing conclusions. Five reviews received a modified AMSTAR score below 3 [29, 37, 42, 44, 46]. Four of them reported a literature search that was not considered comprehensive (i.e. search terms were not derived with attention to synonyms, acronyms and related terms for the building blocks of the research question) [29, 37, 42, 44], and none of the five performed a double-blind study selection.
Breadth and depth of the evidence
Twenty-nine reviews reported information for the EQ-5D, twelve for the SF-6D, eight for the HUI3, two for the 15D and three for the AQoL 8 dimensions.
EQ-5D psychometric characteristics were presented for conditions across 16 ICD classes of disease codes (Table 2). Two reviews reported EQ-5D characteristics in a class not specified (i.e. aesthetic surgery in Ching [44] and older population in Haywood [36]). SF-6D psychometric performance was reported for conditions related to 9 classes of disease, HUI3 for 7 classes, and the 15D and AQoL for only 2 classes of disease.
Table 2 Main EQ-5D, SF-6D and HUI3 results
The amount of evidence in relation to the psychometric assessment of validity and responsiveness within conditions varied substantially, with some reviews reporting multiple psychometric analysis results and others focusing on a single type of assessment. Overall there was much less evidence available for measures other than the EQ-5D.
Type of evidence
Known groups testing
Of the 180 studies included in the systematic reviews that reported known groups validity, 77 used comparisons based on severity traits although two studies did not use all the potential severity levels [29, 34]. For the other studies comparisons were based on patients versus general population (44 studies), different types of diseases or disorders (15 studies), groups defined by an HRQoL instrument (7 studies), numbers of diseases/disorders (4 studies) and patients with or without complications (3 studies). Comparisons were also based on other groups such as discharged and not discharged patients (21 studies). Nine studies used groups that were considered inappropriate for testing GPBM validity, like age, education, different country cohorts and income. Most studies assessed known groups based on utility scores, but seven reviews [21, 24,25,26, 28, 30, 32] reported results for unscored dimensions of the instruments.
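The breakdown of comparison groups can be tallied to confirm that the categories account for all 180 studies and to express each as a share of the total. This is a minimal sketch: the counts come from the paragraph above, while the dictionary keys are shortened labels chosen here for illustration.

```python
# Study counts per known-groups comparison, as reported in the text.
# Keys are shortened, illustrative labels.
known_groups = {
    "severity": 77,
    "patients vs general population": 44,
    "types of disease or disorder": 15,
    "groups defined by an HRQoL instrument": 7,
    "number of diseases/disorders": 4,
    "with vs without complications": 3,
    "other groups (e.g. discharged vs not discharged)": 21,
    "groups considered inappropriate": 9,
}

total = sum(known_groups.values())

# Percentage share of each comparison type, to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in known_groups.items()}

print(total)               # 180
print(shares["severity"])  # 42.8
```

On these figures, severity-based comparisons make up just over two fifths of the known groups evidence.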
Convergent validity
Correlations with other measures were reported in 135 studies, of which 38 used a non-preference-based HRQoL measure, 32 a direct utility measure (e.g. TTO), 27 a symptom or severity measure, 18 a functional status measure and 9 another GPBM; 14 did not specify the measure used.
Responsiveness
Reviews reported 172 studies on GPBM responsiveness, most of which (n = 124) were based on comparing patients before and after a successful treatment, with 112 of these reporting statistically significant differences, 8 reporting SESs, 2 reporting SRMs and 2 not reporting the method employed. Comparisons were also based on patient groups receiving different treatments (n = 38; 32 reporting statistical significance and 6 reporting SESs), and patients reporting an improved health state (n = 6; 3 reporting SESs and 3 reporting SRMs). Four did not specify the groups used, but reported SRMs.
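The nested counts in this paragraph (comparison design, then reporting method) can be cross-checked and regrouped by method. The sketch below uses only the figures reported above; the labels are illustrative shorthand, and the per-method totals are derived here rather than stated in the text.

```python
from collections import Counter

# Responsiveness studies by comparison design and reporting method,
# as reported in the text. Labels are illustrative shorthand.
responsiveness = {
    "before vs after treatment": {
        "statistical significance": 112, "SES": 8, "SRM": 2, "method not reported": 2,
    },
    "different treatment groups": {"statistical significance": 32, "SES": 6},
    "patients reporting improvement": {"SES": 3, "SRM": 3},
    "groups not specified": {"SRM": 4},
}

# Subtotal per design and overall total.
subtotals = {design: sum(m.values()) for design, m in responsiveness.items()}
total = sum(subtotals.values())

# Regroup the same counts by reporting method (derived, not stated in the text).
by_method = Counter()
for methods in responsiveness.values():
    by_method.update(methods)

print(subtotals["before vs after treatment"], total)  # 124 172
print(by_method["SES"], by_method["SRM"])             # 17 9
```

Regrouped this way, tests of statistical significance account for 144 of the 172 studies, with standardized effect measures (SES or SRM) reported in 26.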
Performance of instruments by condition
The overwhelming majority of evidence in type 1 [23] and type 2 [23, 42] diabetes mellitus showed that the EQ-5D discriminated well between severity groups, correlated moderately to strongly with other HRQoL instruments and registered changes after treatment that were consistent with expectations. Little evidence was found for the SF-6D, and this was mixed.
The review on diseases of the skin and subcutaneous tissues [28] (including psoriasis, acne, eczema and leg ulcers) presented results supporting EQ-5D validity and responsiveness. Only 2 of 27 studies reported evidence against the measure, in the form of weak correlations and lower SRMs for the EQ-5D than for other measures.
Two systematic reviews investigated COPD and asthma [29, 31], suggesting that the EQ-5D is generally valid based on known group comparisons of severity and patients/general population groups and correlations between the EQ-5D and non-preference-based HRQoL measures. Results for responsiveness were mixed, with two studies reporting weak SRMs of the measure, one study strong SRMs and four showing changes in the expected direction using SESs and statistical significance. The only comparative study across GPBMs reported poor correlations between EQ-5D and SF-6D.
One review each investigated the performance of the EQ-5D in urinary incontinence [21] and HIV [33]. There was evidence of validity and responsiveness in urinary incontinence [21], with five studies supporting discriminative validity based on severity levels and type of urinary incontinence, seven reporting moderate to strong correlations with HRQoL and symptom and severity measures, and five showing differences in health status from baseline to follow-up and between treatment arms. Two studies reported mixed results: one showed that the EQ-5D distinguished between some types of urinary incontinence but not others, and the other that the EQ-5D detected treatment differences only for some groups of patients, whereas other measures registered changes for all treatment groups. Two studies had inconclusive results for convergent validity as they did not specify the strength of correlations between measures. One study reported results for other GPBMs, supporting SF-6D, 15D and AQoL known group validity based on the assessment of severity traits. In HIV [33], responsiveness of the EQ-5D was weak, with generally small before and after treatment SESs in the presence of moderate or large ESs for the comparator measures. The only study investigating construct validity reported a good ability of the measure to discriminate between known groups.
The EQ-5D appeared generally valid and responsive in a number of cancers [25, 31] (including lung, breast, cervical, colon, kidney, liver cancer and leukemia) although limitations were found in some studies. Twenty-five of the 31 studies examining known group differences showed that EQ-5D distinguished between cancer severities, patients/general population and groups with different types of cancer; 12 of the 17 studies examining convergent validity reported moderate to strong correlations with direct utility measures, HRQoL measures and functional status measures; and 29 of 43 studies examining responsiveness showed that the measure detected changes between treatment arms and from baseline to follow-up that were consistent with those of comparator measures. A significant amount of evidence supported HUI3 psychometric characteristics [25, 31] with 8 studies out of 11 showing good discriminative ability in distinguishing between severity levels, type of cancer and cancer patients/general population, 4 studies out of 7 reporting good convergence with functional status measures and 8 studies out of 10 a good ability to detect changes from baseline and between treatment arms. Only two studies reported information for the SF-6D. In one, the measure was not able to detect differences between cancer patients and the general population. In another, the measure correlated appropriately with a cancer HRQoL questionnaire. Very few comparative studies were reported between the investigated GPBMs, and these do not clarify which performs better.
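To compare the cancer evidence across properties and instruments, the study counts above can be expressed as simple proportions. This is a rough descriptive sketch only: it uses the counts reported in this paragraph, the labels are illustrative, and a raw proportion of supportive studies takes no account of study size or quality.

```python
# (supportive studies, studies examining the property) in cancer,
# as reported in the text. Keys are illustrative labels.
cancer = {
    ("EQ-5D", "known groups"): (25, 31),
    ("EQ-5D", "convergent validity"): (12, 17),
    ("EQ-5D", "responsiveness"): (29, 43),
    ("HUI3", "known groups"): (8, 11),
    ("HUI3", "convergent validity"): (4, 7),
    ("HUI3", "responsiveness"): (8, 10),
}

# Percentage of studies supporting each property, rounded to whole numbers.
support_pct = {
    key: round(100 * supportive / examined)
    for key, (supportive, examined) in cancer.items()
}

for (instrument, prop), pct in support_pct.items():
    print(f"{instrument:6s} {prop:20s} {pct}%")
```

On these counts, support ranges from roughly 57% to 81% of studies depending on the instrument and property, with far fewer studies behind each HUI3 figure than behind the EQ-5D figures.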
The EQ-5D showed a mixed performance in cardiovascular diseases [34] (including coronary heart disease, cerebrovascular disease, hypertension and heart failure). Although many studies supported the instrument’s convergent validity with other GPBMs, HRQoL measures and functional status measures, and its ability to distinguish known groups based on severities of the conditions and type of conditions, two studies showed poor correlations with HRQoL measures, three had problems in distinguishing between patients and the general population, eight failed to detect statistically significant changes at follow-up and one failed to show differences between treatment arms. Three comparative studies were reported, comparing the EQ-5D with the SF-6D, the EQ-5D with the HUI3, and the EQ-5D with both the SF-6D and HUI3. In two of them, correlations between the EQ-5D and SF-6D, and between the EQ-5D, HUI3 and SF 36, were generally poor. The third comparative study presented moderate to strong correlations between the three instruments.
The EQ-5D performance in visual disorders [26] (including macular degeneration, glaucoma, conjunctivitis, diabetic retinopathy and others) was generally mixed. Known groups testing showed generally poor or mixed validity using severity groups, and generally good validity using patients versus general population groups. Mixed evidence was also reported for convergent validity, with the instrument correlating moderately to strongly with clinical measures in only four of the nine studies that investigated the property. Evidence for EQ-5D responsiveness was mixed and limited, with one study in support of the measure’s responsiveness, one against and one reporting mixed evidence. All these studies used tests of statistical significance before and after treatment. The HUI3 appeared to be valid, although the evidence was limited. Two studies reported a good ability of the measure to distinguish known groups based on the severity of the condition and on patients/general population. Another study reported moderate to strong correlations with functional status measures. A fourth study showed that the HUI3 was able to detect statistically significant changes between treatment arms [26]. Only two studies reported on the SF-6D characteristics, and these showed that the measure performed better than the EQ-5D [26].
EQ-5D performance has been reviewed in only one condition of the nervous system [24], multiple sclerosis, with three studies supporting the instrument’s convergent validity and three reporting weak to moderate correlations with other HRQoL measures. There was substantial evidence against the instrument’s ability to distinguish between severity groups: two studies reported that the measure distinguished between some severity levels but not others (mixed evidence), and two showed that the measure was not able to detect health status differences in any of the severity levels. Evidence for the SF-6D, HUI3 and AQoL was limited but supported the measures’ performance [24]: two studies reported moderate to strong convergence of the SF-6D with HRQoL measures, two showed good discriminative ability of the HUI3 between severity groups, with strong correlations of the measure with other HRQoL instruments also reported, and two showed good discriminative ability of the AQoL, with the assessment being based on severity levels.
The EQ-5D performance in hearing impairments [27] was poor, with only two of the seven studies supporting validity and responsiveness, one reporting moderate to strong correlations with other GPBMs and the other reporting statistically significant changes of score before and after treatment. The HUI3 showed a better performance, with all known group assessments but one in favour of the instrument’s validity (based on severity traits) and most of the responsiveness tests showing an ability to detect changes in health status before and after treatment [27]. Although few comparative studies were found, all of them suggested that the HUI3 performs better than the EQ-5D in hearing impairment.
Five reviews investigated the performance of the EQ-5D in mental health [17,18,19,20, 35], and all but the one on depression and anxiety showed that the instrument suffered from problems. Three studies showed low correlations between the EQ-5D and HRQoL measures in dementia; four had low correlations between the EQ-5D and the time trade-off, standard gamble and symptom specific measures in schizophrenia; two had low correlations between the EQ-5D and other measures (not specified) in bipolar disorder; and two had low to moderate correlations between the EQ-5D and symptom and severity measures in personality disorders. Evidence against the measure’s validity was also found for known groups in personality disorders and bipolar disorder, with one study showing poor discrimination between groups based on different types of personality disorders, and another poor discrimination between severity levels of bipolar disorder. Convergent validity, known groups and responsiveness results for the SF-6D and HUI3 supported the instruments’ psychometric characteristics, with the exception of an SF-6D known group test that showed mixed results in depression (discriminating only between some groups but not others) [20], although the evidence base was smaller.
Four systematic reviews reported evidence on EQ-5D and SF-6D psychometric characteristics in musculoskeletal diseases [36, 38, 41, 43]. One study reported good convergence for the EQ-5D with another HRQoL measure in rheumatoid arthritis, while another had inconclusive results in chronic low back pain, with data being too sparse to assess correlations. The SF-6D was seen to have moderate to strong convergence with an HRQoL measure in rheumatoid arthritis, but mixed known group results in spinal cord injuries, with three studies supporting the instrument’s discriminative ability and four reporting against it [36].
The other ICD disease classes, including haematological, gynaecological and autoimmune diseases and diseases of the nose, were very sparsely investigated. Three reviews investigated injuries, aesthetic surgery and older populations; evidence was extremely limited, although the few studies available generally supported the GPBMs’ psychometric characteristics [21, 31, 36, 38, 39, 43,44,45].