An assessment of validity and responsiveness of generic measures of health-related quality of life in hearing impairment

Purpose This review examines the psychometric performance of three widely used generic preference-based measures, that is, EuroQol 5 dimensions (EQ-5D), Health Utility Index 3 (HUI3) and Short-form 6 dimensions (SF-6D), in patients with hearing impairments. Methods A systematic search was undertaken to identify studies of patients with hearing impairments where health state utility values were measured and reported. Data were extracted and analysed to assess the reliability, validity (known group differences and convergent validity) and responsiveness of the measures across hearing impairments. Results Fourteen studies (18 papers) were included in the review. HUI3 was the most commonly used utility measure in hearing impairment. In all six studies, the HUI3 detected differences between groups defined by the severity of impairment, and four out of five studies detected statistically significant changes as a result of intervention. The only study available suggested that EQ-5D had only a weak ability to discriminate between severity groups, and in four out of five studies, EQ-5D failed to detect changes. Only one study involved the SF-6D; thus, the information is too limited to draw conclusions on its performance. No evidence on the reliability of these measures was found. Conclusion Overall, the validity and responsiveness of the HUI3 in hearing impairment were good. The responsiveness of EQ-5D was relatively poor, and the limited evidence suggested weak validity. The evidence on SF-6D was too limited to make any judgment. More head-to-head comparisons of these and other preference measures of health are required.


Introduction
Economic evaluations using a cost-utility framework have been increasingly used to support decision-making in the allocation of health resources and priority setting [1][2][3].
Cost-utility analyses assess health care interventions in terms of the incremental cost per quality-adjusted life year (QALY) gained. QALYs combine length of life with a quality of life where the quality of life component is usually based on health state utility values [4].
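As a minimal sketch of the calculation described above, the following Python snippet computes QALYs as utility multiplied by duration and then an incremental cost per QALY gained. All utility values, durations and costs are hypothetical and are not taken from any study in this review; the function names are illustrative only, and discounting is deliberately ignored.

```python
def qalys(periods):
    """Sum utility x duration (in years) over consecutive health states."""
    return sum(utility * years for utility, years in periods)


def incremental_cost_per_qaly(cost_new, cost_old, qaly_new, qaly_old):
    """Incremental cost per QALY gained of a new versus an old intervention."""
    delta_qaly = qaly_new - qaly_old
    if delta_qaly == 0:
        raise ValueError("no QALY difference; the ratio is undefined")
    return (cost_new - cost_old) / delta_qaly


# Hypothetical example: 10 years at utility 0.6 versus 10 years at utility 0.7
usual_care = qalys([(0.6, 10)])
intervention = qalys([(0.7, 10)])
ratio = incremental_cost_per_qaly(
    cost_new=3000, cost_old=1000, qaly_new=intervention, qaly_old=usual_care
)
```

The sketch makes plain why the choice of utility measure matters: a measure that registers a smaller utility gain yields fewer QALYs gained and therefore a higher cost per QALY for the same intervention.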
One common way to obtain health state utility values is to use one of the generic preference-based measures (GPBMs) of health-related quality of life. Examples of the most commonly used GPBMs include the EuroQol 5 dimension (EQ-5D) [5,6], Short-form 6 dimension (SF-6D) [7] and the Health Utilities Index (HUI3) [8]. For the purpose of economic evaluation, these measures have the advantage of reflecting the value people place on different dimensions or levels of health and enable comparisons of health outcomes to be made across conditions. EQ-5D has 5 health dimensions (mobility, self-care, usual activities, pain/discomfort and depression/anxiety). Each dimension has 3 levels of severity in the original version, and a version with 5 levels of severity has recently been developed [5,9]. Derived from the Short-form 36 and Short-form 12 health questionnaires, the SF-6D has 6 dimensions (physical functioning, role limitation, social functioning, bodily pain, mental health and vitality), and each dimension has 4-6 severity levels. The HUI3 has 8 dimensions (vision, hearing, speech, ambulation, dexterity, emotion, cognition and pain), and each dimension has 5 or 6 severity levels. These measures differ in terms of the description of health. The HUI3 can be seen as a 'within the skin' measure of health and includes sensory dimensions such as vision, speech and hearing. EQ-5D and SF-6D focus more on how health impacts on functioning in life, but nonetheless there are important differences in coverage, such as EQ-5D not having vitality and SF-6D containing role and social functioning dimensions compared to usual activities in EQ-5D. Apart from the different descriptive systems, the measures also differ in terms of the methods used to estimate health state values.
Empirical evidence has confirmed that health state utility values obtained from these three GPBMs are different from each other [10][11][12][13][14]. In order to allow comparability between conditions, in the United Kingdom, the National Institute for Health and Clinical Excellence prefers the use of a single GPBM, the EQ-5D [15]. However, GPBMs have attracted criticism for failing to capture important aspects of health and for being insensitive to changes in health state because one or more important dimensions of health relevant to a medical condition have been excluded [16][17][18][19][20][21]. The National Institute for Health and Clinical Excellence recognized that there may be specific circumstances in which the EQ-5D is not appropriate and offers some advice for these circumstances; however, it does not identify those areas where EQ-5D is inappropriate nor provide criteria to determine this.
More evidence on the performance of EQ-5D and other GPBMs is required for a wider range of conditions and/or treatments to demonstrate whether these measures are appropriate for these conditions in order to judge when alternative measures should be considered. The assessment of the validity and responsiveness of GPBMs is fraught with conceptual and empirical problems owing to the lack of a gold standard measure. However, by taking into account a range of evidence on specific conditions in a systematic and transparent way, it is possible to judge the performance of the instruments [22]. This is important in order to help inform which measures should be included for the assessment of benefits of specific health interventions, or for interpreting the evidence from population studies that include such instruments. For example, recently in the United States, the National Health Measurement Study (http://www.healthmeasurement.org/NHMS.html) attempted to build a versatile 'toolbox' for this purpose, and EQ-5D, Short-form 36 version 2 (from which SF-6D can be derived) and HUI were included in the toolbox.
Hearing impairment is one of the most common chronic health problems in Western society, in part due to the growth of the elderly population, affecting 15 % of the adult population [23]. Hearing loss affects a person's ability to communicate, social participation, independence, employment and overall quality of life [23]. Previous research has suggested that health state utility values obtained from people with hearing impairments are different using different generic instruments [24]. A review of the evidence on the validity of GPBMs in hearing impairment has not been previously undertaken. The aim of this study was to systematically review the published literature to assess the reliability, validity and responsiveness of three key generic measures of health-related quality of life (EQ-5D, HUI3 and SF-6D) in people with hearing impairment.

Search strategy and data identification
The objective of the literature review was to identify published papers reporting evidence of the performance of EQ-5D, HUI3 and SF-6D in patients with hearing impairments.
A broad search was conducted to identify studies reporting EQ-5D, SF-6D and HUI3 to examine the health-related quality of life of patients with a hearing impairment. BIOSIS, CINAHL, EMBASE, MEDLINE, PsycINFO and Web of Science electronic databases were searched. The database available from the EuroQol Group Website was also searched, but comparable databases for HUI3 and SF-6D are not available. The search was based on keywords, including 'hearing impairment/disorder', 'euroqol/EQ-5D', 'hui3' and 'sf6d', all with alternative spellings. The search strategy is presented in Appendix 1. The criteria for inclusion were that the study population had a hearing impairment, and that the study reported at least one of the EQ-5D, SF-6D or HUI3 together with another measure of quality of life (generic or condition-specific), a measure of clinical severity, or a direct valuation of health. Papers only reporting EQ-VAS (EuroQol Visual Analogue Scale) scores were excluded as the main interest was the descriptive systems and utility indices of the three measures. Papers that only used vignettes or own health state valuations, and not one of the three generic measures, were excluded. There was no restriction relating to the type of study. Due to resource limitations, only English language studies were reviewed.

Data extraction
Considering the aim of the study and reviewing forms used for similar studies in other disease areas [17,25], a template was developed to extract data in a standardised format including the following:
• Study characteristics: country, type of hearing impairment, disease or treatment stage, any treatment given, study design;
• Participant characteristics: number of participants, age, gender, ethnicity, missing data;
• Instruments used: EQ-5D/SF-6D/HUI3, other generic measures of health-related quality of life, condition-specific health-related quality of life measures and clinical measures of disease severity, patient's own health state valuations (e.g. Visual Analogue Scale (VAS), Time trade-off (TTO) and Standard Gamble (SG));
• Health state utility values: mean of utility index, scoring algorithm;
• Construct and convergent validity: methods of assessment and results;
• Responsiveness: methods of assessment and results;
• Reliability: methods of assessment and results.

Quality assessment of studies
For the review, the most important consideration was the relevance of each study in terms of the patient population and the inclusion of evidence to answer our research question. Nevertheless, the quality of studies was assessed by examining study design, recruitment process, sample size and the extent of missing data reported. The intention of the quality assessment was not to exclude relevant studies. Rather, it gives some indication of quality to assist with the interpretation of the findings. It should be noted that some studies may be of high quality for their research question, but provide limited information for the assessment of validity, reliability and responsiveness. For example, a case-control study may be well designed and conducted, but offer information limited to the presence or absence of hearing problems, and not detailed information on the severity of the condition.

Assessment of validity
Validity is defined as how well an instrument measures what it was intended to measure [22,26]. Ideally, validity would be assessed by comparing an instrument to an established gold standard. However, in the case of health-related quality of life, no gold standard measure exists. Therefore, construct validity was assessed by making comparisons with other measures of quality of life and disease severity, and assessing the totality of that evidence to see whether the results from the GPBMs reflect the patterns in scores seen in those other measures [22].
A common test of construct validity is the 'known group' method [22]. This is determined by the degree to which an instrument can demonstrate different scores for groups known to vary on the variables being measured. In this study, health state utility values were compared between groups of patients defined in terms of disease severity. Trends in the pattern of utility, statistical tests (e.g. t test) and regression were used for the assessment. The patient population could be stratified on the basis of a clinical indicator or a health-related quality of life measure (generic or condition-specific). A less stringent test of construct validity is to define groups using a case-control analysis where scores of patient groups and non-patient groups or the general population are compared.
Another type of construct validity is known as convergent validity [22]. This is defined as the extent to which one measure correlates with another measure of the same concept (although this measure is not regarded as a gold standard). In this review, the extent to which EQ-5D, SF-6D or the HUI3 correlated with other measures of hearing problems or health-related quality of life was examined based on statistics including correlation coefficients or regression analysis with hearing-specific health-related quality of life measures or measures of hearing loss.
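To illustrate a 'known group' comparison, the sketch below computes a two-sample t statistic for utility scores in two severity groups. The scores are hypothetical, and the choice of Welch's unequal-variance form is an assumption made for this illustration; the reviewed studies do not state which variant of the t test they used.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    # Squared standard error contributed by each group
    va, vb = variance(group_a) / na, variance(group_b) / nb
    t = (mean(group_a) - mean(group_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical utility scores for two groups defined by hearing loss severity
mild = [0.85, 0.80, 0.90, 0.75, 0.88]
severe = [0.60, 0.55, 0.70, 0.50, 0.65]
t_stat, dof = welch_t(mild, severe)
```

A large positive t statistic here would indicate that the instrument assigns higher utility to the milder group, which is the pattern expected if the measure discriminates by severity.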

Assessment of responsiveness
Responsiveness is the ability to measure change. A pre/post-intervention study which reports EQ-5D, SF-6D or the HUI3 and another valid measure of health change would allow the responsiveness of a measure to change in health status to be identified. As with the tests of validity, it is important to consider whether the measures of health change that are used for comparison are valid themselves. In addition, it is important to consider whether other health changes not directly related to the condition could have impacted upon health-related utility (for example, side effects of treatment).

Assessment of reliability
The reliability of a measure is defined as its ability to reproduce results when measurements are repeated on an unchanged population [22]. Reliability can be measured by retesting and reporting either the correlation or difference between estimates. For this study, the measures were considered reliable if they demonstrated no change in health-related quality of life when the other reference measures also demonstrated no change in health.
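A minimal sketch of a test-retest check follows, reporting both the correlation and the mean difference between two administrations. The utility scores are hypothetical, and the use of the Pearson coefficient is an assumption for illustration; other agreement statistics could equally be chosen.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between paired test and retest scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_difference(test, retest):
    """Average within-person change between the two administrations."""
    return mean(b - a for a, b in zip(test, retest))


# Hypothetical utility scores from the same unchanged respondents
test = [0.70, 0.55, 0.80, 0.62, 0.90]
retest = [0.72, 0.53, 0.79, 0.65, 0.88]
r = pearson_r(test, retest)
diff = mean_difference(test, retest)
```

A high correlation together with a mean difference near zero would be consistent with a reliable measure in a population whose health has not changed.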

Presentation of data
Data were presented in a series of summary tables as well as brief text, providing information on the characteristics and quality assessment of included studies, the measures included, methods and results for validity assessment, and methods and results for responsiveness assessment. At the end, a table providing an overview of the performance of EQ-5D, HUI3 and SF-6D was presented, recording the findings as '✓' if the evidence supported the statement, 'x' if the evidence did not support the statement, '?' if the evidence was mixed and a conclusion could not be made, or 'N/R' if no information was reported. If the pattern and direction of EQ-5D were consistent with other measures in terms of difference between groups or change over time, this was considered as supporting evidence. Correlation coefficients were grouped as small (<0.3), moderate (0.3-0.5) and strong (>0.5), and a significant predictor in a regression was recorded as '✓'.
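The grouping of correlation coefficients can be sketched as a small helper. This assumes the moderate band runs from 0.3 to 0.5 inclusive and that the sign of the coefficient is ignored; both are assumptions for illustration, and the function name is hypothetical.

```python
def correlation_strength(r):
    """Classify a correlation coefficient by its absolute size."""
    size = abs(r)  # assumption: direction of the association is ignored
    if size < 0.3:
        return "small"
    if size <= 0.5:
        return "moderate"
    return "strong"


labels = [correlation_strength(r) for r in (0.2, 0.4, -0.7)]
```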

Search results
Bibliographic searching was completed in July 2010. The search strategy identified 119 articles. After reviewing titles and abstracts, 70 papers were excluded. Forty-nine papers were reviewed in full, and a further 31 were excluded and 18 papers were included in the final review (see Fig. 1). Papers were included if they provided sufficient evidence to assess the validity, responsiveness and reliability of EQ-5D, HUI3 and SF-6D. However, the paper did not have to have been designed for this purpose. Papers were excluded if they did not include one of the generic measures of interest. Papers were also excluded if validity or responsiveness could not be assessed because no other clinical or quality of life measures were included, and differences over time or between interventions were not reported. Since the focus of the review is preference-based measures, papers only reporting Visual Analogue Scale scores were also excluded.

Quality assessment and characteristics of included studies
Most studies were not designed to specifically assess the validity, reliability and responsiveness of the instruments, but all provided data in sufficient detail to allow an assessment to be made. A range of recruitment procedures was noted in the studies included in the review. Some were cross-sectional observational studies [27,28] but the majority were prospective or retrospective before-after studies [29][30][31][32]. Studies had well-defined inclusion/exclusion criteria in recruitment. Sample size ranged from 20 to 3,272; most studies had around 100 participants, but two studies had only around 20 participants [29,33]. Among the longitudinal studies, none had extremely high levels of missing data. The reporting in these papers was reasonably clear. After quality assessment, no studies were excluded from the review.
The main characteristics of the 18 papers included in this review are shown in Table 1. The four papers by Joore et al. [31,32]; Joore [34,35] reported the results of one specific study and similarly the two papers by Vuorialho et al. [36] reported a single study, resulting in 14 studies in total. The studies were undertaken in a range of countries, including the United Kingdom, the Netherlands, the USA, Canada and Finland. Some studies recruited patients with specific hearing problems, for example, Large Vestibular Aqueduct Syndrome [29], or profound deafness or conductive or mixed hearing loss [37,38]. Twelve papers reported the hearing loss of their sample using clinical indicators such as the better ear unaided pure-tone average. As shown in Table 1, the level of hearing loss varied between studies.
Five studies included young children with hearing impairments (mean ages of the samples ranged from 7.3 to 9.3 years old), and the remaining studies included adults in their studies with most focussing on older adults over 60 years. The studies involving children used parents or caregivers as proxies to assess health-related quality of life of children. Table 2 summarises the measures which have been used in the 18 papers. For the three generic preference-based measures of interest, 11 papers reported EQ-5D, 11 reported HUI3 and 1 used the SF-6D (alongside EQ-5D and HUI3). Among those studies that used EQ-5D, most reported the EQ-5D index based on the tariff of UK population values. In two cases, it was unclear which tariff of population values had been used. Three papers also reported responses on the five EQ-5D dimensions alongside the utility indices using tariff [31,34,39]. One study [30] also compared EQ-5D results using the UK and Dutch tariff. Only the Canadian tariff was used to value the HUI3. Two studies used Quality of Well-being (QWB, another generic preference-based measure of health) alongside EQ-5D or HUI3.

Measures and clinical indicators used in the studies included
A total of 11 papers also reported Visual Analogue Scale (VAS) results including: EQ-VAS, a general health VAS, a hearing-specific VAS and a general quality of life VAS. In total, 7 papers reported the EQ-VAS and used best and worst imaginable health as anchors. Among them, 4 publications related to a single study reported the results of a hearing-specific VAS [31,32,34,35] using 'deaf' and 'perfect sense of hearing' as the anchors. One study [40] regarded hearing impairment as having an effect beyond health or HRQoL, so alongside EQ-VAS, another VAS using the best/worst quality of life as anchors was also reported. The remaining four papers reported a general health VAS and among them, one used 'death' and 'the imaginable best health' as anchors, one used 'imaginable worst/best health' as anchors and for the other two, anchors were not clearly reported.
Time trade-off (TTO) values were obtained without the use of generic measures in 2 studies [33,37]. The study by Cheng et al. used parents as proxies to assess their deaf children's utility and the TTO compared two alternatives: one being in the current health state without hearing aid for remaining life expectancy and another alternative being in perfect health for a shorter time period. The values in the study by Lee et al. were not actual TTO values but predictions from VAS transformed using a power formula.
A total of 9 studies employed self-reported hearing-specific health-related quality of life measures. This included 3 studies using the Hearing Handicap Inventory

Reliability of GPBMs in hearing impairment
The review found little evidence on the reliability assessments of EQ-5D, HUI3 and SF-6D in hearing impairment.
No papers reported conducting test-retest experiments.
Although not specifically for test-retest reliability purposes, one study [34] reported EQ-5D responses and VAS indices at baseline and asked respondents to recall them 3 months after hearing aid fitting. They did not find any significant difference between the baseline assessment and the recalled assessment of baseline health for EQ-5D.

Construct validity of GPBMs in hearing impairment
Out of the 18 papers included in the review, 7 papers provided information to enable an assessment of the EQ-5D, HUI3 or SF-6D, although most studies were not designed to examine the validity of these measures. The results are summarised in Table 3.

'Known group' differences
Seven studies presented data to allow an assessment of 'known group' differences where the groups were defined by severity of hearing loss. Using ANOVA, the study by Grutters et al. [30] demonstrated that EQ-5D failed to detect significant differences between hearing loss severity groups, whereas HUI3 did. Another study found that EQ-5D differentiated the group with the most severe hearing loss but not groups defined by milder levels of deafness [40]. Barton et al. [41] reported that HUI3 mean scores were different between moderate, severe, profound1, profound2 and implanted groups defined by the average of pure-tone air-conduction thresholds at the frequencies 0.5, 1, 2 and 4 kHz in the better hearing ear, although no statistical test was reported. Palmer et al. [42] showed that HUI3 successfully discriminated between people with hearing aids (0.76) and without hearing aids (0.58) at 6 months (p < 0.001) and 12 months after intervention (p < 0.1) using a t test. Similarly, HUI3 discriminated between 2 groups of patients, one with cochlear implants and one with normal hearing aids, where the hearing loss severity of the 2 groups differed according to their pure-tone average. In a study comparing HUI3 and QWB in hearing loss, both scores declined with the degree of hearing loss for children who did not have a cochlear implant, with a much greater decline for HUI3 than for QWB [28]. A further study of the HUI3 found that it did not differentiate between groups defined according to unilateral or bilateral implantation [27]. However, this finding was also reflected in the VAS measure and may indicate that the additional impact of bilateral implantation in this group is small.

Convergence
Three studies presented data for an assessment of convergence of EQ-5D and HUI3. HUI3 showed poor correlation with two speech perception tests; however, a hearing-specific quality of life measure also showed similar results [43]. Grutters et al. [30] reported a moderate correlation between EQ-5D and HUI3. Barton et al. [41] reported a regression analysis and showed that CI (grouped by age at implantation and duration of use), the average of pure-tone air-conduction … [24] reported strong correlations between EQ-5D, HUI3 and SF-6D in their study.

Responsiveness of GPBMs in hearing impairment
Twelve papers, covering a total of 8 studies, provided adequate information to allow an assessment of the responsiveness of EQ-5D and/or HUI3 (see Table 4). Only two studies were specifically designed to examine the responsiveness of different measures, and these reported responsiveness indices such as effect size and standardised response mean [38,39].
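The two responsiveness indices named above can be sketched as follows, using their conventional definitions: the effect size divides the mean change by the standard deviation of baseline scores, and the standardised response mean divides it by the standard deviation of the change scores. The data are hypothetical, and the reviewed studies may have used slightly different formulations.

```python
from statistics import mean, stdev


def change_scores(baseline, follow_up):
    """Paired within-person changes (follow-up minus baseline)."""
    return [post - pre for pre, post in zip(baseline, follow_up)]


def effect_size(baseline, follow_up):
    """Mean change divided by the standard deviation of baseline scores."""
    return mean(change_scores(baseline, follow_up)) / stdev(baseline)


def standardised_response_mean(baseline, follow_up):
    """Mean change divided by the standard deviation of the change scores."""
    changes = change_scores(baseline, follow_up)
    return mean(changes) / stdev(changes)


# Hypothetical utility scores before and after an intervention
pre = [0.50, 0.60, 0.55, 0.70, 0.65]
post = [0.60, 0.75, 0.60, 0.80, 0.70]
es = effect_size(pre, post)
srm = standardised_response_mean(pre, post)
```

The two indices can diverge noticeably: when respondents change by similar amounts, the change scores vary little and the standardised response mean exceeds the effect size, as in this example.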

Comparison between GPBMs
Comparisons of mean change scores of different GPBMs using statistical tests were reported by Grutters et al. [30] and Lee et al. [33] (See below

Responsiveness of EQ-5D
Eight papers, covering a total of 4 separate studies, reported the responsiveness of EQ-5D without the other 2 generic measures by comparing it with EQ-VAS, hearing VAS or other hearing-specific measures. In these studies, no statistically significant changes before and after the hearing intervention were detected by the EQ-5D [31,32,34,35,38,39] and the effect size, where reported, was very low [38]. By contrast, statistically significant improvements were shown in VAS scores [32,34,35,36,39] and in condition-specific measures such as two sub-domains (disability and handicap) of the Hearing Handicap and Disability Index [38], overall scores of the Hearing Handicap Inventory for the Elderly and its 2 sub-domain scores [36,39], and the 5 questions of the Amsterdam Inventory and the Audiological Disabilities Preference Index [35]. Joore's study reported that self-perceived SF-36 social functioning was significantly improved in the long term after hearing aid fitting. … [43].

Discussion and conclusions
The 18 papers (14 studies) included in this review provide useful information to assess the validity and responsiveness of GPBMs for use in hearing impairment. A summary of the overall performance of the 3 GPBMs is provided in Table 5. There was heterogeneity in the studies reviewed in terms of study design and patient populations, which needs to be taken into account when interpreting the findings. Overall, the HUI3 was the most commonly used measure in the studies. In all 6 cases, the HUI3 detected differences between groups defined by their severity of hearing impairment, and in 4 out of 5 cases it detected statistically significant changes as a result of intervention. Differences picked up by the HUI3 were driven by the hearing dimension and also, in some cases, the speech dimension and the emotion dimension. On the other hand, the findings of the review suggested relatively poor responsiveness of EQ-5D in this condition, as in 4 out of 5 cases EQ-5D failed to detect change. The only study that allowed an assessment of known groups using the EQ-5D suggested it had only a weak ability to discriminate between severity groups. Only one study involved the SF-6D; thus, the information is too limited to draw conclusions on its performance. Converting published mean SF-36 scores into SF-6D would not help since psychometric testing requires individual-level data.
Two studies reported on the validity of EQ-5D, and the results were mixed when the groups were defined by severity of hearing impairment [40]. In terms of responsiveness, EQ-5D did not demonstrate statistically significant changes after hearing aid fitting, but there were statistically significant changes detected by clinical indicators or condition-specific measures. The EQ-5D appears to reflect less, or often no, change/difference compared to the HUI3 and clinical measures. Possible reasons for this include the EQ-5D not capturing important effects of hearing on quality of life, or that the changes/differences in hearing have little impact on overall quality of life (e.g. because the level of change is small or people have adapted to their hearing loss and value other aspects of health more). Also, the clinical measures and other indicators used to group hearing impairment severity may be regarded as poor indicators for testing construct validity. However, given that significant differences were found for HUI3 and hearing-specific health-related quality of life measures, the lack of significant differences for EQ-5D is a concern.
There were 2 exceptions to the poor performance of EQ-5D: one study demonstrated a statistically significant improvement in EQ-5D index scores after cochlear implantation, and in another study, the EQ-5D differentiated the group with severe hearing loss but not groups with different levels of milder hearing loss. Although EQ-5D utility indices remained stable over time, 1 study showed that the proportion of respondents who reported problems on dimensions of EQ-5D increased or decreased. Another study treated responses on the EQ-5D dimensions as continuous variables to compare mean responses before and after intervention, which was judged to be problematic for analysis. There were 2 studies where the HUI3 reflected differences, but much smaller differences were found in the EQ-5D utilities.
It is perhaps unsurprising that HUI3 performs well as it explicitly includes a hearing dimension. The lack of a direct reference to an impairment or symptom in EQ-5D does not mean that, by definition, its effects will not be captured; however, this review suggests that EQ-5D may perform poorly in this particular type of condition. A five-level version of the EQ-5D has recently been developed and this may overcome the problem if it is simply one of sensitivity. However, the problem may instead be related to a lack of relevant dimensions to pick up the impact of hearing loss. Evidence using the 5-level version is required to test this hypothesis. Another approach to the problem might be to use a condition-specific preference-based measure for hearing like the one developed by Yang et al. for asthma [44] or Rowen et al. in cancer [45]. The problem with using condition-specific measures is that they may miss important side effects of treatment and the values may exaggerate the impact of the conditions due to focusing effects among members of the general public [46]. This is the reason for ongoing research into developing bolt-ons to the EQ-5D to cover those dimensions that appear to be missing, for example cognition [47], and in the same way, one could be developed for hearing. In the meantime, the best option appears to be to use HUI3.
It cannot always be assumed that a generic measure should reflect the change of health states which a condition-specific measure detects. Disease- or condition-specific measures are tailored to the condition of interest and may therefore be more sensitive to change. However, they may not capture the broader impacts of the condition on health-related quality of life. In addition, the general population (or indeed the patients) may not regard the change as sufficiently important when valuing health. Interestingly, in this study, not only GPBMs, but also the hearing-specific measures showed poor correlations with clinical indicators. This emphasises the importance of including patient-reported outcome measures in the evaluations of health care interventions. In addition, preference-based measures can reflect how changes in health states are valued in relation to other aspects of health. This is the first time that information on the validity and responsiveness of GPBMs in hearing impairment has been comprehensively reported and analysed. This paper reports important findings for the use of GPBMs of health to compare the impact of hearing loss on health-related utility. The results indicate that HUI3 is an appropriate measure for use in hearing impairment given its good performance in terms of validity and responsiveness. EQ-5D was not responsive to modest changes in hearing impairment, and the limited evidence suggested it has weak validity in this condition. Very little evidence was found for SF-6D.