Introduction

A cataract is an opacity of the eye lens that is the leading worldwide cause of blindness [1]. It can be successfully treated using surgery, which is the most common operation conducted in many countries [2]. Cataract surgery rates range from up to 10,000 operations per million population in 1 year in developed countries (e.g. USA) to less than 500 in developing countries (e.g. Ethiopia, Kenya) [2].

The National Institute for Health and Care Excellence (NICE) recommends the use of economic evaluation to inform decision-making about which treatments to fund [3]. For the evaluation of interventions funded by the NHS, cost–utility analysis (CUA) is the preferred method of economic evaluation and the quality-adjusted life year (QALY) is the preferred measure of benefit [3]. The ‘quality adjustment’ can be derived from preference-based measures (PBMs). Scores on these questionnaires are weighted to reflect the value the general population has placed on a particular state of health. PBMs can be used to value the effects of healthcare interventions in one index measure. PBMs could reflect generic health-related quality of life (HRQL) (e.g. EQ-5D), generic measures of HRQL expanded to include disease-specific dimensions (e.g. EQ-5D ‘bolt-ons’) or measures broader than HRQL such as capability wellbeing (e.g. ICECAP). This study sought to explore which PBMs are most appropriate in a cataract patient population.

The EQ-5D [4] is endorsed by NICE [3]. The first iteration of the measure, the EQ-5D-3L, comprises five ‘core’ questions about five domains of HRQL, each with three response options. These domains are mobility, self-care, usual activities, pain and anxiety/depression. There is criticism, however, that the EQ-5D-3L is insensitive [5, 6] and that, in non-acute conditions, a significant number of patients score the highest value of one (ceiling effect) [6]. In response to these and other concerns, the EQ-5D-5L was developed. The EQ-5D-5L [7] increases the possible response options from three to five levels and modifies some wording. There are currently two algorithms to generate EQ-5D-5L preference-based utilities for a UK sample: a Value Set for England (EQ-5D-5L-VSE) [8] and the EQ-5D Crosswalk (EQ-5D-5L-CW) [9]. NICE recently confirmed their position that the EQ-5D-5L-CW should be used to generate utilities [10].

Another criticism is that the EQ-5D domains are not relevant to certain conditions, including visual impairment [11, 12]. Consequently, an EQ-5D-3L vision bolt-on was developed (EQ-5D-3L + VIS) [12], which asks respondents a sixth question about their vision (using glasses or contact lenses if needed). The item is worded as follows: I have no/some/extreme problems seeing. Methodological issues potentially impeding bolt-on implementation include the small samples used to value the bolt-on, the lack of validation of value sets and the apparent impact of bolt-ons on responses to ‘core’ EQ-5D domains [12, 13].

The capability approach offers an alternative to HRQL, where an individual’s ability, or capability, to function is the outcome evaluated [14]. Cataracts may limit individuals’ ability to live fulfilling lives, but successful cataract surgery could reduce limitations caused by impaired vision. Capability measures may better capture these benefits compared to HRQL outcomes. The ICECAP-O [15] measures this construct in older adults, but it has not been used in a cataract patient population [16]. The ICECAP-O’s five domains cover attachment (love and friendship), security (thinking about the future without concern), role (doing things that make you feel valued), enjoyment and control (independence). Each attribute has four response options. The ICECAP-O has been valued in the UK using a best–worst scaling approach [17], in contrast to the EQ-5D, which used time trade off (TTO) (EQ-5D-3L, EQ-5D-3L + VIS) and TTO/discrete choice methods (EQ-5D-5L).

Assessing the cost-effectiveness of cataract treatment options requires a PBM suitable for use in this population. Previous studies have assessed the responsiveness of PBMs in cataract patients undergoing surgery [18, 19]; however, they had relatively small sample sizes (< 400) and none included a measure of capability wellbeing. The questions of which PBM to use in cataract surgery patients, and whether health is the most appropriate outcome, remain to be answered. Brazier et al. [20] recommend that a PBM is chosen based on its psychometric performance (including construct validity and responsiveness) in the patient population. Whilst the EQ-5D has been used in a cataract population before [18, 21], there is currently no published evidence of the use of the ICECAP-O [16]. The objective of this analysis was to evaluate the construct validity and responsiveness of the EQ-5D-3L, EQ-5D-5L, EQ-5D-3L + VIS and ICECAP-O in a cataract patient population.

Methods

Participants

The analyses used data from the Predict-CAT study, a cohort study of cataract surgery patients. Eligible participants were aged 50 or over, were able to understand and complete the PBMs, and were approaching either their first cataract surgery on either eye or a second surgery on the fellow eye. Participants were recruited from two NHS trusts (University Hospitals Bristol NHS Foundation Trust and Gloucestershire Hospitals NHS Foundation Trust) in the South West of England at the time of being listed for cataract surgery or at a pre-operative assessment appointment.

Data collection

Participants attended two study visits, before and after surgery. The post-operative appointment was scheduled to take place 6–8 weeks after cataract surgery, although in practice there was some variation. All participants completed the Cat-PROM5 and ICECAP-O. The Cat-PROM5 is a five-item questionnaire designed to measure the HRQL impact of cataract surgery [22]. The Cat-PROM5 is responsive to changes in vision following cataract surgery (Cohen’s d = –1.45) [22], although it is not preference based and thus cannot be used in CUA. It is currently being piloted in the National Ophthalmology Database Cataract Surgery Audit [23], its use having been encouraged by NICE [24]. Cat-PROM5 is comparable to, or performs better than, the widely used nine-item CATQUEST-9SF questionnaire [25].

For the remaining measures, participants were randomised in a 1:1:1 allocation to complete either the EQ-5D-3L, EQ-5D-3L + VIS or the EQ-5D-5L. Randomisation used an automated allocation when participants were added to the study database. Data collected also included socio-demographic information, medical history, assessment of visual function, and an ocular examination with pupil dilation.

A description of the measures and their scoring is provided in the Supplementary material. Lower Cat-PROM5 scores reflect better quality of life (QOL), whereas higher PBM scores are better. Both the EQ-5D-5L-CW [9] and EQ-5D-5L-VSE [8] algorithms were used to score the EQ-5D-5L.

Descriptive statistics

Clinical (visual acuity, diabetic status, first or second eye surgery, complications) and socio-demographic characteristics (age, gender) of the sample were summarised. Descriptive statistics were generated for PBM indices at baseline and follow-up. We estimated the proportion of participants scoring the maximum and minimum scores at baseline. A threshold of 15% was chosen [26] to define potentially problematic ceiling (maximum) or floor (minimum) effects. A large proportion of patients reporting the highest or lowest value at baseline reduces the potential to demonstrate either improvement or decline in condition following cataract surgery.
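
As an illustrative sketch of the ceiling and floor check described above (not the study's analysis code), the proportion of respondents at each scale limit can be compared with the 15% threshold. The function name, the example scores and the minimum index value of −0.594 (the lowest UK EQ-5D-3L index value) are assumptions introduced for illustration; the flag is raised at 15% or more, which mirrors the "less than 15%" criterion but is otherwise an assumed boundary rule.

```python
def ceiling_floor_effects(scores, min_score, max_score, threshold=0.15):
    """Return the share of respondents at each scale limit and flag problems."""
    n = len(scores)
    if n == 0:
        raise ValueError("scores must not be empty")
    ceiling = sum(1 for s in scores if s == max_score) / n
    floor = sum(1 for s in scores if s == min_score) / n
    return {
        "ceiling": ceiling,
        "floor": floor,
        # Flag when the share is NOT below the threshold (assumed boundary rule).
        "ceiling_problem": ceiling >= threshold,
        "floor_problem": floor >= threshold,
    }


# Made-up baseline index scores for ten hypothetical respondents (not study data).
baseline = [1.0, 1.0, 0.85, 0.76, 1.0, 0.69, 0.52, 0.85, 0.80, 0.62]
result = ceiling_floor_effects(baseline, min_score=-0.594, max_score=1.0)
print(result)
```

In this made-up sample three of ten respondents score one, so the ceiling flag is raised while the floor flag is not.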

Construct validity

Convergent validity

Convergent validity is the association between the measures of interest and outcomes measuring the same or overlapping constructs. Spearman’s rank correlations were calculated between all PBM and Cat-PROM5 scores at baseline. Correlation coefficients were interpreted using Cohen’s thresholds (large: > ± 0.5; moderate: ± 0.3–0.5; small: ± 0.1–0.3; insubstantial: < ± 0.1) [27, 28]. The relationship between visual acuity, measured using a LogMAR chart, and the PBMs was explored by measuring Spearman’s correlations between the PBMs and habitual near visual acuity in the eye to be operated on (referred to as operated eye at baseline hereafter).
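
A minimal sketch of this step, assuming average ranks for ties and the Cohen's thresholds quoted above, is given below. The function names and the five example observations are hypothetical; statistical packages provide equivalent routines, and this is not the code used in the study.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def cohen_label(rho):
    """Interpret the absolute size of a correlation using Cohen's thresholds."""
    size = abs(rho)
    if size > 0.5:
        return "large"
    if size >= 0.3:
        return "moderate"
    if size >= 0.1:
        return "small"
    return "insubstantial"


# Made-up scores: higher PBM index is better, lower Cat-PROM5 is better.
pbm_index = [0.9, 0.8, 0.7, 0.6, 0.5]
cat_prom5 = [-2.0, -1.0, 0.0, 1.0, 2.0]
rho = spearman_rho(pbm_index, cat_prom5)
print(rho, cohen_label(rho))
```

Because lower Cat-PROM5 scores and higher PBM scores both indicate better status, a negative coefficient is the expected direction; the label is therefore based on the absolute value.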

Known-groups validity

Known-groups validity compares the outcome measure between groups that are expected to differ. Three group comparisons were chosen based on previous research. These were (1) whether it was participants’ first or second eye surgery (baseline scores, second eye surgery participants expected to have higher HRQL/capability) [11], (2) visual acuity as good (≤ 0.3 LogMAR) or poor (> 0.3 LogMAR) monocular habitual near visual acuity (baseline scores, patients with poorer visual acuity expected to have worse HRQL/capability) [29, 30], and (3) ocular comorbidities (baseline scores, participants with comorbidities expected to have worse HRQL/capability) [31]. Linear regressions were conducted to compare scores between known-groups. This approach is commonly used when analysing utility scores bounded at one [32, 33]. The group was the predictor and PBM scores the dependent variables. Covariates in all regressions were age, gender and diabetic status. Analyses were stratified by EQ-5D randomisation group; thus ICECAP-O known-group differences were tested three times. This eliminated the potential for the ICECAP-O to appear to perform better simply due to the larger sample completing that measure.
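
The structure of such a regression can be sketched as follows: the PBM index is regressed on a group indicator plus age, gender and diabetic status, and the coefficient on the group indicator is the adjusted between-group difference. The ordinary least squares routine, the variable names and the eight noise-free records are all assumptions made for illustration; the sketch returns point estimates only and does not reproduce the confidence intervals reported in the study.

```python
def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: predictor values per participant (without intercept).
    Returns [intercept, b1, b2, ...].
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Hypothetical records: (second_eye_surgery, age, female, diabetic).
rows = [
    (0, 62, 0, 0), (1, 70, 1, 0), (0, 75, 1, 1), (1, 81, 0, 0),
    (0, 68, 1, 0), (1, 77, 0, 1), (0, 84, 0, 1), (1, 66, 1, 1),
]
# Made-up, noise-free index scores generated from known coefficients.
true_coefs = [0.90, 0.04, -0.002, 0.01, -0.03]
scores = [true_coefs[0] + sum(b * v for b, v in zip(true_coefs[1:], r))
          for r in rows]

coefs = ols(rows, scores)
adjusted_group_difference = coefs[1]  # second eye vs first eye, covariate-adjusted
print(round(adjusted_group_difference, 3))
```

Because the example scores contain no noise, the fitted coefficients recover the values used to generate them; with real data the group coefficient would be accompanied by a standard error and confidence interval.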

Responsiveness

A PBM is responsive if changes in the index score reflect known changes in health [20]. These changes are defined using external indicators (anchors) of either clinical or patient-reported change, but they must be relevant to the condition. After surgery, patients completed two questions that asked about their perceived benefit of surgery and change in visual QOL. These were appended to the Cat-PROM5 post-operative questionnaire. These response options were used as anchors and participants were categorised into the following groups:

Perceived benefit of surgery

  • I have gained significant benefit

  • I have not gained significant benefit/I am worse off

Change in visual QOL

  • Visual QOL has improved significantly

  • Visual QOL has not changed by a significant amount/it’s worse

The no change/worsening response options were combined due to few participants reporting worsening [perceived benefit N = 34 (2.81%), visual QOL N = 27 (2.23%)].

Change in visual acuity was used as a clinical anchor. Based on a change in monocular habitual near visual acuity threshold (− 0.2 LogMAR), participants were categorised as either improving or experiencing no change/worsening. This was based on clinical expertise, previous literature [34] and data from the National Ophthalmology Database Audit [23].
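
The categorisation can be illustrated with a short sketch. Lower LogMAR values indicate better acuity, so an improvement appears as a negative change. Treating a change of exactly −0.2 as an improvement is an assumption, as are the function name and the example values.

```python
def classify_va_change(baseline_logmar, followup_logmar, threshold=-0.2):
    """Label change in LogMAR acuity (lower LogMAR = better vision)."""
    # Round to two decimals so floating-point noise does not affect the cut-off.
    change = round(followup_logmar - baseline_logmar, 2)
    # Assumption: a change equal to the threshold counts as improvement.
    return "improved" if change <= threshold else "no change/worsened"


# Hypothetical patients: (baseline LogMAR, follow-up LogMAR).
patients = [(0.6, 0.2), (0.5, 0.4), (0.3, 0.5)]
labels = [classify_va_change(b, f) for b, f in patients]
print(labels)
```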

For each PBM, change in PBM index was calculated between baseline and follow-up for each patient. Mean difference was compared between groups gaining significant benefit and those not changing by a significant amount/worse off using analysis of covariance (ANCOVA). Covariates included age, gender, diabetic status and complications.

Effect sizes were calculated for each PBM index score. These statistics quantify the difference between pre- and post-surgery scores in standardised units, enabling the comparison of the PBMs. Effect size is the change score divided by the standard deviation at baseline (Cohen’s d). Where no comparative data are available, effect sizes can be interpreted using Cohen’s thresholds [27, 28]. These are 0.20 small change, 0.50 moderate change and 0.80 large change [27]. Effect sizes for participants worsening/experiencing no benefit were expected to be less than those experiencing benefit. An effect size smaller than 0.2 would be expected when no change/worsening occurred.
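
The effect size calculation described above can be sketched as below. The use of the sample standard deviation (n − 1 denominator), the function names and the four paired scores are assumptions for illustration.

```python
from statistics import mean, stdev


def effect_size(baseline, followup):
    """Mean change score divided by the standard deviation at baseline."""
    changes = [f - b for b, f in zip(baseline, followup)]
    return mean(changes) / stdev(baseline)  # stdev uses the n - 1 denominator


def interpret(d):
    """Label an effect size using Cohen's thresholds (0.2, 0.5, 0.8)."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "moderate"
    if size >= 0.2:
        return "small"
    return "below small"


# Made-up paired index scores for four hypothetical participants.
baseline = [0.60, 0.70, 0.80, 0.90]
followup = [0.65, 0.75, 0.85, 0.95]
d = effect_size(baseline, followup)
print(round(d, 2), interpret(d))
```

Dividing by the baseline standard deviation, rather than the standard deviation of the change scores, keeps the denominator the same for every anchor group of a given PBM, which is what allows the measures to be compared in standardised units.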

Evaluating the performance of the PBMs

We examined the construct validity and responsiveness of the PBMs using the properties reported in Brazier et al. [20].

The following criteria were tested.

  • Less than 15% of participants would score the maximum or minimum score at baseline.

  • At least moderate correlations (coefficients 0.3–0.5) were expected between generic PBMs and the Cat-PROM5 at baseline.

  • PBMs would distinguish between the following known-groups at baseline: patients with good or poor vision, patients with and without ocular comorbidities and first or second eye surgery.

  • Effect sizes of change would be less than 0.2 for participants experiencing no change or worsening in visual QOL or experiencing no benefit of surgery.

  • Effect sizes of change would be greater than 0.2 for participants experiencing improvements in visual QOL and visual improvements.

Results

3742 potentially eligible patients were approached to participate. Of these, 2230 (59.6%) declined and 6 (0.2%) were not eligible. Of the 1506 who consented, 191 (12.7%) did not complete baseline questionnaires. Table 1 presents baseline sample characteristics on the 1315 study participants. The characteristics appear to be balanced across randomisation groups, although slightly more participants completed the EQ-5D-5L at baseline than the other EQ-5D questionnaires. This was due to lower attrition between randomisation and baseline questionnaire completion in this group. In total, 105 (8%) patients did not provide any follow-up PBM data. The majority of participants were of White British ethnicity, did not have diabetes and were having their first cataract surgery. Approximately a quarter of participants had near visual acuity in their operated eye at baseline that could be described as ‘good’ (visual acuity ≤ 0.3 LogMAR).
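
The participant flow reported above can be checked arithmetically. The short sketch below uses only the counts given in the text; the variable and function names are illustrative.

```python
approached = 3742
declined = 2230
not_eligible = 6
consented = approached - declined - not_eligible

no_baseline = 191
participants = consented - no_baseline
no_followup = 105


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(consented, participants)
print(pct(declined, approached), pct(not_eligible, approached))
print(pct(no_baseline, consented), pct(no_followup, participants))
```

The derived totals (1506 consented, 1315 with baseline data) and percentages agree with those reported.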

Table 1 Baseline characteristics

Descriptive statistics

Descriptive statistics for the PBMs are reported in Table 2. The EQ-5D-3L + VIS had the highest mean PBM index at both timepoints. All mean PBM scores increased between baseline and follow-up. Variability was greatest for the EQ-5D-3L, with the largest standard deviations observed for this measure. Variability was lowest for the ICECAP-O and EQ-5D-3L + VIS.

Table 2 Descriptive statistics of PBM index scores at baseline and follow-up

Floor and ceiling effects

For the EQ-5D-3L, 27.1% (118/435) of participants scored one at baseline (index profile 11111), greatly exceeding the threshold indicating a potentially problematic ceiling effect. The EQ-5D-5L marginally exceeded the 15% threshold also (69/439, 15.7%). Of the generic PBMs, the ICECAP-O had the smallest ceiling effect (123/1308, 9.4%). The EQ-5D-3L + VIS was marginally lower (38/436, 8.7%). A high proportion of participants scored 0.961 on the EQ-5D-3L + VIS (74/436, 17.0%). This corresponds to an index profile of 111112 (‘no problems’ on all EQ-5D-3L domains and ‘some problems’ on the vision bolt-on domain). As usually observed, all PBM distributions were negatively skewed, with more participants reporting good health. No participants scored the lowest possible PBM score.

Convergent validity

Correlation coefficients between the PBMs were all strong (> 0.5; Table 3). Moderate associations were observed between the Cat-PROM5 and the EQ-5D-3L + VIS, EQ-5D-5L-VSE and the ICECAP-O but not the EQ-5D-3L and EQ-5D-5L-CW. As expected, the correlation coefficient between the vision-specific EQ-5D-3L + VIS and Cat-PROM5 was largest. With regard to correlations between PBMs and visual acuity, the relationships were either small (0.1–0.3 for the EQ-5D-3L + VIS, EQ-5D-5L-VSE and EQ-5D-5L-CW) or insubstantial (< 0.1 for the ICECAP-O and EQ-5D-3L).

Table 3 Spearman’s correlation coefficients between PBMs, Cat-PROM5 and visual acuity (baseline data)

Known-groups

For the previous cataract surgery and baseline visual acuity known-groups, the mean differences in PBM scores were small, but in the expected direction (Table 4). For the ocular comorbidities known-group, the mean differences in PBM scores were also small, and not consistently in the expected direction. In almost all analyses the confidence interval spanned zero.

Table 4 Linear regression analyses of known-groups validity

Responsiveness

For the improvement in QOL anchor, the mean difference in scores was in the expected direction for all PBMs, but the EQ-5D-5L mean difference was closer to zero and the confidence interval for that PBM included zero (Table 5). All PBMs identified moderate (EQ-5D-3L + VIS) or small (other PBMs) effect sizes in patients who reported QOL improvements. Unexpectedly, there was a small positive effect size for the EQ-5D-3L + VIS in patients who stated that their QOL had not improved.

Table 5 Responsiveness: comparisons of PBM and Cat-PROM5 change scores between anchors of change

For the perceived benefit of surgery anchor, the mean difference in scores was again in the expected direction for all PBMs. However, the mean difference was largest for the ICECAP-O and that was the only PBM where the confidence interval excluded zero. All PBMs identified small effect sizes in patients who reported significant benefit from surgery. Again, there was a small positive effect size for the EQ-5D-3L + VIS in patients who stated that they had no significant benefit from surgery.

For the visual acuity anchor, mean differences in PBM scores were close to zero. The EQ-5D-3L + VIS and ICECAP-O identified small-positive effect sizes in patients whose visual acuity improved. For patients with little or no improvement in visual acuity, effect sizes were similar across all PBMs.

Summary

The EQ-5D-3L and EQ-5D-5L did not perform well across almost every measure of validity and responsiveness and had the largest ceiling effects (Table 6). The EQ-5D-3L + VIS had a lower ceiling effect and better convergent validity with the Cat-PROM5. It was able to differentiate between patient groups who did and did not report benefit from surgery and improved visual QOL after surgery. However, it also identified small positive effect sizes in patients who reported no benefit or no improved visual QOL after surgery. The ICECAP-O also had a low ceiling effect and there was some evidence of convergent validity with the Cat-PROM5. It performed best on many measures of responsiveness.

Table 6 Summary of PBM performance against criteria evaluated

Discussion

Principal findings

Predict-CAT is a large cohort study that resulted in a detailed dataset describing the patient-reported impact of cataracts before and after surgery. The core EQ-5D measures did not perform well across the tests of validity and responsiveness conducted. There was little evidence that the EQ-5D-5L is more responsive than the EQ-5D-3L. The ICECAP-O was more responsive than the EQ-5D measures to post-operative improvements in visual QOL and the perceived benefit of surgery, although the effect sizes were small. None of the PBMs were responsive to changes in visual acuity.

Strengths and weaknesses

This is the first published use of the ICECAP-O in cataract patients and the first that allows the comparison of the EQ-5D-5L, EQ-5D-3L and the EQ-5D-3L + VIS. This large cohort was mostly representative of UK cataract surgery patients, with a similar median age (75) and baseline visual acuity (0.5 LogMAR) as the UK National Ophthalmology Database Audit 2018 [23] (median age 76.3, visual acuity 0.5 LogMAR). The audit data comprised 50% of UK cataract surgeries undertaken in 2017–2018. Whilst the EQ-5D-3L + VIS has not been used extensively, there is ongoing interest in the development of EQ-5D bolt-ons to fill perceived gaps in the core measures [35]. There is also considerable debate about which EQ-5D version and scoring algorithm should be used to measure self-reported health [36] and to inform decision-making [10]. Another strength is the patient-reported and objective measures of change in visual acuity collected in the study. It could be argued that patient perceived benefits are the outcome that should be targeted, as these might not correspond directly to clinical change. This study was able to test the responsiveness of the PBMs to both of these outcomes, replicating findings that visual acuity is not associated with generic PBMs [37,38,39].

A limitation of the study is that the three versions of the EQ-5D questionnaire were completed by different patient cohorts. If participants were to have completed every questionnaire, response burden would have been excessive. These cohorts were randomly assigned, relatively large and had similar baseline characteristics; nevertheless, it is possible that some observed differences in validity and responsiveness might be due to chance. In addition, the study comprised cataract patients only. All participants had surgery, so experienced some change in clinical condition. Including a control group of cataract patients on the waiting list for surgery might have provided a more robust assessment of responsiveness. When evaluating the PBM performance, judgements were made on a series of thresholds and statistical tests. In some cases, decisions were based on marginal results. Whilst using arbitrary cut-offs is perhaps crude, decisions made on the triangulation of evidence are also subjective [40].

Comparison with existing research

The EQ-5D-3L ceiling effect at baseline was larger than in previous studies in cataract patients (19.3% [41], 23% [42]), although not as pronounced as that observed by Gandhi et al. [18] (51%). The EQ-5D-5L ceiling effect reported by Gandhi et al. [18] (46%) also exceeded the Predict-CAT results. Gandhi et al. [18] reported the performance of the EQ-5D-5L and EQ-5D-3L in a small cohort (n = 148) of cataract surgery patients and similarly found the EQ-5D measures to be inferior to alternative PBMs (HUI3 and SF-6D). The EQ-5D-5L was scored using four algorithms. Gandhi et al. [18] concluded the EQ-5D-5L is the preferred version due to its superior responsiveness (irrespective of scoring algorithm); however, our study does not support this. Gandhi et al. [18] did not examine responsiveness in relation to change in either patient-reported or objectively measured vision. Comparing the EQ-5D-5L scoring algorithms, EQ-5D-5L-VSE utilities were greater than the values obtained with the EQ-5D-5L-CW in our study. This is consistent with published comparisons [43]. Despite the addition of two more response categories, the five-level version showed no advantage over the three-level version. Neither version was consistently better when examining responsiveness, which is compatible with the mixed evidence available thus far [36].

Meaning of the study

This is the first study to administer the EQ-5D-3L, EQ-5D-3L + VIS, EQ-5D-5L and ICECAP-O concurrently and in a longitudinal study. Furthermore, the two available EQ-5D-5L scoring algorithms were applied [8, 9]. The addition of the vision bolt-on appears to improve the responsiveness of the EQ-5D-3L in this patient population; however, it also seems responsive to the process of surgery in the absence of benefit. It was also the only EQ-5D variant to discriminate between participants with poorer and good visual acuity. The anchors used to measure patient-reported change required participants to reflect on their condition pre-surgery. This may introduce recall bias. In addition, assessing responsiveness as the difference between two assessments of ‘health today’ perhaps does not reflect the change attributable to surgery. Complications or other negative experiences might be missed, for example.

The poor association between vision and HRQL highlights challenges interpreting and appraising evidence of construct validity and responsiveness of PBMs. Firstly, PBMs have been valued by the general population, but clinical outcomes and other patient-reported outcomes are not. Associations between these measures might be improbable as a result. Furthermore, PBMs measure aspects of health and wellbeing unrelated to the condition. They are therefore not intended to be strongly associated with condition-specific measures or clinical outcomes. Irrespective of this, there are certain properties that we would expect of a PBM. These include a small ceiling effect among a group of patients seeking care for visual problems affecting their QOL and being able to differentiate between patients who do and do not report improved QOL after a procedure of proven effectiveness, like cataract removal. The problems are probably related, with high ceiling effects for the EQ-5D-5L and EQ-5D-3L leading to a lack of responsiveness.

Unanswered questions and future research

It seems that the EQ-5D-3L or EQ-5D-5L should not be the sole PBMs used in studies evaluating the cost-effectiveness of cataract surgery. This analysis has revealed evidence of limited responsiveness and poor construct validity in both EQ-5D-3L and EQ-5D-5L amongst cataract patients. This should be reflected in interpretations of cost-effectiveness analysis of interventions in cataract surgery patients. There is currently no evidence of the ICECAP-O’s content validity in this patient group. This was not within the remit of this study, but future qualitative work could be conducted to explore this. Future work could explore the suitability and performance of the ICECAP-A [44] in cataract patients given the potential importance of capability wellbeing in this context. The ICECAP-A measures capability wellbeing in adults as opposed to the focus on older adults in the ICECAP-O. Developed using qualitative research with UK adults, the five domains cover similar capabilities to the ICECAP-O, with some items reworded or the focus changed.

Whilst the ICECAP-O seems to be more responsive than the EQ-5D in cataract surgery patients, it cannot be used to generate QALYs. Yet without further methodological developments, neither can the EQ-5D-3L + VIS. There is no five-level vision bolt-on available, meaning a revised bolt-on is required, with concurrent robust valuation and validation. The methodological rigour and resources required to develop and value a bolt-on item challenge the feasibility of developing one for every condition for which the EQ-5D is reportedly unsuitable. The endorsement of the EQ-5D for use in economic evaluation is largely justified by the need for a comparable measure of benefit. Bolt-ons lack comparability with core EQ-5D scores, however. Developing measures that are sufficiently broad to measure health-related wellbeing in all common conditions, without the need for bolt-ons, should be prioritised. Finally, the ability to conclude what the best PBM is in cataract surgery patients should be informed by evidence comparing all PBMs available, such as the SF-6D [45] and HUI [46].

Conclusion

The Predict-CAT study intended to identify a suitable PBM for use in patients undergoing cataract surgery. Referring to the psychometric properties suggested by Brazier et al. [20] for selecting PBMs for cost-effectiveness models, no PBM showed convincing evidence of all properties. While the ICECAP-O appears to be the most responsive generic PBM to improvements in QOL following cataract surgery, evidence of known-groups validity was consistently poor in all PBMs. There was no evidence that the EQ-5D-5L was more responsive than the EQ-5D-3L in cataract surgery patients, despite the increased number of response categories. This study suggests that the generic EQ-5D-3L and EQ-5D-5L may not reflect the patient benefits of cataract surgery when used in CUA. Where data allow, additional analyses using broader outcomes (e.g. ICECAP-O or EQ-5D-3L + VIS) should be presented to enable informed decision-making where CUA using EQ-5D data is recommended.