Background

Rare diseases (RDs) are defined by the European Medicines Agency (EMA) as diseases with a prevalence of fewer than 5 in 10,000 people and affect around 30 million individuals in the European Union (https://ec.europa.eu). These diseases are often severe, life-threatening and/or disabling, and are characterized by an early onset. As such, they negatively affect patients’ and their carers’ quality of life (QoL) [1], which should be appropriately accounted for when assessing the benefit of a new medicine. This is usually done via consideration of patient-reported outcome (PRO) data and/or health state utility values (HSUVs).

According to the US Food and Drug Administration (FDA), PROs are defined as ‘any report of the status of a patient’s health condition that comes directly from the patient, without interpretation of the patient’s response by a clinician or anyone else’, and a patient-reported outcome measure (PROM) as an instrument, usually a questionnaire, that captures PRO data from patients (or proxies) to measure disease impact or drug effects in a clinical trial or other study [2, 3]. For some PROMs, known as preference-based measures, an algorithm has been developed that converts patient responses into HSUVs based on preferences for specific health states derived from the instrument’s dimensions. HSUVs are single numerical values expressing preference-weighted QoL for particular health states and are scored on a scale ranging from 0 (for a state equivalent to ‘dead’) to 1 (for a state equivalent to ‘full health’) [4], although some instruments also provide negative values for health states considered worse than death [5]. Among their applications, they are used for calculating the quality-adjusted life year (QALY), which combines HSUVs and survival data in a single metric and is one of the most preferred outcome measures for health technology assessment (HTA) [4].
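As an illustration of how HSUVs and survival are combined into QALYs, the following minimal sketch sums utility-weighted time over a sequence of health states. The function name `qalys` and the patient profile are hypothetical and are not drawn from any appraisal; discounting, which economic models normally apply, is deliberately omitted.

```python
def qalys(periods):
    """Return undiscounted QALYs for a list of (years, utility) pairs.

    Each pair is the time spent in a health state and the HSUV of that
    state. Utilities may be negative (states valued as worse than dead)
    but cannot exceed 1 (full health).
    """
    total = 0.0
    for years, utility in periods:
        if years < 0:
            raise ValueError("time in a health state cannot be negative")
        if utility > 1:
            raise ValueError("a utility cannot exceed 1 (full health)")
        total += years * utility
    return total


# Hypothetical profile: 2 years at utility 0.5, then 3 years at 0.75.
profile = [(2.0, 0.5), (3.0, 0.75)]
print(qalys(profile))  # 2 * 0.5 + 3 * 0.75 = 3.25
```

In this sketch, five life years yield 3.25 QALYs, which shows why the HSUV assigned to each state can drive the estimated benefit as much as survival does.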

A number of methods have been developed to estimate HSUVs, including direct and indirect approaches (i.e., preference-based PROMs). The most common direct techniques include standard gamble (SG) and time trade-off (TTO), where respondents are asked to choose between life with a lower QoL, and life in ‘full health’ with a risk of immediate death (in SG) or of shorter duration (in TTO) [6]. A simpler approach is represented by the visual analogue scale (VAS), where individuals are asked to rate health by selecting a value on a scale where 0 is the worst state they can imagine and 10 or 100 is the best [4]. In recent years, discrete choice experiments (DCEs) have become frequently used to generate HSUVs by asking individuals to choose between hypothetical health states to elicit their preferred health state and the relative weights for various attributes included within health states [7]. The use of an indirect approach through preference-based PROMs is increasingly common and encouraged by many HTA bodies. Six main preference-based generic measures for use in HTA were identified by a recent review [8]: the EuroQol 5-Dimension (EQ-5D), the Health Utility Index Mark 2 or Mark 3 (HUI2/3), the Short Form 6 Dimension (SF-6D), the Quality of Well-Being (QWB) scale, the 15D, and the Assessment of Quality of life (AQoL). In the United Kingdom, the National Institute for Health and Care Excellence (NICE) recommends the use of EQ-5D converted to QALYs to measure the added benefit of new drugs [9]. In detail, companies, academic groups and others preparing evidence submissions for NICE should use the EQ-5D-3L value set for reference-case analyses. If data were gathered using the more recent EQ‑5D‑5L descriptive system, utility values should be calculated by mapping the 5L data onto the 3L value set, until a new high-quality 5L value set for England becomes available [10].

A further approach that is gaining consensus in research and HTA, also accepted by NICE in the absence of EQ-5D data [11], is ‘mapping’ from non-preference-based measures onto preference-based ones (e.g., EQ-5D) using a previously developed algorithm.
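To make the idea of ‘mapping’ concrete, the sketch below fits a one-predictor ordinary least squares line from a non-preference-based score to an observed utility and then uses it to predict utilities for new scores. This is a simplifying assumption for illustration only: published mapping algorithms typically use several predictors and more flexible models, and the function names and data points here are hypothetical.

```python
def fit_linear_mapping(scores, utilities):
    """Fit utility = intercept + slope * score by ordinary least squares."""
    n = len(scores)
    if n != len(utilities) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(scores) / n
    mean_y = sum(utilities) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    if sxx == 0:
        raise ValueError("scores must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, utilities))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_utility(score, intercept, slope):
    """Predict a utility from a score, capped at 1 (full health)."""
    return min(1.0, intercept + slope * score)


# Hypothetical estimation sample: disease-specific score (0-100) paired
# with a utility observed in the same respondents.
scores = [0, 50, 100]
utilities = [0.2, 0.5, 0.8]
intercept, slope = fit_linear_mapping(scores, utilities)
print(predict_utility(75, intercept, slope))
```

The need for an estimation sample in which both instruments are administered to the same respondents is what makes this approach hard to apply when only a handful of patients can be recruited.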

The estimation of HSUVs in RDs, however, is challenging. This is because direct techniques may be too demanding for patients, who are often young children or mentally disabled, while generic preference-based instruments may not capture all the relevant symptoms and disabilities, and ‘vignettes’ that describe standard hypothetical health states to non-patient populations may not reflect the heterogeneous manifestations of these diseases [12]. Similarly, the use of mapping in RDs is hampered by difficulties in recruiting the sufficiently large samples needed to conduct the regression analysis that derives the algorithm, the limited overlap between the domains included in the disease-specific and generic PROMs, and the poor applicability of algorithms developed for non-rare conditions [13]. Disease-specific preference-based PROMs, which are more sensitive measures, exist for only a few RDs (e.g., the Amyotrophic Lateral Sclerosis Utility Index, ALSUI) [14]. Caregiver QoL is increasingly considered in HTA, and several carer-specific preference-based instruments have been developed (e.g., ASCOT-Carer and CarerQol-7D) [15, 16]. However, their use and consideration in RDs remain limited [17, 18], even though most RDs affect young children, thereby impacting the QoL of those caring for them.

Work Package 10 of the EU Horizon 2020 project IMPACT-HTA (www.impact-hta.eu) aimed to investigate and develop guidance on the use of PRO data and HSUVs in RD treatments for HTA. Results suggested that PROs and HSUVs sometimes fail to demonstrate change in symptomatology or to capture dimensions that really matter to patients, and that their estimates are often uncertain and/or of poor quality due to insufficient evidence. The research also pointed to the likely impact of the nature of RDs on the methodological limitations identified, e.g., collecting PROs from paediatric, cognitively impaired, or heterogeneous populations [12, 19].

When preparing their HTA submissions, manufacturers usually review the available literature to identify and derive the parameters for their economic evaluations and/or to learn about methods, such as existing techniques to estimate HSUVs. Given the poor quality of the QoL evidence included in many HTA submissions [18], the question is whether manufacturers are making full use of the available literature, or whether the poor quality might be a consequence of the intrinsic limitations in applying the existing approaches to PRO and HSUV data generation in RDs. To answer this question, this paper further examined the approaches used to derive the HSUVs in HTA submissions and compared them with the corresponding methodologies and recommendations arising from the literature. Specifically, we reviewed the methods used by manufacturers to derive HSUVs, as reported in NICE’s appraisal reports of orphan drugs, and compared them with all published studies addressing the same RDs, in order to understand whether manufacturers fully exploited the existing literature in developing their economic models and to derive related methodological learnings.

Methods

All treatments with an EMA orphan designation and appraised by NICE within their Technology Appraisal (TA) and Highly Specialized Technologies (HST) programmes until June 2020 were included in the study. Specialized or selected high-cost low-volume treatments for very rare conditions are typically evaluated within the HST procedure, while all other treatments undergo the TA process [20]. We excluded appraisal reports for cancer indications given that QALYs are more likely to be driven by survival rather than QoL increases, and any reports for conditions that are not included in the ORPHANET list of RDs [21].

The publicly available appraisal reports were downloaded from NICE’s website (https://www.nice.org.uk/). These reports include a summary of the evidence submitted by the manufacturer, a review of the manufacturer’s submission by an independent Evidence Review Group (ERG), and the Appraisal Committee’s appraisal of the evidence and final decision. The following information was extracted about the approach(es) adopted by the manufacturer to derive the HSUVs used in the economic model(s): type of technique(s) used, number and type of respondents, and literature sources consulted. Subsequent comments, criticisms, and suggestions made by the ERG and/or Committee relating to the techniques used and the HSUV results were also extracted and summarized in order to gain a better understanding of NICE’s opinion of these approaches and of what it would consider a reasonable way to obtain HSUVs.

We then performed a scoping review of published studies following the Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) checklist [22]. The scoping review was considered appropriate to synthesize the literature as this approach is suggested to map evidence and investigate how research is conducted on a certain topic (i.e., in this study, the estimation of HSUVs in rare diseases) [22, 23]. The aim was to retrieve all primary studies estimating HSUVs in the same indications addressed by the sample of NICE reports identified. An example of the search string used in PubMed is reported below for one NICE appraisal assessing Burosumab for X-linked hypophosphatemia (HST8 [24]):

((Burosumab)[Title/Abstract] OR (X-linked hypophosphatemia)[Title/Abstract]) AND ((health state utility values)[Title/Abstract] OR (utility values)[Title/Abstract] OR (health utilities)[Title/Abstract] OR (preference weights)[Title/Abstract] OR (index values)[Title/Abstract] OR QALYs[Title/Abstract] OR (cost-utility)[Title/Abstract] OR EQ-5D[Title/Abstract] OR EuroQol[Title/Abstract] OR HUI[Title/Abstract] OR (Health Utility Index)[Title/Abstract] OR QWB[Title/Abstract] OR SF-6D[Title/Abstract] OR 15D[Title/Abstract]).

The same string was applied to every drug-disease combination addressed by the NICE reports identified; drugs for the same indication were included in a single string (the full search strategy is reported in Supplementary File 1). The electronic searches were conducted until November 2020. The ScHARRHUD database (https://www.scharrhud.org/), which holds bibliographic details of studies reporting HSUVs, was also searched. All search results were extracted into an Excel spreadsheet and duplicates removed. Titles and abstracts were screened by two independent reviewers and records excluded if they did not meet the inclusion criteria; full-text papers were retrieved in cases of doubt. Any disagreement was resolved by discussion until a consensus was reached.
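The way one template can be reused across drug-disease combinations may be sketched as follows. The helper names are hypothetical, and the sketch wraps every term in parentheses for uniformity, whereas the string shown above leaves some single-word terms unparenthesized; it is an illustration of the template logic rather than the exact script used for the searches.

```python
FIELD = "[Title/Abstract]"

# Utility-related terms taken from the example search string.
UTILITY_TERMS = [
    "health state utility values", "utility values", "health utilities",
    "preference weights", "index values", "QALYs", "cost-utility",
    "EQ-5D", "EuroQol", "HUI", "Health Utility Index", "QWB", "SF-6D", "15D",
]


def or_block(terms):
    """Join terms with OR, tagging each one with the Title/Abstract field."""
    return "(" + " OR ".join(f"({term}){FIELD}" for term in terms) + ")"


def build_query(topic_terms, utility_terms=UTILITY_TERMS):
    """Combine drug/disease terms with the utility terms using AND."""
    if not topic_terms:
        raise ValueError("at least one drug or disease term is required")
    return f"{or_block(topic_terms)} AND {or_block(utility_terms)}"


# Drugs appraised for the same indication go into the same topic list.
print(build_query(["Burosumab", "X-linked hypophosphatemia"]))
```

Keeping the utility terms fixed and varying only the topic list ensures that every indication is searched with the same sensitivity.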

The studies deemed eligible for inclusion were those presenting an original approach for estimating HSUVs in any of the conditions of interest, i.e., we excluded those using estimates from previous studies already in the literature. Studies addressing multiple conditions (e.g., post lung transplantation patients) were included provided that at least one subsample of observations was related to the condition of interest (e.g., cystic fibrosis). Studies of any design (i.e., cross-sectional surveys, cohort studies, clinical trials, and cost–utility models) could be included in the review, provided that they showed an original approach to deriving HSUVs. Studies in languages other than English, conference abstracts and study protocols were excluded. Literature reviews were excluded but their reference lists were manually checked in order to identify any additional original studies not captured by the online searches. Similarly, economic evaluations using the published literature to obtain utility parameters for their analyses were excluded, but their reference lists were checked to avoid missing any relevant publications.

For each study, we extracted detailed information regarding the study characteristics (i.e., study design, setting, country, type and number of participants), the method(s) adopted to estimate HSUVs (e.g., direct approaches, such as SG or TTO, or indirect approaches using preference-based instruments, such as EQ-5D, or mapping), the actual HSUV estimates for the study sample and relevant subgroups, and the authors’ methodological considerations regarding the approaches used to estimate HSUVs (if reported). The data extraction form (in Excel) was piloted in parallel by two reviewers on a sample of ten studies, and subsequently refined and completed independently by the first author. The study characteristics were tabulated and subsequently summarized through descriptive statistics (i.e., frequency distribution).

Then, the methods used by manufacturers to derive HSUVs, as reported in the NICE documents reviewed, were compared with those used in the published literature for each indication considered. Specifically, we aimed to understand whether manufacturers relied on the published studies (in the public domain at least one year before the NICE appraisal date) to retrieve HSUVs, or to replicate the techniques adopted to derive HSUVs. Otherwise, we sought to understand whether divergences from the existing literature were motivated by the limitations highlighted by study authors around the technique(s) adopted and the results obtained. In addition, based on these authors’ methodological considerations, we summarized the key learnings that are important to account for when using each of the available methods to derive HSUVs in individual RDs, and whether the manufacturer’s approach could have better reflected any methodological advice in the existing literature.
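The availability rule (a study counts as available only if it was in the public domain at least one year before the appraisal date) can be expressed as a simple date comparison. The sketch below is one possible reading of that rule; the function names and the dates in the usage lines are hypothetical.

```python
from datetime import date


def one_year_before(d):
    """Return the date one calendar year before d."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        # 29 February has no counterpart in the previous year.
        return d.replace(year=d.year - 1, day=28)


def available_to_manufacturer(publication_date, appraisal_date):
    """True if the study was published at least one year before appraisal."""
    return publication_date <= one_year_before(appraisal_date)


# Hypothetical dates, for illustration only.
appraisal = date(2018, 10, 10)
print(available_to_manufacturer(date(2017, 5, 1), appraisal))  # published in time
print(available_to_manufacturer(date(2018, 1, 1), appraisal))  # published too late
```

The one-year margin allows for the time manufacturers need to build their economic models before submission.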

Results

Synthesis of NICE technology appraisals

Of the 48 appraisal reports identified from the NICE website, we excluded 24 for cancer indications and two additional reports for Crohn’s disease and cytomegalovirus disease, since these conditions are not included in the ORPHANET list of RDs. The final sample included 22 TA/HST reports [24–45] for 19 different indications (one indication could have several treatments appraised), published between 2012 and 2019 (Table 1). Across these 22 manufacturer submissions, 16 (72.7%) derived their own HSUVs. Of these, 12 presented original data collection using a preference-based instrument (i.e., EQ-5D in 10 reports, HUI2 in one report, and HUI3 together with EQ-5D in another one). In eight of these (HST1, HST10, TA266, TA398, HST2, TA431, TA491, TA379), questionnaires were filled in by patients within a clinical study, while in the remaining four (HST6, HST8, HST11, HST12) questionnaires were administered to clinical experts to value hypothetical health state descriptions (i.e., ‘vignettes’). In two reports (TA276 and HST5), HSUVs were derived by mapping non-preference-based measures (Cystic Fibrosis Questionnaire, CFQ and Short Form-36, SF-36) onto EQ-5D. In one report (HST4), HSUVs were obtained through a discrete choice experiment (DCE), and in another (TA467) through an SG exercise, both performed with the general public. Finally, six reports (HST3, HST7, HST9, TA443, TA588, TA606) did not perform any empirical study, and relied only on literature values and/or expert opinion to obtain HSUVs (Fig. 1).

Table 1 Estimation of HSUVs in RDs: a review of NICE TA/HST guidance documents
Fig. 1 Synthesis of methods used by manufacturers to obtain HSUVs (as reported in NICE TA/HST guidance documents)

In total, the use of ‘vignettes’ was reported in five cases (22.7%), of which four (HST6, HST8, HST11, HST12) used EQ-5D or HUI2/3 and one (TA467) used SG. Ten reports (HST9, TA266, TA398, HST2, HST7, TA588, TA606, TA443, TA491, HST3) referred to published studies to obtain some or all HSUVs for their analyses. Cumulatively, 15 cases (68.2%) reported EQ-5D utilities obtained with different approaches: seven (HST1, HST2, HST10, TA379, TA398, TA431, TA491) collected data directly from patients, four (HST6, HST8, HST11, HST12) used ‘vignettes’ valued by clinical experts, two (HST9, TA606) referred to published studies, and another two (HST5, TA276) used ‘mapping’.

The Committees and/or the ERG appraised six cases positively, where the HSUVs were derived either from EQ-5D data collected within a trial, or from different approaches considered to be reasonable (HST1, TA398, TA606, TA443, TA491, TA379). In 10 cases (HST2, HST3, HST5, HST6, HST8, HST9, HST11, HST12, TA431, TA588), the Committees highlighted several limitations in the approach used by the manufacturer (i.e., surveys of clinical experts instead of patients, small samples, EQ-5D value sets from other countries, data collected within observational studies instead of trials, vignettes with unclear disease elements) but recognized that these could be acceptable with some adjustments and/or cautious consideration of results. In the remaining six reports (HST4, HST10, TA266, TA276, HST7, TA467), the Committees were more sceptical about the approach adopted, considering that it might yield unrealistic HSUV estimates. Consequently, they often preferred to rely on alternative de novo approaches developed by the ERG.

Literature search results

In total, the search strategy in PubMed identified 7445 articles. After removing 190 duplicates, 7255 records were screened by title/abstract and 7105 were excluded in this first phase. Subsequently, 150 full-text articles were retrieved, and a further 68 records were excluded for not complying with the inclusion criteria. The two main reasons for exclusion were that the study evaluated QoL but did not estimate HSUVs, or that the study used HSUVs from the literature. Accordingly, 82 studies were selected, plus 29 studies identified through manual searches, resulting in a total of 111 studies included in the review [46–156]. No additional articles were obtained from the ScHARRHUD database (Fig. 2).
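The study-selection counts follow from simple subtraction at each stage, which the short sketch below reproduces as a consistency check. The function name and dictionary keys are illustrative labels rather than PRISMA terminology; the input figures are those reported in the text.

```python
def prisma_flow(identified, duplicates, excluded_screening,
                excluded_full_text, manual_additions):
    """Derive the record count remaining at each stage of study selection."""
    screened = identified - duplicates
    full_text = screened - excluded_screening
    from_database = full_text - excluded_full_text
    included = from_database + manual_additions
    return {
        "screened": screened,
        "full_text": full_text,
        "from_database": from_database,
        "included": included,
    }


flow = prisma_flow(identified=7445, duplicates=190, excluded_screening=7105,
                   excluded_full_text=68, manual_additions=29)
print(flow)
```

With the reported inputs, the derived counts are 7255 records screened, 150 full texts assessed, 82 studies retained from the database search, and 111 studies included overall.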

Fig. 2 PRISMA 2009 flow diagram showing study selection

Synthesis of published studies

The 111 studies estimated HSUVs in the same RDs addressed by the NICE reports (Tables 2, 3). One study [152] was counted twice since it addressed two different conditions, giving 112 study entries for the analyses that follow. At least one study was identified for each RD addressed in the NICE reports, with the exceptions of paediatric-onset hypophosphatasia and neuronal ceroid lipofuscinosis. The highest number of publications was retrieved for cystic fibrosis (n = 36, 32.1%), idiopathic pulmonary fibrosis (n = 16, 14.3%), and Duchenne muscular dystrophy (DMD, n = 10, 8.9%).

Table 2 Estimation of HSUVs in RDs addressed by NICE TA/HST guidance documents: results from the literature review
Table 3 Synthesis of included studies (n = 112a)

In terms of study location, 27 (24.1%) were international studies involving two or more countries. Among the single-country studies, 23 (20.5%) were conducted in the UK, 22 (19.6%) in the US, eight (7.1%) in the Netherlands, six (5.4%) in Sweden, five (4.5%) in Canada, three (2.7%) each in Australia, Finland and Germany, two (1.8%) each in France, Spain and Japan, and one (0.9%) each in Switzerland, Ireland, Bulgaria and Portugal; in two cases (1.8%), the study location was not specified. The great majority of studies (n = 78, 70%) were published after 2010 (Fig. 3). Over half of the studies (n = 64, 57.1%) aimed to evaluate QoL using preference-based questionnaires that incidentally also allowed HSUVs to be derived; of these studies, 22 (19.6%) were clinical studies using QoL as a secondary endpoint, 30 (26.8%) primarily aimed to estimate QoL, and 12 (10.7%) assessed QoL and economic burden together. Only 15 (13.4%) explicitly aimed to estimate HSUVs. The remaining studies aimed to validate existing questionnaires in a specific patient population (n = 16, 14.3%), to calculate the cost-effectiveness of treatments (n = 14, 12.5%) or to quantify caregiver burden (n = 3, 2.7%).

Fig. 3 Number of studies by publication year

The majority of studies (n = 72, 64.3%), all published from 2000 onward, used one of the EuroQol instrument versions (i.e., EQ-5D-3L, EQ-5D-5L or EQ-5D-Y) to estimate HSUVs. The QWB scale and HUI2/3 were chosen as preference-based instruments by 11 (9.8%) and 14 (12.5%) studies respectively, although the former was never used after 2005, while the latter was more evenly distributed over time. Cumulatively, 97 studies (86.6%) used at least one preference-based instrument. Direct utility estimation methods were less frequently adopted: 11 (9.8%) studies used SG, ten (8.9%) used the VAS, seven (6.3%) used TTO, and one (0.9%) used the risk–risk trade-off, for a total of 14 studies (12.5%) using at least one of these approaches. In three studies (2.7%), caregiver utility was estimated through the CarerQol-7D/-VAS. In three studies (2.7%) HSUVs were mapped from non-preference-based measures, while in another three (2.7%) the authors developed an original approach to obtain HSUVs, such as adjusting data from the literature.

The great majority of studies (n = 96, 85.7%) surveyed patients to obtain HSUVs; when the patient was a child, parent/caregiver support or proxy reporting was used in 11 (9.8%) studies. Caregivers also reported on their own health in 12 studies (10.7%) that aimed to quantify the caregiver’s burden, exclusively or in addition to the patient’s burden. Finally, very few studies recruited other types of respondents, such as clinical experts (n = 6, 5.4%) or the general public (n = 5, 4.5%), to derive HSUVs using proxy reporting (e.g., paediatricians valuing their patient’s health) or ‘vignettes’. In particular, the use of health state descriptions or case studies/histories (‘vignettes’) was reported by only nine studies (8.0%). The sample size was classified as small (< 100 participants) in 42% of the studies (n = 47).

Synthesis of authors’ methodological considerations

Overall, 57 (out of 111) studies retrieved from the literature discussed the pros and cons of the techniques adopted to estimate HSUVs. Their authors’ methodological considerations are reported verbatim in Table 2. Of these, 22 (38.6%) were studies assessing QoL through preference-based PROMs (with or without economic burden), 14 (24.6%) estimated HSUVs as a specific objective, and 13 (22.8%) assessed the validity or feasibility of questionnaires. This section presents a synthesis of these authors’ statements for each rare disease covered by the literature review, with the exception of three (i.e., atypical haemolytic uremic syndrome, severe combined immunodeficiency, and Waldenstrom’s macroglobulinaemia), for which no studies commented on the methods to estimate HSUVs.

Cystic fibrosis

Of the 22 studies providing methodological considerations, half (n = 11) commented on the use of the EQ-5D questionnaires (i.e., EQ-5D-3L/-5L and EQ-5D-Y) to estimate HSUVs. The overall evaluation was positive in seven studies [49, 50, 55, 59, 66, 67, 75], where EQ-5D was described as a valid, easy, and cheap generic instrument reflecting disease progression and severity, as well as allowing cross-disease comparisons and country-specific cost-effectiveness analyses. Conversely, four studies [48, 57, 81, 139] revealed its high ceiling effects and poor sensitivity in capturing symptoms (particularly, the self-care domain was considered irrelevant), and encouraged the use of alternative generic instruments (e.g., HUI3 and SF-6D), or, preferably, the development of a disease-specific preference-based measure (also including a respiratory domain).

The remaining 11 studies advised on alternative techniques to derive HSUVs. The judgment on QWB was mixed, with some studies [69, 74, 116] highlighting its low correlation with other measures of physical and emotional functioning, poor sensitivity in detecting changes in functional domains, or the limitations of its scoring system, while others [116, 121, 122] argued that it showed responsiveness and validity in tracking changes in patients’ health, even during pulmonary exacerbations, and correlated significantly with measures of performance and pulmonary function. Similarly, the HUI was found to include relevant dimensions for cystic fibrosis patients, such as hearing and fertility [91, 153], but to fail to discriminate between severity levels, and to be too difficult for paediatric patients to complete, thus requiring parental proxy reporting [125, 153]. The only study [77] using CarerQol-7D showed that this instrument is easy to use and shows some external validity. Finally, two studies adopting direct methods to estimate HSUVs reached opposite conclusions, as one [62] argued that SG tends to produce higher values than other techniques, whereas the other [148] argued that, because it assumes individuals are risk neutral, it is likely to underestimate utilities in patients who are more willing to accept risk, such as those waiting for lung transplantation. In one study [116], the small number of patients (n = 20) limited the investigation of the responsiveness, validity and reproducibility of the instruments adopted.

Idiopathic pulmonary fibrosis

The judgment on EQ-5D was overall positive, with several studies [58, 85, 142, 143] reporting that the instrument was sensitive to short-term health status changes and comparable to disease-specific questionnaires (e.g., K-BILD), and also allowed comparison with different lung diseases and the general population. However, EQ-5D also presented a significant ceiling effect in mild disease and may not fully capture disease-specific issues [58, 143]. Similarly, the QWB was considered not sensitive enough to distinguish between heterogeneous disease manifestations and not correlated with physiological measures (e.g., dyspnoea scale) [65, 145].

Duchenne muscular dystrophy

The authors’ considerations on preference-based PROMs were generally negative since both EQ-5D and HUI lacked sufficient sensitivity to be used in mobility impaired patients [60]. In particular, the EQ-5D has no consideration for mobility beyond walking, thus underestimating the utility reduction for DMD children [60] and was considered insensitive in this condition [99]. The HUI did not correlate with caregiver’s rating of the patient’s health [100], and given its generic nature, was not considered to be applicable to patients with heterogeneous DMD presentations [97]. Conversely, one study [101] reported that the Duchenne Muscular Dystrophy Functional Ability Self-Assessment (DMDSAT) mapped well onto HUI in this patient population.

Other diseases

In Fabry disease, three studies highlighted the limitations of using EQ-5D-3L, such as its inability to detect small health changes, because it has only three levels per domain [51], and to capture real health state changes, rather than patients’ perceptions of their health state [88]. Moreover, because of the disease’s rarity, it was necessary to pool questionnaires in different languages to obtain a sufficiently large sample [128].

Conversely, two studies performed in hereditary transthyretin amyloidosis argued that EQ-5D showed validity and appropriateness, in spite of difficulties due to disease rarity and small samples [89, 119].

The three studies addressing inherited retinal dystrophies (IRDs) used very different approaches. One study [61] argued that TTO is easier for patients to understand than SG, and that the latter also overestimates risk aversion, since for each visual acuity stage a lower percentage of patients was willing to risk death (SG) than to trade time of life (TTO) to return to perfect vision. The second study [70] reported that EQ-5D showed significant ceiling effects and may not be suitable for patients with genetic eye diseases, despite presenting the advantage of allowing comparison among alternative conditions. The third [107] reported that, due to the ultra-rarity of the disease, the study could not recruit patients and instead developed vignettes to describe IRD health states, which were subsequently evaluated by six retina specialists using EQ-5D-5L. The authors’ judgment on this approach was overall positive, since descriptions of IRDs may not be easily understood by the public, and the qualitative research that informed the vignettes valued by clinical experts supported the face validity of the HSUVs obtained.

Among the studies performed in mucopolysaccharidosis type IVA (also known as Morquio A), one [86] reported that EQ-5D-5L was not sensitive enough in assessing pain severity when compared with the Brief Pain Inventory, while another [126] acknowledged having used it because it was recommended by international guidelines. A further study [87] showed that medical experts (n = 6) and students (n = 93), used as surrogate respondents for patients who are often children with intellectual disability, reported comparable results using direct techniques. The authors also argued that indirect methods (e.g., EQ-5D) could be more practical with larger samples, although their validity in mucopolysaccharidoses had not yet been demonstrated.

The only study [80] addressing X-linked hypophosphatemia reported that EQ-5D-5L may not capture disease-specific problems and that preferences in patients with rare bone diseases may not be aligned with those of the general population, which informed the instrument’s valuation system.

In spinal muscular atrophy (SMA), one study [56] discussed the pros and cons of using HUI3, which measures the ability to carry out daily activities (e.g., dressing) that are relevant for physically impaired SMA patients, but also dimensions such as seeing and hearing that are not impacted by the disease. A further study [106] highlighted the limitations of using vignettes to develop SMA case histories and asking clinical experts to value them using EQ-5D-Y. Indeed, standard descriptions of SMA states do not capture the heterogeneous manifestations of the disease, and the accuracy of HSUVs depends on the validity or accuracy of such descriptions and their proxy evaluation. However, the EQ-5D-Y has not been validated in children as young as patients with infantile-onset SMA, and using physicians for proxy reporting avoids some of the bias that parental judgment may introduce. Finally, a third study [156] highlighted the limitations of applying an existing mapping algorithm to SMA patients.

The only study dealing with severe refractory eosinophilic asthma [108] used both the generic EQ-5D and the disease-specific Asthma Symptom Utility Index (ASUI), commenting that the latter may show a floor effect for patients with severe asthma.

In type 1 Gaucher disease (GD), the four studies providing methodological considerations used direct techniques to estimate HSUVs. The first [68] revealed that, among the techniques used by healthy volunteers, chronically ill patients and GD patients to value vignettes, the risk–risk trade-off yielded significantly lower utilities than SG and TTO and showed the poorest test–retest reliability. The second study [79] showed that the willingness-to-pay (WTP) approach (used to determine the maximum amount healthy volunteers were willing to pay for an insurance scheme covering the treatment cost) produced results aligned with SG (used to determine the maximum risk of death they were willing to bear to avoid living with GD), and may be well suited to eliciting preferences for chronic conditions where it is unrealistic to trade off life duration (as in TTO) or risk of immediate death (as in SG). The third study [84] pointed out the limitations of using health state descriptions that were developed without direct input from GD patients and valued by the general population. The fourth study [104] reported that using different search procedures (i.e., the method, titration or “ping-pong”, used to find the point of indifference) significantly impacted the HSUVs estimated by healthy volunteers.

The only study providing methodological considerations in primary biliary cholangitis suggested that HUI performed well, since it was able to discriminate between patients with mild and severe disease [154].

The only study [138] addressing limbal stem cell deficiency (LSCD) pointed out the limitations of using vignettes to be valued by members of the general public, since some participants failed to correctly interpret LSCD descriptions, and there was limited overlap between the EQ-5D-3L dimensions and LSCD vignettes, despite the inclusion of the vision bolt-on. However, given the rarity of the condition and the small sample achievable, deriving utilities directly from patients would have provided unreliable estimates.

In hereditary angioedema, a study mapping the HAE-BOIS survey items onto EQ-5D-3L reported that the items selected for manual cross-walking were conceptually overlapping with EQ-5D dimensions [53]. The two studies by Nordenfelt et al. [117, 118] reported that EQ-5D-5L is suitable to assess QoL in this condition, and also includes the pain element, as opposed to disease-specific instruments, such as AE-QoL.

Comparison between NICE reports and published studies

Table 4 compares the approaches used to derive HSUVs in the manufacturers’ NICE submissions (as reported in TA/HST guidance documents, n = 22) with those of the existing published studies (n = 111) that estimated HSUVs for each individual rare disease considered in this review, irrespective of the publication date and the consequent availability (or not) to manufacturers for developing their economic models.

Table 4 Comparison between manufacturer and literature approaches to estimate HSUVs in RDs

The use of ‘vignettes’ valued by non-patient populations (i.e., clinical experts or members of the public) was more frequent in NICE technology appraisals than in the literature (5/22, 22.7% vs. 9/112, 8.0%), as was the use of ‘mapping’ from non-preference-based measures (2/22, 9.1% vs. 3/112, 2.7%). In 36.4% (8/22) of cases, the manufacturers used preference-based PROMs (i.e., EQ-5D and/or HUI) collected from patients (and/or caregivers), while this methodology (alone or in combination with others) was adopted by 83.0% (93/112) of published studies. Moreover, the literature presented a much wider range of preference-based PROMs, also including the generic QWB, SF-6D, and 15D, the CarerQol-7D (specifically for caregivers), and the disease-specific ASUI. The use of direct methods (e.g., SG, TTO, VAS, RRTO) was reported by only one manufacturer submission (1/22, 4.5%) versus 14 published studies (12.5%). One NICE document (4.5%) reported on a DCE performed with the general public, while this technique was not found in the literature reviewed. Finally, in ten NICE appraisals (45.5%) the manufacturers obtained utilities from the literature.
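The shares quoted above can be recomputed from the reported counts. The following is a minimal sketch in Python, assuming denominators of 22 submissions and 112 studies as stated in the text; the dictionary labels and the helper name `share` are illustrative and are not part of the original analysis. Mapping is left out of the sketch.

```python
# Counts as reported in the text: (NICE submissions, published studies).
# Labels and names below are illustrative assumptions, not the authors' code.
N_SUBMISSIONS = 22
N_STUDIES = 112

counts = {
    "vignettes valued by non-patients": (5, 9),
    "preference-based PROMs from patients/caregivers": (8, 93),
    "direct methods (SG, TTO, VAS, RRTO)": (1, 14),
    "DCE with the general public": (1, 0),
}


def share(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


for approach, (nice, literature) in counts.items():
    print(
        f"{approach}: NICE {share(nice, N_SUBMISSIONS)}% "
        f"vs literature {share(literature, N_STUDIES)}%"
    )
```

Because approaches could be combined within a single submission or study, the categories are not mutually exclusive and the shares need not sum to 100%.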

In relation to individual RDs, a high level of agreement in the methodological choices was observed in atypical haemolytic uraemic syndrome, where the NICE technology appraisal (HST1) and the three studies retrieved from the literature [76, 103, 105] collected EQ-5D directly from patients, as well as in severe refractory eosinophilic asthma, where original utilities were estimated using EQ-5D in TA431, as in the study by Lloyd [108] with the addition of ASUI. Similarly, in Waldenstrom’s macroglobulinaemia, both the NICE report (TA491) and the published study [73] collected EQ-5D-5L from patients, as well as in idiopathic pulmonary fibrosis, where TA379 and 14 (out of 17) studies used EQ-5D (either 3L or 5L). In cystic fibrosis, the use of preference-based instruments (i.e., EQ-5D and HUI2) and mapping from CFQ was common to the technology appraisals (TA266, TA276, TA398) and most of the studies retrieved (22 out of 36).

Conversely, low agreement between the approaches used in the NICE report and in the literature was found for Fabry disease, where the former (HST4) performed a DCE with members of the general population, while the latter collected EQ-5D-3L in eight cases [51, 54, 88, 114, 115, 128, 131, 144] (and, in one study [152], also SF-36 to be valued using SF-6D). In addition, in X-linked hypophosphatemia, the technology appraisal (HST8) estimated HSUVs using vignettes valued by clinical experts through EQ-5D-5L, while in the only study retrieved [80] the same instrument was used to collect data directly from patients. In spinal muscular atrophy, the technology appraisal (TA588) relied on expert opinion and the literature to derive HSUVs, while a variety of approaches (including preference-based instruments collected from patients, caregivers, or clinical experts using vignettes) were presented by the five studies retrieved [56, 64, 106, 110, 156].

In Gaucher disease, the manufacturer’s approach (HST5) consisted of mapping SF-36 onto EQ-5D using a published algorithm; conversely, three published studies [71, 147, 152] collected preference-based measures from patients, while another four [68, 79, 84, 104] used vignettes valued with various direct techniques. In DMD, the manufacturer (HST3) derived utilities from the literature (because the algorithm available to convert the PedsQL data collected in the clinical study into EQ-5D utilities was developed in a healthy population), while several preference-based measures (i.e., EQ-5D, EQ-5D-Y, HUI2/3, CarerQol-7D) were used in eight (out of 10) published studies [60, 63, 97, 98, 99, 101, 102, 123].

Degree of exploitation of the existing literature in manufacturers’ submissions

In Table 5, the methods chosen by manufacturers to obtain HSUVs are compared with: (1) the alternative techniques recommended by the NICE ERG and/or Committee (excluding suggestions of minor adjustments that do not alter the main technique, as reported in Table 1) and (2) the techniques adopted in studies that were published at least one year before the TA/HST date (and were, therefore, presumably available to manufacturers at the time of drafting their submission), together with a synthesis of the study authors’ methodological considerations.
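The one-year availability rule can be illustrated with a short sketch. This is a minimal Python example, assuming that full publication and appraisal dates are known; the function names, study labels and all dates are hypothetical and serve only to show how the cut-off operates, not how the authors applied it.

```python
from datetime import date


def cutoff_date(appraisal_date, lag_years=1):
    """Return the latest publication date that still counts as available."""
    try:
        return appraisal_date.replace(year=appraisal_date.year - lag_years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year: use 28 February.
        return appraisal_date.replace(year=appraisal_date.year - lag_years, day=28)


def presumably_available(publication_date, appraisal_date, lag_years=1):
    """True if the study was published at least `lag_years` before the appraisal."""
    return publication_date <= cutoff_date(appraisal_date, lag_years)


# Hypothetical appraisal date and hypothetical publication dates.
appraisal = date(2019, 7, 24)
studies = {
    "study A": date(2017, 3, 1),
    "study B": date(2018, 7, 24),
    "study C": date(2019, 1, 15),
}

available = [
    name
    for name, published in studies.items()
    if presumably_available(published, appraisal)
]
print(available)
```

In this sketch a study published exactly one year before the appraisal date is treated as available; when only the publication year is known, a coarser comparison by year would be needed.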

Table 5 Synthesis of techniques used by manufacturers, suggested by ERG/Committee, and used by published studies to estimate HSUVs

Overall, 72 studies (out of 112, 64.3%) were assumed to be available to manufacturers at the time of their drug submissions. In detail, five studies [75, 102, 117, 150, 154] included in our literature review were explicitly mentioned by the manufacturers as a source to obtain utilities that were not collected in their clinical studies (HST3, TA398, TA443, TA606), or to derive published coefficients [75] to perform a mapping exercise (TA276). In three cases [75, 117, 154], the studies provided positive recommendations regarding the instruments used (i.e., EQ-5D-Y, EQ-5D-5L and HUI2), while in the remaining two [102, 150] no considerations were given by the authors. The report TA443 referred to another study (full reference not provided), already used in a TA on sofosbuvir for treating chronic hepatitis C. One report (HST9) used EQ-5D utilities estimated in a previous study (full reference not provided) on the same condition (i.e., hereditary transthyretin amyloidosis), but with a different country value set. One report (TA491) addressing Waldenstrom’s macroglobulinaemia referred to two published studies [157, 158] on chronic lymphocytic leukaemia (which were not captured by our review, since we excluded cancer). In four cases (TA266, HST2, HST7, TA588), no details were given on the literature source used.

Overall, the association between study authors’ methodological considerations and manufacturers’ choices is not straightforward. In some cases, such as TA398, the use of EQ-5D is corroborated by several studies [49, 50, 59, 67] arguing that the instrument is valid and responsive in cystic fibrosis. Similarly, in idiopathic pulmonary fibrosis, five (out of seven) studies available at the time of the NICE assessment report (TA379) used EQ-5D, and one [85] also confirmed its sensitivity in this condition. In other cases, such as HST2 and HST8, the manufacturers adopted EQ-5D despite the negative judgment reported in the literature, which showed that it did not correlate with pain measures in mucopolysaccharidosis type IVa [86] or considered it unable to capture specific issues in X-linked hypophosphatemia [80]. In IRDs, although EQ-5D was shown to present significant ceiling effects and was considered unsuitable for patients with eye conditions [70], the manufacturer (HST11) used both EQ-5D and HUI3, while the ERG suggested referring to a previous study deriving TTO values from the general public. In these cases, the choice of using EQ-5D is more likely motivated by the willingness to comply with NICE’s recommendations than by the considerations reported in the literature. The first report on cystic fibrosis (TA266), in using the HUI3, might have followed some positive judgments on this instrument reported by studies available at that time [50, 91, 125, 153], which highlighted its responsiveness, ease of understanding, and inclusion of relevant items, such as hearing and fertility, while the evidence on the use of EQ-5D was still scarce. Finally, there are a few cases where the manufacturers’ methodological choices seem completely unrelated to the literature. For example, in HST4, a DCE with the public was presented, in spite of the availability of multiple studies using EQ-5D and the ERG’s suggestion to use (unspecified) values from the literature.

Discussion

Synthesis of results

This study aimed to identify, examine and compare the approaches used in NICE technology appraisals and in the published literature on selected RDs, in order to understand the level of agreement between the two and whether the manufacturers fully exploited the existing literature in developing their HTA submissions. A total of 22 TA or HST appraisals encompassing 19 different conditions were downloaded from the NICE website. In a substantial number of submissions (10/22, 45.5%), the manufacturers used HSUVs from the literature, either as the only approach or combined with primary utility data collected in clinical trials, to inform specific health states (e.g., lung transplantation in the reports on cystic fibrosis, TA266 and TA398). However, in more than one third of submissions (8/22), preference-based instruments (EQ-5D or HUI2) were collected from patients during clinical studies. The comments of NICE committees on the approach adopted by manufacturers to obtain utilities for their models were heterogeneous; as expected, more appreciation was expressed when the manufacturers collected EQ-5D directly from patients during a clinical trial (which is NICE’s recommended approach [9]). In a few cases the ERG suggested a completely different approach to estimate HSUVs (e.g., in TA276, using literature values instead of mapping), while more often it recommended only minor adjustments to the manufacturer’s choices.

A total of 111 published studies were identified from the literature, of which one was counted twice because it addressed two different conditions in two separate article sections; therefore, data analysis was performed on 112 studies. One third were related to cystic fibrosis, which, with a current prevalence of 11.1 per 100,000 [21], is one of the most common RDs. Almost one fourth were conducted in multiple countries, which is frequent in RDs due to the wide geographical dispersion of patients [12]. More than two thirds were published in the last decade (2011–2020), which demonstrates growing attention to QoL in RDs (and in general), and to its incorporation into HTA appraisals [16, 19].

A total of 97 studies (86.6%) used at least one preference-based instrument to estimate HSUVs. Of the six main generic PROMs reported in the literature [8], all were used except for the Australian AQoL. The most frequently adopted by far (in 64.3% of studies, 72/112) was a EuroQol questionnaire (EQ-5D-3L, EQ-5D-5L or EQ-5D-Y), followed by HUI2/3 (14/112, 12.5%). The three-level version of EQ-5D (i.e., EQ-5D-3L) was first introduced by EuroQol in 1990 (www.euroqol.org), but was not used in the studies included in this review before 2000. The HUI, introduced in 2003, has been rarely but constantly adopted in the included studies since the beginning of the new millennium. The QWB, developed in the 1970s, was adopted by 11 studies (9.8%), all published before 2005, and was likely replaced subsequently by more recent instruments. The 15D was used only by three Finnish studies [46, 47, 83], and the SF-6D by one study only [152] (which addressed two different RDs), together with EQ-5D. The use of direct techniques was much less frequent; indeed, only 14 studies (12.5%) relied on these methodologies, of which 11 (9.8%) used SG. Finally, only three studies (2.7%) [48, 53, 101] performed a mapping exercise to obtain HSUVs from non-preference-based measures. This is not surprising, given the limitations related to the use of mapping in RDs that emerged from our previous review [13].

In 11 studies (9.8%), parents (or other caregivers) were involved in proxy-reporting on patients’ status in addition to or in place of their children (especially if aged below 12). This review confirmed a growing consideration of the well-being of caregivers and, more generally, families in research [16], since 12 studies (10.7%) interviewed the main carers about their own status, and also in HTA [15], as observed in three NICE appraisals (TA588, HST3, HST12) and in the related positive comments from the ERG. Fewer than 100 participants were enrolled in 42% of studies, which is frequent in RDs, where small-scale studies are usually conducted [12, 16, 19].

The degree of overlap between the methods adopted by manufacturers for their submissions to NICE and those used in published studies was quite limited and varied considerably according to the disease considered. In particular, the use of ‘vignettes’ valued by clinical experts and reliance on expert opinion to estimate HSUVs was much more frequent in manufacturers’ submissions than in the published literature (5/22, 22.7% vs. 9/112, 8.0%), where the great majority of studies (85.7%) surveyed patients directly to value their own status. It is worth noting that four (out of five) NICE technology appraisals reporting the use of vignettes were HST targeting ultra-RDs, in which it may be difficult to recruit patients. The reasons for diverging from the existing literature may vary. In some cases (e.g., HST6 and HST12), there were no published studies to refer to, while in others (e.g., HST1 and HST8) very few studies were available at the time of submission to NICE. Again, this situation was much more common for HST targeting ultra-rare diseases. In the submission related to IRDs (HST11), the company expressed a preference for HUI3 instead of EQ-5D (recommended by NICE and used by a previous study [70]), because the former contains a vision component while the latter is known to have poor validity in eye disorders [30]. In other reports, the patients’ characteristics, in terms of age or disease subtype, might have differed from those of the samples enrolled in published studies, thus inducing manufacturers to perform a new data collection or embrace different approaches. For example, the eculizumab submission (HST1 [26]) also included paediatric patients (0–11 years), while the study retrieved in atypical haemolytic–uremic syndrome recruited only patients above 12 [103]. Similarly, burosumab (HST8 [24]) was authorized for children and adolescents, while the corresponding study on X-linked hypophosphatemia was focused on adults [80]; this might have incentivized the manufacturer to conduct an ad hoc vignette study where clinicians, instead of patients, completed the EQ-5D. Overall, it is difficult to identify a clear pattern in manufacturers’ choices, and in several cases they seem unrelated to the literature. The choice of using EQ-5D, indeed, is more likely motivated by the willingness to comply with NICE methods guidance [9]. In the appraisal reports reviewed, manufacturers rarely referred to (or commented on) the available literature on HSUVs. In a few cases, the company admitted using literature data because QoL or utility data were not (or not sufficiently) collected in clinical trials (HST7, TA606, TA443), or because there were no suitable mapping algorithms to convert the PROMs used in the clinical studies (e.g., Norfolk QOL-DN, PedsQL) to EQ-5D or other utilities (HST3, HST9, HST11). In other cases (HST8, HST11), the manufacturers reported the lack of literature data (and/or mapping algorithms) and, therefore, the necessity to commission an ad hoc vignette study. Future research should adopt qualitative methods (e.g., focus groups, interviews) to investigate the underlying reasons why manufacturers exploited (or not) the literature to inform their new drug submissions.

Limitations

Our research has several limitations. First, in order to limit the number of studies to be included in the literature review, we deliberately excluded those using secondary sources to derive HSUVs. Thus, we could not compare NICE technology appraisals and published studies in terms of how frequently this approach was used. Second, we only searched two databases (PubMed and ScHARRHUD), while using other sources (e.g., Embase) and checking the reference lists of the included studies could have provided additional results. Third, except for the pilot phase, the data extraction was performed by a single author, whereas it should be conducted by at least two people [159]. Fourth, we did not perform any quality assessment of the included studies; however, this task is considered optional by PRISMA-ScR (item 12) [22]. Fifth, although we considered studies published at least one year before the NICE report date to understand whether manufacturers fully exploited the literature, it is possible that they develop their economic analyses more than one year ahead of the submission to NICE, in which case they would not be able to consult and reference some of the studies used for comparing the approaches. Therefore, the comparison between NICE reports and the literature should be interpreted with caution and should not be viewed as direct criticism of manufacturers, but rather as an encouragement to use different approaches or values existing in the literature in future submissions, given the availability of estimates and the methodological considerations discussed by authors.

Key learnings

This review provided several insights for research and HTA. The critical issues highlighted in previous studies undertaken within the IMPACT-HTA project [12, 13, 19] and related to the general use of PROMs and the estimation of HSUVs in RDs were confirmed by the manuscripts reviewed, as well as by their authors’ considerations. First, due to the low prevalence of RDs, the samples recruited in studies were generally small, which might have affected the validity of results. In several cases, the ultra-rarity of the condition (e.g., IRDs) hampered the enrolment of a representative patient sample and required the use of ‘vignettes’. The wide use of non-patient populations is a key issue in the estimation of HSUVs in RDs, since most patients are children or cognitively impaired, and cannot perform complex tasks, such as those required by direct techniques. Second, even in studies using simpler indirect approaches (e.g., EQ-5D), patients were often children who necessarily required parental support to fill in the questionnaire used to obtain HSUVs. The use of paediatric versions of instruments (e.g., EQ-5D-Y) can overcome the issue only partially, since they do not apply to mentally disabled or very young children.

Third, despite their advantages (i.e., simplicity, low cost, comparability across different conditions), preference-based PROMs showed poor sensitivity in several RDs, due to the heterogeneity and peculiarities of these conditions. For example, the EQ-5D does not consider mobility issues beyond walking, which matters in conditions implying severe physical impairment (e.g., DMD), nor does it include visual domains for eye conditions (e.g., limbal stem cell deficiency), unless the vision bolt-on is added. Moreover, EQ-5D also presented ceiling effects in mild disease. Fourth, patients’ geographical dispersion is likely to have required multi-country research studies and multi-lingual instruments, which inevitably demand more logistic and financial resources.

In this review, a variety of approaches were identified to estimate HSUVs. EQ-5D, which is currently the most frequently cited instrument (85%) in national pharmacoeconomic guidelines [160], was also the preferred instrument in the studies reviewed, with over 64% using it alone or in combination with other approaches. Similarly, 15 out of 22 (68.2%) manufacturers’ submissions presented EQ-5D utilities derived in different ways (i.e., from patients, from clinical experts using ‘vignettes’, by ‘mapping’ non-preference-based measures, or using values from the literature), in agreement with NICE guidelines indicating that EQ-5D can be sourced from the literature or by mapping, if not available in the clinical trials [9]. On the basis of methodological considerations extracted from published studies, EQ-5D is found to be an easy, valid and sensitive instrument in chronic lung diseases (i.e., cystic fibrosis and idiopathic pulmonary fibrosis), and also allows comparison with other conditions. However, it also includes irrelevant aspects (i.e., self-care) and presents significant ceiling effects, especially in less severe health states. The instrument was also presented as valid and appropriate in hereditary transthyretin amyloidosis and hereditary angioedema. In DMD, instead, EQ-5D is considered almost insensitive in capturing relevant changes in the disease, especially because it does not consider any mobility aspects besides walking. The main EQ-5D limitations (i.e., poor sensitivity, ceiling effects) were also highlighted in other rare conditions (i.e., Fabry disease, IRDs, mucopolysaccharidosis type IVa, X-linked hypophosphatemia). NICE, on the other hand, acknowledged that the EQ-5D may not be the most appropriate instrument in some cases, and that alternative QoL measures may be used provided that the lack of content validity for the EQ-5D is proven [9]. For example, in HST10, NICE’s committee considered that EQ-5D might not fully capture all the effects of autonomic neuropathy [29].

The comments about the other two frequently used preference-based PROMs (i.e., HUI2/3 and QWB) identified in the literature review (which were used by 12.5% and 9.8% of published studies, respectively) were heterogeneous. Similar to EQ-5D, HUI2/3 was found to perform well in rare lung diseases and to include relevant dimensions (e.g., hearing, fertility) for cystic fibrosis patients, but was not considered relevant enough for physically impaired patients, such as those affected by DMD and SMA. QWB was used by eleven studies, all performed in chronic lung diseases (i.e., cystic fibrosis and idiopathic pulmonary fibrosis), where the evaluation of its performance and validity was mixed. In relation to direct techniques, it is worth mentioning that two studies [61, 148] stated that SG tends to over- or underestimate HSUVs, depending on the respondent’s willingness to accept the risk. Finally, the use of ‘vignettes’ valued by non-patient populations was often perceived as a ‘second-best’ approach justified by the low prevalence of the disease and the consequent difficulties in recruiting patients [107]. Indeed, some studies [84, 138] highlighted the limitations of such an approach, mainly regarding standard descriptions of health states that do not capture the heterogeneous manifestations of diseases, the development of vignettes in the absence of patient input, the limited overlap between health state descriptions and the instrument (e.g., EQ-5D) used to value them, and the difficulties for the general public in identifying with hypothetical RD patient descriptions.

Conclusions

This is the first paper that analyses the methods used to estimate HSUVs in NICE TA and HST appraisals of drugs with an EMA orphan designation and compares them with the published literature for the same targeted RDs. The study confirmed that EQ-5D is the preferred preference-based instrument, as recommended by NICE and other HTA authorities. The agreement on methodological choices between the two types of documents (i.e., NICE appraisals and published manuscripts) was only partial, since the former relied more often on ‘vignettes’ and clinical experts. Although almost half of the appraisal reports reviewed referred to the literature (sometimes even without specifying the study referred to), more efforts should be made by manufacturers to exploit comprehensively the methods, results and considerations found in the published studies, which occasionally advise against using EQ-5D in selected RDs. At the same time, since methodological considerations retrieved from the existing literature are often inconsistent, future research could further investigate the most appropriate methods to estimate HSUVs for individual RDs and identify ways to overcome the limitations of existing instruments. However, the wide use of EQ-5D in published research studies should provide some reassurance concerning the validity of this instrument at least in some RDs, such as those affecting the respiratory system (i.e., cystic fibrosis and idiopathic pulmonary fibrosis). The use of condition-specific preference-based measures, which was very limited in our review, is also encouraged, at least to determine their validity and responsiveness in comparison with the generic ones. Moreover, a deeper understanding of proxy-reporting for paediatric or cognitively impaired patients is required, to assess the validity of this frequently used approach in RDs, as well as the impact of including their carers’ QoL on HTA results.