Introduction

Quality-adjusted life years (QALYs) are a popular metric to evaluate the cost-effectiveness of care interventions [1,2,3,4]. However, a common evidence gap exists between available clinical measures of effect and the detailed preference-based information (e.g. utility scores) needed to estimate QALYs [5]. Within mental health trials, patient-reported outcome measures (PROMs) like the Patient-Health Questionnaire-9 (PHQ-9) and Generalised Anxiety Disorder-7 (GAD-7) are commonly used (often together) to capture depression and anxiety severity, respectively [6,7,8]. These measures are also routinely collected by mental health services such as Improving Access to Psychological Therapies (IAPT) services (now called NHS Talking Therapies) in England as part of their patient-based performance metrics [6, 8,9,10]. However, such PROMs do not have preference-based value sets to enable cost-per-QALY estimates to be interpreted relative to thresholds to infer cost-effectiveness, e.g. in England and Wales, the National Institute for Health and Care Excellence’s (NICE’s) £20,000 to £30,000 per QALY threshold [4, 11, 12].

Preference-based PROMs like the EQ-5D three-level (EQ-5D-3L) and five-level (EQ-5D-5L) versions have country-specific preference-based value sets for the estimation of QALYs and are favoured by health technology assessment organisations internationally, including NICE [1,2,3,4]. However, existing empirical evidence indicates limitations of the EQ-5D measures in mental health populations, recommending a more mental health focussed preference-based measure for mental health service users [13,14,15,16,17,18,19,20]. The Recovering Quality-of-Life 20-item (ReQoL-20) and 10-item (ReQoL-10) are two such PROMs capturing ‘recovery-focussed quality-of-life’ for mental health service users [21]. A UK preference-based value set has been developed to calculate QALYs from seven ReQoL-10 items: the ReQoL Utility Index (ReQoL-UI) [22]. Key differences in ReQoL-UI and EQ-5D-5L design, utility score distributions, psychometric properties, and subsequently estimated QALYs have been assessed and discussed [23, 24].

Preference-based measures like the EQ-5D-5L or ReQoL-UI are frequently absent from clinical studies or routine service data collection, which prevents direct QALY calculation. The term ‘mapping’ is used to describe the process of estimating a statistical relationship between observed clinical outcome measures and preference-based measures using an estimation dataset containing both types of information. The estimated ‘mapping’ model can predict missing preference-based scores for clinical studies or care services based on observed clinical outcome measures. However, the distribution of preference-based scores tends to exhibit characteristics that make standard regression-based models such as linear and Tobit regressions inappropriate for mapping, and their use should be discouraged despite traditionally being common practice [25,26,27]. Specifically for mapping, adjusted limited dependent variable mixture models (ALDVMMs) were first proposed by Hernández Alava et al. [28] to deal with the distributional features presented by the EQ-5D-3L, with supportive evidence when modelling other preference-based scores such as EQ-5D-5L [26, 29]. Alternative mixture models, such as mixture beta regression models (Betamix), might also have benefits relative to ALDVMMs depending on the utility scores’ underlying distribution [30,31,32].

Our overall aim is to map from the GAD-7 and PHQ-9 to the ReQoL-UI or EQ-5D-5L based on ‘best practice’ mapping methods using an estimation dataset obtained from an IAPT-based trial population [24, 33, 34]. To accomplish this aim, we firstly use ALDVMMs to map from the GAD-7 and PHQ-9 to the ReQoL-UI to enable QALY estimation. Secondly, the availability of the EQ-5D-5L in the estimation dataset provides an opportunity to investigate previously raised issues around the appropriateness of mapping from PHQ-9 and GAD-7 to generic measures such as the EQ-5D-5L [16]. This second objective is complicated by the fact that EQ-5D-5L responses can be assigned utility scores using country-specific value sets, such as the current EQ-5D-5L value set for England (VSE) or United States value set (USVS), or predicted EQ-5D-3L utility scores using an existing mapping function [35,36,37]. In England and Wales, NICE does not recommend the VSE, instead previously recommending the ‘cross-walk’ by van Hout et al. [36]; however, since January 2022, NICE changed its recommendation from the cross-walk to the mapping function developed by the NICE Decision Support Unit (DSU) [4, 38,39,40]. Work is ongoing to recommend the most appropriate way to map to the DSU mapping function, so it is not included in our analysis. Instead, mapping to three EQ-5D-5L utility scores (i.e. VSE, USVS, and cross-walked) provides additional insights into the suitability of mapping to generic preference-based measures given the marked differences across their distributions [23, 41,42,43].

Outcome measures

Appendix S1 provides a summarised overview of all PROMs.

Mental health measures

The PHQ-9 is a self-reported depression screening measure reflecting the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition—Text Revision (DSM–IV–TR) criteria [8, 44, 45]; summary score range: 0 (minimal depression) to 27 (severe depression).

The GAD-7 is a self-reported measure of anxiety symptoms and severity based on the DSM-IV GAD diagnostic criteria [7]; summary score range: 0 (minimal anxiety) to 21 (severe anxiety).

The PHQ-9 and GAD-7 are commonly used together to measure depression and anxiety symptomatology, given the often comorbid nature of depression and anxiety [46, 47]. For example, IAPT services have operationalised both measures using ‘caseness’ (PHQ-9 ≥ 10; GAD-7 ≥ 8) and ‘reliable improvement’ (PHQ-9 absolute change ≥ 6; GAD-7 absolute change ≥ 4) threshold values as part of IAPT’s patient-based performance outcomes [6, 8,9,10]. As such, the measures’ summary scores (but not always the item scores) are routinely recorded for IAPT patients.
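As an illustration only, the threshold values quoted above could be applied to summary scores along the following lines. The function names are hypothetical, and two points are assumptions rather than statements from the text: that caseness is met when either measure reaches its threshold, and that ‘reliable improvement’ refers to a fall in score of at least the stated size.

```python
# Threshold values as quoted in the text; all names here are illustrative.
PHQ9_CASENESS = 10
GAD7_CASENESS = 8
PHQ9_RELIABLE_CHANGE = 6
GAD7_RELIABLE_CHANGE = 4


def is_case(phq9: int, gad7: int) -> bool:
    """Return True if either summary score meets its caseness threshold.

    Assumption: caseness on either measure is sufficient.
    """
    if not 0 <= phq9 <= 27 or not 0 <= gad7 <= 21:
        raise ValueError("PHQ-9 ranges 0-27 and GAD-7 ranges 0-21")
    return phq9 >= PHQ9_CASENESS or gad7 >= GAD7_CASENESS


def reliable_improvement(before: int, after: int, threshold: int) -> bool:
    """Return True if the score fell by at least `threshold` points.

    Assumption: improvement means a reduction, since lower scores
    indicate less severe symptoms on both measures.
    """
    return (before - after) >= threshold


# Example: a patient moving from PHQ-9 = 15 to PHQ-9 = 8
print(is_case(15, 6))
print(reliable_improvement(15, 8, PHQ9_RELIABLE_CHANGE))
```

Service-level definitions may combine the two measures in more detailed ways; the sketch deliberately handles one measure at a time.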

Target measures and utility scores

ReQoL-UI

The ReQoL-UI classification system is based on seven ReQoL-10 items each with five severity levels, covering seven themes of self-reported recovery-focused quality-of-life [22]: autonomy; well-being; hope; activity; belonging and relationships; self-perception; physical health. The ReQoL-UI is described as having two overall dimensions: a mental health (six items) and a physical health (one item) dimension [22]. The ReQoL-UI represents 5^7 = 78,125 possible health states, with a score range from −0.195 (worst state) to 1 (best state).

EQ-5D-5L

The EQ-5D-5L is a self-reported, generic health measure with five severity levels, over five dimensions/items: mobility; self-care; usual activity; pain/discomfort; anxiety/depression [48, 49]. The EQ-5D-3L is a previous version of the instrument which uses the same dimensions but with only three severity levels. The EQ-5D-5L’s classification system is able to represent 5^5 = 3,125 health states, compared to the EQ-5D-3L’s 3^5 = 243 health states. EQ-5D-5L utility scores can be estimated using either a direct value set or through using a mapping (‘cross-walk’) function to an EQ-5D-3L value set [35, 36]. Here we focus on two value sets, VSE and USVS, and the van Hout et al. [36] ‘cross-walk’, which maps to the EQ-5D-3L UK value set.
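The health-state counts quoted for the ReQoL-UI and the two EQ-5D versions follow from raising the number of severity levels to the power of the number of items. A short check, with a hypothetical helper name, is:

```python
def n_health_states(n_items: int, n_levels: int) -> int:
    """Number of distinct response profiles: levels ** items."""
    return n_levels ** n_items


print(n_health_states(7, 5))  # ReQoL-UI: seven items, five levels -> 78125
print(n_health_states(5, 5))  # EQ-5D-5L: five items, five levels -> 3125
print(n_health_states(5, 3))  # EQ-5D-3L: five items, three levels -> 243
```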

The cross-walk used a non-parametric response mapping method to predict values that are linked to the EQ-5D-3L value set. This method is based on independent cross-tabulations of EQ-5D-3L and EQ-5D-5L for each dimension and some assumptions about the allowable response patterns. In particular, it is assumed that any response at the lowest (highest) severity level of EQ-5D-5L always corresponds to a response at the lowest (highest) severity level of EQ-5D-3L; therefore, the cross-walk produces an EQ-5D-5L value set with the same range as the EQ-5D-3L UK value set, ranging from 1 (best state) to −0.594 (worst state). As such, cross-walked utility scores mildly mimic distributional aspects of the original EQ-5D-3L UK value set [50].

In comparison, the VSE’s and USVS’s value ranges are smaller than the EQ-5D-3L’s/cross-walk’s, from −0.285 and −0.573 (worst state), respectively, to 1 (best state), when assigned to the EQ-5D-5L’s 3,125 health states.

Methods

Pre-mapping considerations: conceptual overlap and existing mapping studies

An important pre-mapping consideration suggested by ISPOR guidance is the extent of overlap between the clinical outcome measures and target preference-based measure/score; if there is little overlap, mapping success is unlikely [34]. Measures’ conceptual and practical overlap can be examined using psychometric methods (for example, assessing correlations and effect sizes) and additional learnings derived from previous mapping studies.

In terms of psychometrics, the evidence supports the EQ-5D measures more strongly in common mental health disorders such as anxiety and depression than in severe disorders like schizophrenia and bipolar disorder [16,17,18,19, 51]. Relatedly, the ReQoL-UI’s and EQ-5D-5L’s relative psychometric properties have been assessed in general and mental health populations [24, 52]. Against the PHQ-9 and GAD-7 in IAPT patients, Franklin et al. [24] concluded the ReQoL-UI has relatively better construct validity with the PHQ-9; however, the EQ-5D-5L had relatively better construct validity with the GAD-7.

The mapping literature is sparse in this area, limiting the insights that can be obtained. A 2019 systematic review of mapping studies by Mukuria et al. [25] identified a single study focussed on mapping from mental health measures (e.g. PHQ-9 and GAD-7) to preference-based measures (EQ-5D-3L and SF-6D) [16]: Brazier et al. [16] questioned the appropriateness of mapping from mental health measures to generic preference-based measures based on their mapping performance statistics. However, Brazier et al.’s [16] analyses did not include mixture models; rather, they focussed on more traditional OLS, Tobit, and response-level mapping models. One other study ‘mapped’ from the PHQ-9 to the EQ-5D-3L using a non-regression-based approach (i.e. equipercentile linking); however, limited reported results restricted performance assessment of this approach [53,54,55]. A non-peer-reviewed study mapped from the Health of the Nation Outcome Scales (HoNOS) to the ReQoL-UI, which is the only previous study we identified which mapped to the ReQoL-UI; however, this study only used an OLS model and the HoNOS is clinician-reported rather than patient-reported, which may have contributed to the authors suggesting caution when using their mapping functions.

Estimation data source

The estimation dataset was obtained from a parallel-groups, randomised waitlist-controlled trial examining the effectiveness and cost-effectiveness of internet-delivered Cognitive Behavioural Therapy (iCBT) for patients presenting with depression and anxiety, conducted at an established IAPT service with eligibility criteria described in Appendix S1 [33, 56]. The trial collected PROM data at baseline and 8 weeks across both trial-arms; additional data collection time-points for the intervention-arm only were at 3, 6, 9, and 12 months. NHS England Research Ethics Committee provided trial ethics approval (REC Reference: 17/NW/0311). The trial was prospectively registered: Current Controlled Trials ISRCTN91967124. The trial is completed with the protocol and main results published [23, 33, 56].

Mapping models

Our mapping of interest is fitting ALDVMMs to the ReQoL-UI and EQ-5D-5L (VSE, USVS, or cross-walk); all utility scores are UK/England specific, apart from the USVS. When the predictions from ALDVMMs were deemed to not sufficiently suit the observed data, Betamix models were used instead. We used the aldvmm or betamix command within the statistical software package Stata Version 17 [57]. The aldvmm command estimates the variant of the model presented in Hernández Alava et al. [27, 58]. Full instructions on how to use the aldvmm command are described by Hernández Alava and Wailoo [29]. The betamix command is described by Gray and Hernández Alava [31].

ALDVMMs are flexible models that can approximate many distributional forms by combining (mixing) multiple component distributions; each component’s distribution is allowed to have different parameters for the same set of variables (i.e. xvars). Additional probability variables (i.e. pvars) predict the probability of each observation belonging to each component. Betamix models are similar to ALDVMMs in terms of being mixture models; although, key differences are that they are designed for dependent variables bounded in an interval (i.e. beta distributions are bounded between 0 and 1) and there are additional modelling options such as being able to specify a probability mass (i.e. pmass) at the lower and upper score, and some defined truncation point, of the dependent variable.
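To make the roles of xvars and pvars concrete, the following is a minimal sketch of how a generic finite mixture combines component-specific predictions with membership probabilities. It is not the aldvmm or betamix implementation: it omits the adjustments ALDVMMs make for the upper limit, the gap below full health, and the lower bound, and all function names and coefficient values are hypothetical.

```python
import math


def softmax(scores):
    """Convert component scores into probabilities that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(coefs, values):
    """Linear predictor; coefs[0] is the intercept."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], values))


def mixture_mean(xvars, pvars, component_coefs, membership_coefs):
    """Expected value of an unadjusted mixture.

    component_coefs: one coefficient list per component (applied to xvars).
    membership_coefs: one list per component except the last, which is
    treated as the reference category with a score of zero.
    """
    scores = [linear(g, pvars) for g in membership_coefs] + [0.0]
    probs = softmax(scores)
    means = [linear(b, xvars) for b in component_coefs]
    expected = sum(p * m for p, m in zip(probs, means))
    return expected, probs


# Hypothetical two-component example with PHQ-9 = 12 and GAD-7 = 9
# used both as xvars and as pvars; coefficients are made up.
expected, probs = mixture_mean(
    xvars=[12, 9],
    pvars=[12, 9],
    component_coefs=[[0.95, -0.010, -0.005], [0.60, -0.020, -0.010]],
    membership_coefs=[[1.0, -0.10, -0.05]],
)
print(round(expected, 3), [round(p, 3) for p in probs])
```

The same structure extends to three or four components by adding further coefficient lists; the reference-category convention is one common way to identify the membership model and is assumed here.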

We estimated ALDVMMs (and Betamix when required) with 2–4 components; although it is possible to estimate 1-component models, fitting more than one component tends to improve model fit, so we do not present the 1-component model results. We describe how we moved from 2- to 4-component models in Appendix S1. For all ALDVMMs, we included PHQ-9 summary score (continuous variable), GAD-7 summary score (continuous variable), age (continuous variable), and sex (binary variable) to predict the utility scores within the components; however, we evaluate models with different variables and specifications. When a Betamix was chosen as preferable, only the PHQ-9 and GAD-7 summary scores were included as the core covariates of interest given the additional computational time and complications of trying to assess more modelling specifications using Betamix relative to ALDVMMs.

Model fit statistics and graphs

To compare results across models, we considered standard model fit measures/criteria such as absolute mean error (AE), mean absolute error (MAE), root mean square error (RMSE), log likelihood (LL), Akaike information criterion (AIC), Bayesian information criterion (BIC), and graphical methods for model selection in mapping [59]. An AE closer to zero, higher LL, and lower MAE, RMSE, AIC, and BIC indicated a better fit. Graphical methods have been shown to be essential for mapping model selection, as described in Appendix S1 [59]. Because the number of models included in this mapping study produced a large number of graphs, we only compare graphs between two models for any given target utility score after assessing their model fit statistics. Specifically, we plotted the mean of the predicted utility scores with the mean observed values by PHQ-9 and GAD-7 scores. We also simulated data from the models and plotted the cumulative distribution functions (CDFs) comparing simulated with observed data across the severity range.
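For readers less familiar with these criteria, a minimal sketch of how they are conventionally computed is given below. The function name is hypothetical; AE is interpreted here as the absolute value of the mean error, which is an assumption about the definition used, and AIC and BIC follow their standard formulas (2k − 2LL and k·ln(n) − 2LL, with k parameters and n observations).

```python
import math


def fit_statistics(observed, predicted, log_likelihood, n_params):
    """Summary fit statistics for one model.

    Errors are defined as predicted minus observed.
    """
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("observed and predicted must be equal, non-zero length")
    errors = [p - o for o, p in zip(observed, predicted)]
    mean_error = sum(errors) / n
    return {
        "AE": abs(mean_error),  # assumed: absolute value of the mean error
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "LL": log_likelihood,
        "AIC": 2 * n_params - 2 * log_likelihood,
        "BIC": n_params * math.log(n) - 2 * log_likelihood,
    }


# Made-up values purely to show the call; not results from this study.
stats = fit_statistics(
    observed=[0.50, 0.70, 0.90],
    predicted=[0.55, 0.65, 0.85],
    log_likelihood=-10.0,
    n_params=3,
)
for name, value in stats.items():
    print(name, round(value, 4))
```

Because BIC multiplies the parameter count by ln(n), it penalises additional variables more heavily than AIC whenever n exceeds about seven observations, which is why the two criteria can favour different specifications.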

Throughout we followed ISPOR good practice mapping guidance [34]. This guidance does not wholly support the use of internal validation approaches (i.e. splitting the dataset into an estimation and validation dataset), in part because sample splitting reduces the sample size for estimation and because it is uncertain what extra value these validation analyses provide; we therefore opted not to split the dataset for such an internal validation approach [34].

Results

Descriptive statistics of the estimation dataset population

Overall, 353 people at baseline across both trial-arms (237 intervention; 116 control) completed the ReQoL-10, GAD-7, and PHQ-9; 352 completed the EQ-5D-5L. Across all six data collection time-points, 1340 observed value scores for each of the ReQoL-UI, GAD-7, and PHQ-9 were available; 1339 for the EQ-5D-5L. All observed case data across all time-points and trial-arms were used for mapping.

The sample mean age at baseline is 33 (range: 18–74) with a female majority (71%). Figure 1 presents the distributions of PROM scores, with comparisons of ‘baseline’ vs ‘all time points’ distributions showing a sample shift towards the healthier part of the distributions. The ReQoL-UI has a smoother distribution than EQ-5D-5L utility scores. Additional descriptive statistics are provided in Appendix S1.

Fig. 1 Distribution of ReQoL-UI, EQ-5D-5L VSE, USVS and cross-walk, PHQ-9, and GAD-7 scores at baseline and across all time-points

Model fit statistics

Model fit statistics for 36 ALDVMMs are presented in Table 1: 12 ALDVMMs to each of the ReQoL-UI, EQ-5D-5L VSE and cross-walk. Generally, across all models, increasing the number of components improved model fit and there were no perceived issues with the use of ALDVMMs.

Table 1 Model fit statistics for the ALDVMMs for the ReQoL-UI, EQ-5D-5L VSE and cross-walk

Model fit statistics for both ALDVMM and Betamix model specifications to the USVS are presented in Table 2. Although the ALDVMM fit statistics seemed reasonable, graphical methods identified an issue that suggested Betamix might be preferable (see “Comparison of mean predicted and observed utility scores” section). When using ALDVMMs and Betamix, both sets of models had convergence problems or were tending to unbounded models when attempting to fit 4-components; therefore, no 4-component model results are reported related to the USVS.

Table 2 Model fit statistics for the ALDVMMs or Betamix for the EQ-5D-5L USVS

ReQoL-UI

The lowest predictive errors (i.e. lowest MAE and RMSE values) were attained when the pvars were PHQ-9, GAD-7, and sex (e.g. model R6). Including age as an additional pvar increased goodness of fit (i.e. higher LL and lower AIC values); however, it did so at the cost of increased predictive error (i.e. higher RMSE and MAE values), for example when comparing R3 and R6. The lowest BIC was for R11, which is not surprising given the way BIC penalises having more variables, despite the benefits the inclusion of more variables has on performance statistics other than BIC, such as for R3 and R6.

EQ-5D-5L VSE

The lowest RMSE value was obtained when the pvars were PHQ-9, GAD-7, age, sex (i.e. V3), but goodness of fit improved when age and sex were not included as pvars (i.e. V12). The lowest MAE was for V7 which was a 2-component model which did not include age as a pvar; however, moving from a 2- to 4-component model tended to improve goodness of fit and RMSE, at the cost of MAE.

EQ-5D-5L Cross-walk

The best goodness of fit statistics and RMSE were when the pvars were PHQ-9, GAD-7, age, and sex (i.e. C3). BIC was lowest for the model with the fewest pvars (similar to the ReQoL-UI and VSE); the lowest MAE was for C9.

EQ-5D-5L USVS

Betamix was preferred to ALDVMMs. For the Betamix models, the lowest predictive error was for a 2-component model, although the better goodness of fit statistics were for the 3-component model.

Comparison of mean predicted and observed utility scores

Based on model fit statistics, we use graphical methods to compare between the following 4-component models: R3 and R6; V3 and V12; C3 and C9. For the USVS, we use graphical methods to compare between 2- and 3-component ALDVMM (A-U1 vs A-U2) and Betamix (B-U1 vs B-U2) models. Figure 2 (UK/England utility scores) and Fig. 3 (USVS) present the mean predicted and observed utility scores, and Fig. 4 presents the CDFs for the simulated data.

Fig. 2 Mean predicted and observed utility scores for models: R3 vs R6; V3 vs V12; C3 vs C9

Fig. 3 Mean predicted and observed utility scores for ALDVMMs (A-U1 vs A-U2) and Betamix models (B-U1 vs B-U2)

Fig. 4 Cumulative distribution functions for the simulated data for models: R3 vs R6; V3 vs V12; C3 vs C9; A-U1 vs A-U2; B-U1 vs B-U2

ReQoL-UI

The benefits of R6’s lower MAE and RMSE relative to R3 become more apparent in Fig. 2, particularly based on the observed versus predicted utility scores at the severe end of the PHQ-9 score scale, i.e. ≥ 23. That is, we can visually see that the predictive error for R3 is larger than for R6 for those people with a PHQ-9 score ≥ 23. Across the GAD-7 score scale, the predictive errors seem visually similar between models R3 and R6. Based on the CDFs, there is little difference between the actual and modelled data for both R3 and R6, which suggests both models fit equally well in terms of the distribution.

EQ-5D-5L VSE

The visual comparison between V3 and V12 is less clear-cut than between R3 and R6. Figure 2 indicates both models map well across the GAD-7 score scale, but have larger predictive errors at the severe end of the PHQ-9 score scale i.e. ≥ 23. Although not instantly obvious based on the CDFs (Fig. 4), V3 does fit slightly better than V12 across the utility score range of 0.6 to 0.9.

EQ-5D-5L Cross-walk

The visual comparison between C3 and C9 is again less clear-cut, with Fig. 2 again suggesting good fit with the GAD-7, larger predictive error when PHQ-9 score scale ≥ 23, and almost identical CDFs; this is not surprising though given the almost identical model fit statistics with small between-model trade-offs in MAE and RMSE.

EQ-5D-5L USVS

Although the mapping functions from the ALDVMMs fit reasonably across the clinical (Fig. 3) and utility score ranges (Fig. 4), the models did not fit well for higher utility values: the proportion of perfect health values (1) implied by the estimated ALDVMMs is too high, as shown in Fig. 4. In comparison, the Betamix models overcame this issue with lower predictive error statistics than for the ALDVMMs, also shown in Fig. 4. Figure 4 visual comparisons between B-U1 and B-U2 revealed a slightly better fit across the middle score range (e.g. between 0.4 and 0.7) with similar fit across the rest of the score range.

Choosing a mapping function

For each target UK/England utility score, comparisons were made across all 12 models; however, for descriptive purposes, here we focus just on comparisons between models: R3 and R6; V3 and V12; C3 and C9.

  • ReQoL-UI: R6 is chosen due to its lower MAE and RMSE, but also based on the visual comparisons across the mean predicted and observed utility scores across the PHQ-9 and GAD-7 score ranges.

  • EQ-5D-5L VSE: V3 is chosen due to its lower MAE and RMSE despite the differences between models not initially being visually obvious using graphical methods.

  • EQ-5D-5L Cross-walk: C3 is chosen due to its lower RMSE and better goodness of fit statistics; although, the model was very similar to C9 both in terms of model fit statistics and based on graphical methods.

For the USVS, when comparing between the 2-component and 3-component Betamix models, the predictive error statistics and fit through visual inspection were better for the 2-component model despite the 3-component model having the better AIC and BIC. Therefore:

  • EQ-5D-5L USVS: B-U1 was chosen because of its fit at higher utility scores than the ALDVMMs, and lower predictive errors both in statistics and visually compared to the other Betamix model (B-U2).

Discussion

Across all mapping models to UK/England utility scores, we selected 4-component models where utility within each component was a function of PHQ-9, GAD-7, age, and sex. For mapping to the ReQoL-UI we selected R6, where the probability of component membership was a function of PHQ-9, GAD-7, and sex. For mapping to the EQ-5D-5L VSE or cross-walk we selected V3 or C3, respectively, where the probability of component membership was a function of PHQ-9, GAD-7, sex, and age. Results pertaining to alternative model specifications are presented in Appendix S2.

For the USVS, the mapping process and results were more complicated. For the ALDVMMs, the models did not fit well for higher utility values, such that the proportion of perfect health values (1) implied by the estimated model was too high. Even though moving from 2 to 3 components reduced the proportion of ones, ALDVMMs were unable to match the observed proportion. The problem stemmed from the large probability mass present in the USVS sample distribution just below the gap (see Fig. 1), which would require a degenerate distribution. This is difficult to achieve with the ALDVMM, thus leading to the decision to use Betamix, which is able to generate a separate probability mass at the truncation point.

Predictions from our recommended mapping functions are provided in an Excel-based lookup table, provided as part of the online Supplementary Materials.

Mapping to the USVS relative to the UK/England utility scores

The USVS in our estimation sample caused complications for our identified ALDVMMs that did not occur when mapping to the EQ-5D-5L VSE or cross-walk, nor ReQoL-UI. It should be noted that ALDVMMs are quicker and easier to fit than Betamix; however, Betamix has been developed to have more modelling options and therefore some additional flexibility for mapping than ALDVMMs when required. In this case, it was the ability of Betamix to specify probability mass at the upper (i.e. 1) and truncation (i.e. 0.943) values of the USVS which enabled us to overcome the problems when using ALDVMMs at the upper end of the utility scale, despite the additional computational time and considerations required to fit Betamix relative to ALDVMMs.
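The role of the separate probability masses can be illustrated with a simple simulation. The sketch below is not the betamix estimator: it uses a single beta component rather than a mixture, and the function name, mass proportions, and shape parameters are hypothetical. Only the bounds (−0.573 and 1) and the truncation value (0.943) are taken from the text.

```python
import random


def simulate_utilities(n, p_full_health, p_truncation, alpha, beta,
                       lower=-0.573, truncation=0.943, seed=0):
    """Draw utilities with point masses at 1 and at the truncation value.

    Remaining draws come from one beta distribution rescaled to
    [lower, truncation], so no value falls in the gap (truncation, 1).
    """
    if p_full_health + p_truncation > 1:
        raise ValueError("point-mass probabilities must sum to at most 1")
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        u = rng.random()
        if u < p_full_health:
            values.append(1.0)                 # mass at full health
        elif u < p_full_health + p_truncation:
            values.append(truncation)          # mass at the truncation point
        else:
            b = rng.betavariate(alpha, beta)   # draw on [0, 1]
            scaled = lower + b * (truncation - lower)
            values.append(min(scaled, truncation))  # guard against rounding
    return values


# Hypothetical proportions and shape parameters, for illustration only.
sample = simulate_utilities(1000, p_full_health=0.10, p_truncation=0.08,
                            alpha=5.0, beta=2.0)
share_ones = sum(v == 1.0 for v in sample) / len(sample)
print(round(share_ones, 3))
```

A model without an explicit mass at the truncation value has to approximate that spike with a continuous component, which is consistent with the difficulty described above for ALDVMMs.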

Comparisons with previous mapping studies

We identified three previous mapping studies relevant for comparison with our mapping study from the GAD-7 and/or PHQ-9 to the ReQoL-UI and/or EQ-5D (five- or three-level versions) as part of our pre-mapping considerations to inform our mapping plans.

Brazier et al. [16] included the GAD-7 and PHQ-9 (among other mental health measures) with intentions to map to the EQ-5D-3L and SF-6D. This study used more traditional mapping models (OLS, Tobit, and response-level) rather than more modern and currently recommended mixture models; however, Brazier et al. [16] was published in 2014 before mapping using mixture models gained widespread attention. It is important to note that Brazier et al. [16] never mapped from the GAD-7 and PHQ-9 to the EQ-5D(-3L); rather, they mapped from the GAD-7 and PHQ-9 only to the SF-6D, with an alternative mental health measure (the Hospital Anxiety and Depression Scale, HADS) being used to map to the EQ-5D-3L. This was because the IAPT estimation dataset (one of four datasets) they had available with the PHQ-9 and GAD-7 only included the SF-6D, not the EQ-5D-3L. However, through inference from all the mapping they conducted, their overall conclusion was that “mapping from mental health condition-specific measures, such as the widely used PHQ-9, GAD and HADS, may not be an appropriate approach to generating EQ-5D and SF-6D scores as these measures focus on specific symptoms and not on the wider impact of mental health conditions”. Our current mapping study and associated previous psychometric analysis do not concur with Brazier et al.’s [16] conclusion [24], noting that our mapping studies are not completely alike (e.g. due to using a different target measure). However, reasons our conclusions do not concur could be associated with our use of more suitable mixture regression models for mapping compared to traditional mapping models (e.g. OLS) which have known limitations, that we are using the newer EQ-5D-5L rather than the previous EQ-5D-3L which has known shortcomings in mental health populations, and that we mapped from the PHQ-9 and GAD-7 to the EQ-5D-5L (and ReQoL-UI) which this previous study did not [13,14,15,16,17,18,19,20, 25,26,27].

Furukawa et al. [55] ‘mapped’ from the PHQ-9 to the EQ-5D-3L using a non-regression-based approach (i.e. equipercentile linking); however, Furukawa et al. [55] does not describe itself as a mapping study and thus does not follow any current mapping guidance. The current first author published a correspondence about the study by Furukawa et al. [55] which outlines concerns about the study and the ‘mapping function’ it produced, to which a response was also published [53, 54]. Overall, the study by Furukawa et al. [55] provides little to no model performance statistics, thus comparisons cannot be made with our current mapping study.

Keetharuth and Rowen [60], a non-peer-reviewed article, mapped from the HoNOS to the ReQoL-UI. Although Keetharuth and Rowen [60] follows mapping guidance and is appropriately reported, it has two key limitations: first, only OLS models are used; second, the HoNOS is clinician-reported, thus the completer’s perspective is different to that of the ReQoL-UI (i.e. patient-reported), which limits the conceptual overlap between the two measures. Keetharuth and Rowen [60] recognise these limitations and thus recommend caution when using their mapping functions.

Overall, previous mapping studies have not produced mapping functions between our source and target measures, with those mapping studies which are somewhat comparable to ours using more traditional regression-based (e.g. OLS) or non-regression-based (i.e. equipercentile linking) methods compared to the more modern and currently recommended mixture regression models we have used. Our study further emphasises the benefits of using mixture models, with ALDVMMs being a good starting point as they work well for mapping when used appropriately [25,26,27]. Alternatively, Betamix can overcome the shortcomings of ALDVMMs (e.g. for the USVS in our study), noting Betamix is computationally more complicated and time consuming despite its relative benefits, thus ALDVMMs are the preferred starting model as was the case for this study. Overall, our mapping functions represent a needed tool for predicting utility values from the commonly used PHQ-9 and GAD-7 mental health measures.

Using the alternative predictions: aspects for consideration

Although all our predicted utility scores can be used to estimate QALYs, the source of these utility scores requires careful consideration. Firstly, each of our target utility scores has been shown to produce different QALYs [23]; therefore, it is logical to assume these predictions will produce different QALYs. The EQ-5D-5L is the more commonly used and known preference-based measure, relative to the newer ReQoL-UI. The constructs of these measures are different; although both are suggested to be ‘generic health measures’, the descriptive system of EQ-5D-5L is more physical health focussed relative to the ReQoL-UI’s more mental health focus. The measures and associated utility scores have also been shown to have different relationships with anxiety and depression as measured by the GAD-7 and PHQ-9, respectively, which will have influenced the mapping models [24].

Use of predicted utility scores: strengths and limitations

The mapping predictions have been estimated from a specific patient population involved in an IAPT-based trial: new IAPT Step 2 service referrals who met the trial eligibility criteria. IAPT Step 2 focusses on specific mental health populations and interventions; i.e. common mental health conditions that could benefit from low intensity therapies as brief psychological interventions (e.g. digital mental health interventions, Bibliotherapy) offered with support from clinicians [61]. Additionally, our data collection time-period covers a 12-month care pathway when the patient is on a waiting-list or treatment, and a period post-discharge. As such, we have less data that covers the ‘severe’ spectrum of anxiety and depression (mainly from baseline assessment) and this could explain our mapping models’ poorer performance at the severe end of the scale. Therefore, in mental health populations where ‘severe’ depression and anxiety are more prevalent (e.g. inpatient settings), our mapping functions are prone to higher predictive errors; alternative mapping predictions should be sought in such severe patient populations. For mental health trials wanting to use the predictions, consideration should be given to how representative an IAPT Step 2 population is of their trial population; for example, through comparative assessment against our PROM score distributions in Fig. 1, with additional estimation sample descriptive statistics in Appendix S1.

Conclusion

Our mapping functions can be used to predict either the ReQoL-UI, EQ-5D-5L VSE, USVS or cross-walked utility scores from the PHQ-9 and GAD-7 summary scores. Our analyses found that including more than one component improved model fit, with the preferred ALDVMMs based on 4-component models, and that Betamix was preferred to ALDVMMs when mapping to the USVS only. Our mapping functions can be used in economic evaluations to predict utility as a function of the commonly collected PHQ-9 and/or GAD-7 summary scores.