Key Points for Decision Makers

Compared with the EQ-5D-Y-3L, the EQ-5D-Y-5L provides more information about the severity of children and adolescents’ health-related quality of life.

The EQ-5D-Y-5L shows moderate validity and fair reliability and can be included in economic evaluation and clinical and quality-of-care studies.

Although the self- and proxy-report data cannot be directly compared, there is evidence that the relationship between the two instruments differs across the samples.

1 Introduction

The EQ-5D is one of the most widely used instruments to measure health-related quality of life (HRQoL) in adults [1]. The EQ-5D-Y was developed to measure HRQoL in children and young people aged 4–15 years, and aims to facilitate the economic evaluation of healthcare interventions to inform decision making for children and young people [2]. A self-report version is available for children aged 8–15 years, and a proxy-report version was developed for parents or caregivers of children as young as 4 years, which can also be used for older children who are unable to complete the self-report. The EQ-5D-Y was developed based on the adult EQ-5D and defines items for the domains of mobility, ‘walking around’; self-care, ‘looking after myself’; usual activities, ‘doing usual activities’; pain or discomfort, ‘having pain or discomfort’; and anxiety or depression, ‘feeling worried, sad or unhappy’ [3, 4].

Initially, the EQ-5D-Y was developed with three response levels for each domain (EQ-5D-Y-3L, hereafter Y-3L). Prior research has shown the Y-3L to be a reliable instrument [5]; however, it has been criticised for the limited extent to which it reflects variability in individual responses, owing to having only three levels [6], and for having high ceiling effects [5, 7] (where a high proportion of respondents report ‘no problems’ on all instrument domains or items). To overcome these limitations and improve the psychometric performance of the Y-3L, a version with five response levels (EQ-5D-Y-5L, hereafter Y-5L) has been developed [8]. The feasibility of the Y-5L has been confirmed in previous studies; however, these studies have suggested that further testing of the psychometric properties of the Y-5L is required in different samples and conditions [9,10,11,12].

The psychometric features of the adult versions (which also have a three- and a five-level version) have been compared previously, and most studies supported the advantages of the EQ-5D-5L descriptive system. With the 5L version, the ceiling effect decreased and the discriminatory ability and informativity increased [13, 14]. It is unknown to what extent the advantages of the adult 5L over the adult 3L will also be apparent in the Y-5L compared with the Y-3L. Understanding the differences in psychometric properties between the Y-3L and Y-5L can help inform the choice of a valid HRQoL instrument and clinical and health policy decision making.

In paediatric research, it is common to use proxy (often caregiver) responses on behalf of the child, especially when the child is younger or does not have the ability to respond. However, studies have demonstrated inconsistencies between child self-reports and proxy-reports [15, 16]. Agreement between the two groups tends to be closer for physical domains, such as mobility, with responses diverging for mental domains such as anxiety [17]. Proxies often rate functioning worse than children rate it for themselves [18]. Understanding the psychometric properties of the Y-5L compared with the Y-3L in both these groups is therefore important.

This study aimed to compare the distributional and psychometric properties of the Y-3L and Y-5L instruments in separate self- and proxy-report populations using data from the Australian Paediatric Multi-Instrument Comparison (P-MIC) study. The study aimed to assess the following psychometric properties: ceiling effects, construct validity, reliability and informativity. We also explored how the data from each instrument are distributed relative to the other in the P-MIC sample.

2 Methods

2.1 Data

The data for this research were collected as a part of the P-MIC study [19]. The P-MIC study gathered demographic information alongside multiple generic paediatric HRQoL instruments from Australian children (as self-report) or from caregivers providing their perception of the child’s health (proxy-report). Caregivers of children aged 7–18 years were asked if their child was able to self-report their HRQoL. If the caregiver said yes, only the child was asked to self-complete the HRQoL instruments; if the caregiver said no or the child was younger than 7 years of age, the caregiver was asked to proxy-report the HRQoL instruments. None of the instruments were interviewer-administered [20]. Data included in the analysis are from P-MIC data-cut 2, dated 10 August 2022, and include 94% of the total planned P-MIC participants [21]. The data cover children aged 5–18 years [20] from health condition groups and general population samples, recruited via a hospital or an online survey panel. The wider study measured HRQoL for children aged 2–4 years using an experimental version of the Y-5L instrument, and this is reported elsewhere [22]. The health condition groups are recurrent abdominal pain, attention-deficit hyperactivity disorder (ADHD), anxiety or depression, autism spectrum disorders (ASD), asthma, dental problems, eating disorder, epilepsy, and sleeping problems. A detailed explanation of the P-MIC study survey design, data collection samples and data-cuts has been reported by Jones et al. [21].

Data were gathered using online surveys via REDCap. Participants completed both the Y-3L and Y-5L instruments in random order and also completed the EuroQol visual analogue scale (EQ-VAS), which was always presented with the Y-3L. The design of the survey data collection form required a response to each question and hence did not permit missing data. A random subset of participants in the general population sample received the follow-up survey at 2 days to allow for reliability assessment.

2.2 EQ-5D-Y-3L (Y-3L) and EQ-5D-Y-5L (Y-5L) and EuroQol Visual Analogue Scale (EQ-VAS)

Children and adolescents’ HRQoL was measured using the Y-3L and Y-5L descriptive systems, with respondents asked to report the child’s health status ‘today’. The Y-3L has three response levels (no, some, and a lot/very), which result in 243 unique health states. The Y-5L has five response levels (no, a little bit of, some, a lot of, cannot/extreme), which lead to 3125 possible health states. The Y-5L not only increases the number of levels but also changes the level labels, meaning that level 3 on the Y-3L is descriptively equivalent to level 4 on the Y-5L in four domains, reflecting an additional layer of severity in the Y-5L.

The health states are often described as five-digit vectors by taking one level for each domain, with 11111 representing full health and 33333 and 55555 representing the worst possible health state for the Y-3L and Y-5L respectively.
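The state counts quoted above follow directly from five domains with three or five levels each (3^5 = 243 and 5^5 = 3125). A minimal sketch of that enumeration is given below; the function and variable names are illustrative only and are not taken from the P-MIC study.

```python
from itertools import product

DOMAINS = 5  # mobility, self-care, usual activities, pain/discomfort, worried/sad/unhappy


def all_states(levels: int) -> list[str]:
    """Enumerate every five-digit health-state vector for a given number of levels."""
    return [
        "".join(str(level) for level in combo)
        for combo in product(range(1, levels + 1), repeat=DOMAINS)
    ]


y3l_states = all_states(3)
y5l_states = all_states(5)

# First state is full health; last state is the worst possible state.
print(len(y3l_states), y3l_states[0], y3l_states[-1])  # 243 11111 33333
print(len(y5l_states), y5l_states[0], y5l_states[-1])  # 3125 11111 55555
```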

The EQ-VAS comprises a scale from 0 to 100. Respondents are asked to report their health status ‘today’, where zero is labelled as the worst imaginable health and 100 indicates the best imaginable health.

2.3 Statistical Analysis

2.3.1 Ceiling Effects

An instrument was considered to have a ceiling effect when more than 15% of participants rated their health status at the best level of each domain (i.e. 11111) [23, 24]. In this study, the absolute change in ceiling effect was examined as the difference between the proportions of respondents at the ceiling on the two instruments. The relative reduction was also calculated using Eq. 1:

$$\text{relative reduction (\%)} = \frac{\text{ceiling}_{\text{Y-3L}} - \text{ceiling}_{\text{Y-5L}}}{\text{ceiling}_{\text{Y-3L}}} \times 100 \tag{1}$$

Absolute and relative changes in ceiling effect were calculated in the sample of children with conditions, the general population sample and an overall combined sample. Results were separated for proxy- and self-report.

Due to the provision of more response level options aside from level 1, the ceiling effect was expected to be lower for Y-5L, as was the case for adult instruments [25, 26].
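As an illustration of the ceiling-effect calculations described in this section, the sketch below computes the proportion of respondents in state 11111 on each instrument, the absolute change, and the relative reduction from Eq. 1. The ten response vectors are hypothetical and are not P-MIC data; the function names are likewise illustrative.

```python
def ceiling_proportion(states: list[str]) -> float:
    """Percentage of respondents reporting the best level in every domain ('11111')."""
    return 100.0 * sum(state == "11111" for state in states) / len(states)


def relative_reduction(ceiling_3l: float, ceiling_5l: float) -> float:
    """Eq. 1: relative reduction (%) in ceiling effect from the Y-3L to the Y-5L."""
    return (ceiling_3l - ceiling_5l) / ceiling_3l * 100.0


# Hypothetical paired responses from ten respondents (not study data).
y3l = ["11111"] * 4 + ["11121", "11112", "21111", "11122", "22222", "11212"]
y5l = ["11111"] * 3 + ["11112", "11121", "11113", "21111", "11133", "33343", "11313"]

c3 = ceiling_proportion(y3l)                   # 40.0
c5 = ceiling_proportion(y5l)                   # 30.0
absolute_change = c3 - c5                      # 10.0 percentage points
relative_change = relative_reduction(c3, c5)   # 25.0 %
has_ceiling_effect_3l = c3 > 15.0              # threshold used in this section
```

Because the relative reduction divides by the Y-3L ceiling, it is undefined when no respondent is at the ceiling on the Y-3L; a real analysis would need to handle that case.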

2.3.2 Inconsistency and Redistribution

Redistribution properties and the level of response consistency were assessed using the criteria applied by Janssen et al. [27]. An inconsistent response in the current study is described as a Y-3L response that is at least two levels away from the Y-5L response, for instance if the respondent chose level one on Y-3L but chose level four on the Y-5L instrument. The size of the inconsistency was measured as |Y-3L – Y-5L|–1. The redistribution properties of the consistent response pairs were described as proportions of the Y-3L–Y-5L response pairs within each Y-3L response level (Y-3L-1, Y-3L-2, and Y-3L-3). Sankey diagrams [28] were used to show the cross-tabulations for each level and domain of the Y-3L with the corresponding level and domain for the Y-5L.
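The inconsistency rule can be sketched as follows. One point is an assumption rather than something stated in the text: the sketch first recodes Y-3L levels onto the five-level scale (1→1, 2→3, 3→5) before taking the difference, so that a Y-3L level 2 paired with a Y-5L level 3 counts as consistent. The mapping, the function names and the example pairs are all hypothetical.

```python
from collections import Counter

# ASSUMPTION: recoding of Y-3L levels onto the five-level scale; not stated in the text.
RECODE_3L_TO_5L = {1: 1, 2: 3, 3: 5}


def inconsistency_size(level_3l: int, level_5l: int) -> int:
    """|recoded Y-3L - Y-5L| - 1, floored at zero for consistent pairs."""
    return max(0, abs(RECODE_3L_TO_5L[level_3l] - level_5l) - 1)


def is_inconsistent(level_3l: int, level_5l: int) -> bool:
    return inconsistency_size(level_3l, level_5l) > 0


def redistribution(pairs: list[tuple[int, int]]) -> dict[tuple[int, int], float]:
    """Share of each consistent (Y-3L, Y-5L) pair within its Y-3L response level."""
    counts = Counter(pair for pair in pairs if not is_inconsistent(*pair))
    totals = Counter()
    for (level_3l, _), n in counts.items():
        totals[level_3l] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}


# Hypothetical (Y-3L, Y-5L) response pairs for one domain.
pairs = [(1, 1), (1, 1), (1, 2), (2, 1), (2, 3), (3, 5), (1, 4)]
inconsistent_share = sum(is_inconsistent(*p) for p in pairs) / len(pairs)
shares = redistribution(pairs)
```

Under this recoding, the example given in the text (level 1 on the Y-3L and level 4 on the Y-5L) has an inconsistency size of 2.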

2.3.3 Criterion Validity

Criterion validity indicates how accurately the new measure (Y-5L) assesses the same content as the previously employed measure (Y-3L). It was assessed by comparing the Y-3L and Y-5L domain scores using Spearman’s rank correlation coefficient, which is suitable for skewed data. Coefficients from 0.1 to 0.29 were classified as low, 0.3 to 0.49 as moderate, and ≥ 0.5 as high [29]. Because the EQ-5D-Y instruments have the same domains, it was hypothesised that corresponding domains would be highly correlated.

The assessment of validity did not include convergence and divergence, which are presented in separate papers [30, 31].
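The correlation step described in this section can be illustrated with a small, dependency-free sketch: Spearman's coefficient is computed as the Pearson correlation of tie-averaged ranks, and the result is labelled with the thresholds given above. Applying the thresholds to the absolute value and the label for coefficients below 0.1 are assumptions, as are all names and the example data.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho as the Pearson correlation of tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined if either variable is constant


def classify(rho: float) -> str:
    """Thresholds from the text, applied to |rho| (an assumption for negative values)."""
    r = abs(rho)
    if r >= 0.5:
        return "high"
    if r >= 0.3:
        return "moderate"
    if r >= 0.1:
        return "low"
    return "below 0.1"  # the text gives no label for this range


# Hypothetical domain responses from five respondents (not study data).
y3l_domain = [1, 1, 2, 2, 3]
y5l_domain = [1, 2, 3, 3, 5]
rho = spearman(y3l_domain, y5l_domain)
label = classify(rho)
```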

2.3.4 Discriminatory and Informativity Power

The Shannon index (H′) and Shannon evenness index (J′) were applied to assess the informativity and discriminatory power of the instruments. These indices describe the distribution of responses in each domain. The Shannon index (H′) is used to estimate the discriminatory power for each domain in the Y-3L and Y-5L classification systems. The formula for the Shannon index is shown in Eq. 2:

$$H' = -\sum_{i=1}^{L} p_i \log_2 p_i \tag{2}$$

where \(H'\) represents the absolute amount of informativity captured, \(L\) is the number of possible levels, and \(p_i = n_i/N\), where \(n_i\) is the observed number of responses in the \(i\)th level (\(i = 1, \ldots, L\)) and \(N\) is the total sample size [32]. The higher the Shannon index, the more information is obtained by the classification system. In the case of an equal (rectangular) distribution, meaning that all levels have the same number of responses, the optimal amount of information is captured and \(H'\) reaches its upper limit (\(H'_{\max}\)), which equals \(\log_2 L\); therefore, if the number of levels (\(L\)) increases, \(H'_{\max}\) rises accordingly. The Shannon evenness index (J′) is defined as \(J' = H'/H'_{\max}\) and indicates the evenness of the distribution regardless of the number of levels. The Y-5L is expected to capture more information and was therefore hypothesised to have a higher Shannon index.
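A minimal sketch of Eq. 2 and the evenness index is shown below. Levels with no responses are treated as contributing zero to the sum, which is the usual convention; the function name and the example responses are hypothetical.

```python
from collections import Counter
from math import log2


def shannon_indices(responses: list[int], n_levels: int) -> tuple[float, float]:
    """Return (H', J') for one domain, where J' = H' / log2(n_levels)."""
    n = len(responses)
    counts = Counter(responses)  # unobserved levels are absent and contribute 0
    h = -sum((c / n) * log2(c / n) for c in counts.values())
    return h, h / log2(n_levels)


# Hypothetical responses for one domain from ten respondents (not study data).
y3l_domain = [1] * 6 + [2] * 3 + [3] * 1
y5l_domain = [1] * 5 + [2] * 2 + [3] * 2 + [4] * 1

h_3l, j_3l = shannon_indices(y3l_domain, n_levels=3)
h_5l, j_5l = shannon_indices(y5l_domain, n_levels=5)
```

With a rectangular distribution the function returns \(H' = \log_2 L\) and \(J' = 1\), matching the upper limit described above.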

2.3.5 Test–Retest Reliability

P-MIC data included a question that asked participants about their health at baseline and at follow-up, with participants selecting between ‘the same’, ‘worse’ or ‘better’. Among the participants who completed the baseline survey, a subset of 169 were allocated to the 2-day follow-up period and responded. From this group, 115 participants (68%) who reported ‘the same’ health in response to the question above were included in the test-retest reliability analysis.

The test-retest reliability of the instruments was analysed by comparing the initial survey responses and the 2-day follow-up survey responses using EQ-VAS categories. EQ-VAS scores were used as there are currently no value sets available for the EQ-5D-Y instruments in Australia, and EQ-5D-Y test-retest reliability using level sum scores is presented in a separate paper (28). As the EQ-VAS is scaled from 0 to 100, even slight differences in response (for example, 70 vs. 72) might have a large impact on test-retest reliability. To circumvent this, the EQ-VAS scores were grouped into 10 categories, and the agreement between these categories at baseline and at follow-up (2 days) was then assessed. The reliability of the Y-3L and Y-5L was also assessed by estimating weighted kappa coefficients for each domain between baseline and the 2-day follow-up to estimate concordance. Kappa shows the extent of agreement between two sets of data collected at two different time points [33]. Interpretation of agreement using kappa coefficients was prespecified as follows: kappa ≤ 0.20 indicates poor agreement, 0.21–0.40 indicates fair agreement, 0.41–0.60 indicates moderate agreement, 0.61–0.80 indicates substantial agreement, and kappa ≥ 0.81 indicates almost perfect agreement [34]. Test-retest analysis on a domain basis has been published elsewhere [30].
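The agreement analysis can be sketched as below. Two choices are assumptions, since the text specifies neither: the EQ-VAS cut-points (here 0–9, 10–19, …, 90–100) and the weighting scheme for kappa (here linear weights). Function names and example scores are hypothetical.

```python
def vas_category(score: int) -> int:
    """Map an EQ-VAS score (0-100) to a category 0..9.
    ASSUMED bins: 0-9, 10-19, ..., 90-100 (cut-points are not given in the text)."""
    if not 0 <= score <= 100:
        raise ValueError("EQ-VAS score must be between 0 and 100")
    return min(score // 10, 9)


def weighted_kappa(time_1: list[int], time_2: list[int], n_categories: int) -> float:
    """Cohen's weighted kappa with ASSUMED linear weights |i - j| / (k - 1).
    Categories must be coded 0..k-1 (subtract 1 from EQ-5D-Y domain levels)."""
    n = len(time_1)
    observed = [[0.0] * n_categories for _ in range(n_categories)]
    for a, b in zip(time_1, time_2):
        observed[a][b] += 1 / n
    row = [sum(observed[i]) for i in range(n_categories)]
    col = [sum(observed[i][j] for i in range(n_categories)) for j in range(n_categories)]
    disagree_obs = disagree_exp = 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            weight = abs(i - j) / (n_categories - 1)
            disagree_obs += weight * observed[i][j]
            disagree_exp += weight * row[i] * col[j]
    return 1.0 - disagree_obs / disagree_exp  # undefined if only one category is used


def interpret(kappa: float) -> str:
    """Agreement bands as prespecified in the text."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical EQ-VAS scores at baseline and 2-day follow-up (not study data).
baseline = [70, 85, 40, 95, 100, 55]
follow_up = [72, 80, 45, 90, 95, 65]
kappa_vas = weighted_kappa(
    [vas_category(s) for s in baseline],
    [vas_category(s) for s in follow_up],
    n_categories=10,
)
```

Grouping means that the 70 vs. 72 example from the text falls in the same category and counts as full agreement.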

3 Results

3.1 Sample and Demographics

In total, 5945 respondents completed both the Y-3L and Y-5L, with 2083 surveys completed by the proxies of children aged 5–18 years (of which 979 were for children aged 5 or 6 years) and 3862 self-completed by children aged between 7 and 18 years. In the total sample, 25.7% of the children were part of the general population sample, while the remainder were part of condition groups or were hospital recruits. The mean age of the child in the proxy group was 9 years (± 2.4) and the mean age for the self-report group was 11.9 years (± 3.7). The sample was 44.9% female and 53.8% male in the proxy group, and 46.7% female and 51.7% male in the self-report group. The remainder of the participants were from other gender groups or preferred not to answer the question. Among the proxies, 95.1% were the children’s parents, 2% were grandparents, 0.4% were siblings and 2.5% were other types of carers. The proportions of stated problems on the Y-3L and Y-5L are reported in Table 1. More details regarding respondent characteristics, including by age and condition, can be found in the papers by Jones et al. [20, 30, 35].

Table 1 Description of responses for proxy-reports [n = 2083] and self-reports [n = 3862]

3.2 Psychometric Results

3.2.1 Ceiling Effect

Table 2 shows the ceiling effect percentages and their absolute and relative changes for the Y-5L compared with the Y-3L in the group of children with conditions, the general population group, and the combined overall sample. The proportion of respondents who reported no problems was lower for the Y-5L than for the Y-3L. The relative reduction in ceiling effect was 14.02% in the proxy group and 16.65% in the self-report group. Both versions, Y-3L and Y-5L, of the questionnaire present a similar absolute ceiling effect reduction (Additional file 1: Table S1 shows the proportion of responses at level 1 for each domain).

Table 2 Changes in ceiling effect in different condition groups

Both instruments demonstrated ceiling effect issues, especially for the asthma and dental problem condition groups for both proxy and self-reports. The absolute change in ceiling effect between instruments was highest in ‘asthma’ for both groups (Table 2). Epilepsy had a negative change in proxy reports; however, the change is only 1% and it is therefore not a pattern that can be interpreted with any confidence.

3.2.2 Inconsistency and Redistribution

Cross-tabulations of responses to the Y-3L and Y-5L showed that participants reported health across the EQ-5D-Y domain levels. Tables 3a and 3b show the proportions of consistent and inconsistent responses, with consistent levels shown in bold. The highest inconsistency was for ‘feeling worried, sad or unhappy’ and the lowest was for ‘mobility’ in both the self-report and proxy groups. The proportion of inconsistent responses ranged from 3.42% to 11.30% for the self-report data and from 2.40% to 10.99% for the proxy-report data. A related table for the proportion of consistent levels can be found in Additional file 1: Table S2. Most of the inconsistent responses came from participants choosing level 2 on the Y-3L and level 1 on the Y-5L.

Table 3 Redistribution properties from Y-3L to Y-5L

Table 4a and b present responses for Y-3L and Y-5L domains for the condition groups. In general, the Y-5L responses demonstrate a redistribution across most domains and condition groups, underscoring the advantages offered by the additional levels in the Y-5L measure. Mobility has the highest percentage of ‘no problem’ in both groups.

Table 4 Domain responses for Y-3L and Y-5L in different condition groups

Results show preliminary evidence supporting the validity of the instrument across various conditions, as distributions align with expectations based on the type of condition. For example, there has been a reduction in the ceiling effect in the ‘feeling worried, sad or unhappy’ domain for participants reporting mental health conditions such as ADHD, anxiety, depression, ASD, and eating disorders. For children with mental health conditions, higher severity levels were reported for the ‘feeling worried, sad or unhappy’ domain, leading to a broader distribution of responses.

Figures 1 and 2 show how the Y-3L levels are distributed across the Y-5L levels. The results indicate that responses for ‘feeling worried, sad or unhappy’ are more widely spread across the response levels than those for other domains, with more responses in the severe levels. The response pattern is relatively consistent across both groups; for instance, ‘mobility’ has the highest number of respondents in the ‘no problem’ response level.

Fig. 1 Sankey diagrams for level proportions for proxy-reports at baseline

Fig. 2 Sankey diagrams for level proportions for self-reports at baseline

3.2.3 Criterion Validity

Table 5 shows the results of Spearman correlations between the Y-3L and Y-5L. In both the proxy and self-report data, each domain correlated most strongly with the same domain on the other instrument, consistent with expectations. The correlation of the same domains for Y-5L and Y-3L was high (> 0.5). The highest correlation was for the domain ‘looking after myself’ in both groups (proxy-report = 0.84 and self-report = 0.75), and the lowest correlation was for the domain ‘feeling worried, sad or unhappy’ in both groups (proxy-report = 0.64 and self-report = 0.65).

Table 5 Criterion validity of Y-3L and Y-5L in proxy-reports (n = 2083) and for self-reports (n = 3862) using correlation coefficients

3.2.4 Informativity

The results of the analysis of informativity of the Y-3L and Y-5L are presented in Table 6. The Shannon index (H′) showed a gain in all domains, indicating better informativity and discriminatory performance of the Y-5L. A higher Shannon index for the Y-5L can be interpreted as responses being spread more widely across the levels. The Shannon evenness index (J′) shows that the relative informativity of the Y-3L and Y-5L is almost comparable, with only a marginal decrease in all domains.

Table 6 Discriminatory and informativity power between Y-3L and Y-5L

3.2.5 Test–Retest Reliability

The kappa between the EQ-VAS scores categorised in 10 groups was 0.74 (p < 0.001) for the proxy-reported group, indicating substantial agreement, while the corresponding kappa for the self-reported group was 0.51 (p < 0.001), indicating moderate agreement. The kappa coefficients for the domains are presented in Table 7; the p-value for all domains was < 0.001. Table 7 shows that the reliability pattern is not consistent for either of the instruments and that it is domain-driven. Mobility had the highest agreement in both groups.

Table 7 Kappa coefficients for test-retest for Y-3L and Y-5L (n = 115)

4 Discussion

The aim of this study was to assess the psychometric performance of the Y-5L in comparison with the Y-3L, in both self-report and proxy-report responses. The psychometric performance of the Y-5L compared with the Y-3L was reported in terms of ceiling effects, criterion validity, inconsistency and redistribution properties, informativity (discriminatory power), and test-retest reliability. Overall, the Y-5L is a valid and reliable extension of the Y-3L. The domains have a high correlation, suggesting evidence of criterion validity. In addition, the Y-5L showed superior discriminatory power, an improved distribution, and slightly reduced ceiling effects compared with the Y-3L (some conditions showed a substantial reduction in ceiling effect). Ceiling effects were still present for the Y-5L; however, they were lower than for the Y-3L. The lower ceiling effect for the Y-5L has also been seen in other studies of children [9, 12] and of young adult and adult populations [13, 36]. The ceiling effect was high in the proxy-reported data, which might be caused by different reporting patterns, with children potentially more likely to report a problem. For both groups, the highest proportion of problems was reported for ‘feeling worried, sad or unhappy’, while the percentage of reported problems for ‘mobility’ was small. These results are consistent with the results of another study comparing the youth instruments [9]. The Y-5L has a lower ceiling effect compared with the Y-3L, which demonstrates that the increase in levels leads to more granular responses when reporting mild health problems. A reduction in ceiling effect also suggests that improvement (e.g. from Level 2 to Level 1) over time may be captured on the Y-5L which would not be detected by the Y-3L, although responsiveness testing is needed to test this. All the condition groups showed a lower ceiling effect for the Y-5L compared with the Y-3L, except for the epilepsy proxy-reported data.

The low percentage of inconsistencies found when comparing the two instruments shows that the Y-3L data redistribute logically to the Y-5L, and that the Y-5L and Y-3L descriptive systems are comparable. This could in part be due to a high ceiling effect, where both instruments tend to show that children, especially those from the general population, do not have any problems in a certain area, even with a refined response scale. This has been previously observed in studies of both adults and young adults in general populations [36,37,38].

The two instruments showed a high degree of correlation based on the prespecified criteria. The remaining variation is driven by the level differences, which could be due to the inconsistencies in wording across the levels; the reasons for these differences could be investigated further in qualitative research. The highest inconsistency was related to ‘feeling worried, sad or unhappy’, which also had the greatest difference in the Shannon index, i.e. it gained the most discriminatory power when changing from the 3L to the 5L. Hence ‘feeling worried, sad or unhappy’ is the domain that changes most when adding levels to the Y-3L in our data. The greater number of differences observed in the ‘feeling worried, sad or unhappy’ domain is likely due to the larger number of responses that are not level 1, meaning greater potential for inconsistency.

The inclusion of additional levels in the instrument appears to have had an impact on the distribution of responses within the condition groups, particularly within the ‘pain’ and ‘feeling worried, sad or unhappy’ domains. These two domains also showed more variability in another youth comparison study [9]. It is conceivable that changing the level 2 description in the Y-5L from ‘some problem’ to ‘little problem’ may have influenced participants to select level 2 over level 1, or that having a more extreme level in the Y-5L has caused the redistribution from level 3 to levels 4 and 5. The spread of responses across different levels, especially from level 3 of the Y-3L to levels 4 and 5 of the Y-5L, has also been seen in adult condition groups [14]. The more detailed data obtained with the Y-5L might be helpful for clinicians, alongside clinical trials, and in other surveys. This could be further tested in studies that include assessment of responsiveness when comparing the two instruments across different condition groups.

Extending the EQ-5D-Y descriptive system from three to five levels resulted in higher absolute discriminatory power. Furthermore, the extension to five levels resulted in a diverse range of responses across severity levels for most condition groups, consistent with findings from an adult comparison study [13]. The relative evenness of the Y-5L was comparable with or slightly lower than that of the Y-3L. This trend aligns with observations from previous comparative studies involving adults [13, 27, 39].

In summary, the Y-5L provides more nuanced information about the severity of young people’s HRQoL by adding response levels; however, the reliability and responsiveness of the Y-5L still need to be determined for different condition groups. Both versions of the EQ-5D-Y are useful tools for measuring HRQoL in young people, and the choice of which one to use will depend on the specific research question, condition group, study design, and available resources.

This study is not without limitations, one of which is that our analysis relied on using level sum scores due to the current lack of Australian value sets for either the Y-3L or Y-5L. These value sets are, however, currently in development. Future research can extend the analysis by preference weighting the descriptive data. Another limitation was the relatively small sample size for the 2-day interval used to evaluate test-retest reliability, compared with the sample used for the other analyses conducted in this study. The study does not include a dyad sample to facilitate a direct comparison between self- and proxy-reports. Consequently, we assessed self-reported and proxy-reported data separately. Future research using dyad samples would allow for formal comparison of the psychometric characteristics of self- and proxy-reported data.

Overall, the results confirm the psychometric performance of Y-5L in the Australian paediatric population. The Y-5L is considered a valid and reliable instrument for measuring HRQoL in children and adolescents. The Y-5L may provide slightly more detail compared with the Y-3L and could be considered for use in clinical practice/clinical trials or other evaluations.