Key Points for Decision Makers

This study showed that the adapted EQ-5D-Y-3L and EQ-5D-Y-5L had good overall psychometric performance in children aged 2–4 years.

The adapted EQ-5D-Y-3L and EQ-5D-Y-5L were shown to have improved psychometric performance compared with the original EQ-5D-Y-3L and EQ-5D-Y-5L for children aged 2–4 years.

The adapted EQ-5D-Y shows promise as a multi-attribute utility instrument in young children, where there are few options for utility measurement. Next steps include research to determine appropriate valuation and scoring to enable use in economic evaluation.

1 Introduction

Although generic patient-reported outcome (PRO) measures for the adult population are well established for a variety of purposes, this is currently not the case for paediatric populations [1, 2]. Various Health Technology Assessment (HTA) authorities, including the Pharmaceutical Benefits Advisory Committee (PBAC) and the Medical Services Advisory Committee (MSAC) in Australia, and the National Institute for Health and Care Excellence (NICE) in the UK, have highlighted the clear lack of PRO development and a need for PRO evidence to support decisions relating to paediatric populations [3, 4]. The demand for such tools is not surprising as young children are substantial consumers of healthcare resources and services [5].

In recent years, there has been an increase in research promoting the development, validation, and valuation of paediatric health-related quality of life (HRQoL) measures [6]. Despite the growing number of measures now available for the estimation of quality-adjusted life-years (QALYs) for paediatric populations, no generic preference-weighted measures are currently recommended for use in young children (under 5 years of age) [7]. The absence of a recommended instrument/s was highlighted in a systematic review of NICE appraisals. The review concluded that the lack of a recommended instrument meant that young children’s (< 5 years of age) HRQoL was either missing or being inappropriately obtained by using adult preference measures or direct elicitation techniques [8]. Without a suitably validated preference-accompanied measure, decision makers’ ability to efficiently allocate health care resources across age groups is compromised.

There are a myriad of reasons that may have contributed towards the absence of validated preference-weighted HRQoL instruments in young children. As proxy reporting is unavoidable for young children, this immediately creates differences with self-reported measurement in older children and adults, which may result in measurement biases such as proxy response shifts [9, 10]. The validity of the proxy report is additionally complicated for young children because it requires assessment of HRQoL that may be unobservable, difficult to observe or require significant proxy interpretation (e.g., detecting emotional concerns in a child under 5 years of age). In addition, there is the important measurement issue of capturing the right aspects of HRQoL for young paediatric populations, which may be different to the HRQoL domains relevant for older children or adults. While it may be desirable to measure a different construct of HRQoL in young children, this will produce additional difficulties associated with consistency of approaches and values generated across childhood and into adulthood.

The EQ-5D-Y, a preference measure for the paediatric population, was designed and developed by an international task force for self-reported use in children between the ages of 8 and 15 years, with proxy versions available for use in ages below 8 years [11, 12]. The EQ-5D-Y was developed by utilising the existing structure and content of the EQ-5D, which is a well-researched and validated instrument in the adult population. The overall content was found to be age-appropriate and relevant for the measurement of HRQoL in children, with minor changes made to the wording to improve measurement feasibility and relevance. The EQ-5D-Y consists of a descriptive system that contains five items referring to the dimensions: mobility, usual activity, looking after themselves, pain or discomfort, and worried, sad, or unhappy. In addition to the three-level (Y-3L) version, which is growing in use and is accompanied by several local value sets, there is also a five-level (Y-5L) version that has experimental status with a protocol under development to generate value sets [13].

One issue with paediatric HRQoL measures is the extent to which the same instrument can be used in different paediatric age ranges. For example, across childhood, children are developmentally distinct, with periods of rapid development. Age-specific HRQoL measures may be needed to account for these differences. This was highlighted by the Professional Society for Health Economics and Outcomes Research (ISPOR) Good Research Practices Taskforce, which noted four distinct age groups to consider for PROs: < 5 years, 5–7 years, 8–11 years, and 12–18 years, with the possibility of further subdividing the under 5 years age group into 0–2 years and 2–4 years [14]. Although the EQ-5D-Y was not initially designed for young children under 5 years of age, the instrument does contain some comparable dimensions to validated non-preference-weighted instruments for children aged 2–4 years, such as the Paediatric Quality of Life Inventory (PedsQL) [15, 16]. Prior research suggests that adaptations could be made to improve relevance in younger age groups without substantially altering the content or integrity of the instrument, as has been the case for the Child Health Utility 9D (CHU9D) instrument [17].

In related research, the co-authors of this study used focus groups to assess the appropriateness of Y-3L for children aged 2–4 years and to co-design an adapted version that would be considered more appropriate and relevant by parents/carers of young children aged 2–4 years [18]. The adapted version was co-designed with researchers (health economists, a paediatrician, and a psychologist) and parents/carers of young children aged 2–4 years with a wide range of health conditions [18]. The research focused on children aged 2–4 years as it was assumed that the current Y-3L measure was appropriate for proxy use in 5- to 7-year-olds and that children under 2 years of age would require alternative thinking. Findings from that qualitative research indicated that the content of Y-3L could be broadly suitable for use in 2- to 4-year-olds, but that the instrument’s acceptability and interpretability could be enhanced by the addition of examples to the dimension headers and some minor changes to wording in two dimensions. With that in mind, the research team co-produced an adapted version of the Y-3L for the 2- to 4-year-old age group in readiness for further testing. Pilot testing with parents/carers showed that the adapted instrument was easy to complete and that the wording and examples were considered appropriate and age-relevant. Given that the changes involved adding examples to dimension headers, they are considered equally applicable to both the Y-3L and Y-5L.

It was hypothesised that adaptations co-designed for the EQ-5D-Y for parent proxy reporting of children aged 2–4 years would capture appropriate aspects of HRQoL and would assist parents to identify and report both observable and unobservable domains. This is an important first step in establishing whether a measure largely consistent with measures for older children could be valid.

Given that the adapted Y-3L and Y-5L were recently co-developed, this study aimed to assess the validity, reliability, and responsiveness of these instruments in children aged 2–4 years. The psychometric performance of the adapted versions was compared with that of the original (unadapted) Y-3L and Y-5L in the same sample of children and families. It was hypothesised that the adapted versions would have improved psychometric performance and would be more sensitive at detecting decrements in HRQoL compared with the original versions in children aged 2–4 years.

2 Methods

2.1 Sample

Survey data were collected via the Australian paediatric multi-instrument comparison (P-MIC) study [19]. Participants were screened for eligibility and provided informed consent. Participants included parents/caregivers of children aged 2–18 years (inclusive), as well as their child/ren. Participants were either recruited through the Royal Children’s Hospital in Melbourne, Australia, or via an online survey company, Pureprofile, in Australia. Participants recruited through the hospital sample included parents/caregivers and children with a wide range of health conditions, while participants recruited through Pureprofile included a general population and a sample of parents/caregivers of children with a variety of health conditions. For this study, only survey data from Data Cut 2 (10 August 2022) of parents/caregivers of children aged 2–4 years, including a 4-week follow-up, were used [20]. The 4-week follow-up was used as this was considered an appropriate time frame for responsiveness.

2.2 Survey

The initial survey included questions about the participants’ sociodemographic characteristics and current general health status (including the child’s health condition/s). The initial and follow-up surveys presented participants with multiple HRQoL instruments, including the adapted and original Y-3L and Y-5L descriptive systems and the visual analogue scale (EQ VAS). A range of other HRQoL instruments were included, such as the PedsQL, Health Utility Index (HUI2/3), and CHU9D. Parents/caregivers of children aged 2–4 years completed the instruments as proxies (parent/caregiver rating), providing assessments from their perspective of the child. Due to the block design of the study, not all HRQoL instruments were offered to all participants. For example, not all participants who completed the adapted Y-3L completed the original Y-5L. As such, the total number of participants completing each instrument varies. The study was designed this way to minimise the burden on respondents, as it was not considered feasible to ask all respondents to complete each instrument in the context of the wider P-MIC study. Feedback about respondent burden was obtained from a consumer advisory group during the design phase of this study. The order of the HRQoL instruments was randomised to minimise ordering effects, and EuroQol instruments were separated by a non-EuroQol instrument to minimise repetition [21]. The order of instrument administration was consistent for each participant in the follow-up survey. The 4-week follow-up survey included the HRQoL instruments as well as questions about changes in the child’s health status. Survey responses were collected and stored through REDCap [22].

2.2.1 Special Healthcare Needs

A child was determined to have special healthcare needs (SHCNs) by using the validated SHCN screener [23]. The screener involves five questions relating to the child’s current health condition, use of medication, and healthcare services. SHCNs were used to categorise the distribution of responses and known groups for validity testing as it is expected that children with SHCNs would have poorer HRQoL compared with children without SHCNs [24].

2.3 Health-Related Quality of Life Instruments

The EQ-5D-Y consists of a descriptive system that contains five items referring to the dimensions: mobility, usual activity, looking after themselves, pain or discomfort, and feeling worried, sad, or unhappy [11, 12, 25]. The EQ-5D-Y asks the respondent to choose the level for each dimension that best describes their health today. The Y-3L response levels range from having ‘no problems’ at level 1 to ‘a lot of problems’ at level 3, whereas the Y-5L response levels range from having ‘no problems’ at level 1 to ‘extreme problems’ at level 5 [26]. The EQ-5D-Y is accompanied by the EQ VAS, a visual analogue scale that ranges from 0 (worst imaginable health) to 100 (best imaginable health).

The adapted EQ-5D-Y was developed with permission of the EuroQol Research Foundation for proxy use in children aged 2–4 years [18]. In this publication, the modified EQ-5D-Y will be referred to as the ‘adapted EQ-5D-Y for 2- to 4-year-olds’. The five dimensions of the EQ-5D-Y were maintained; however, the level wording and examples were altered to improve relevance and appropriateness for proxy reporting about children aged 2–4 years (electronic supplementary material [ESM] Fig. 1). The adapted EQ-5D-Y is intended to be used by an adult rater, and not self-completed by the child. The qualitative study that adapted the instrument has reported the adaptation process and the justification for each of the modifications [18]. The key adaptations included the following.

  • Each dimension was given an example.

  • Mobility dimension: ‘walking about’ was changed to ‘movement’ in the level wording.

  • Mobility, looking after themselves, and usual activity dimensions: ‘at an age-appropriate level’ was added to the dimension examples.

  • Looking after themselves dimension: ‘washing or dressing’ was changed to ‘helping look after themselves’ in the level wording.

  • Pain or discomfort, worried, sad, or unhappy dimensions: ‘unusually’ was added to the example.

Responses on the instruments were analysed both at the dimension level (proportion of respondents reporting problems at each level of severity in each dimension) and using the level sum score (LSS) to summarise results on the descriptive system. The LSS is an equally weighted sum of the levels across all dimensions and is typically used in descriptive and exploratory analyses and when a value set is not available to provide a summary utility score for EQ-5D-Y health states. The Y-3L and Y-5L LSS can range from 5 (no problems on all dimensions) to 15 or 25 (worst level on all dimensions), respectively.
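As a minimal illustration of the LSS described above, the following sketch sums the five dimension levels of a single profile. The function name and the example profiles are hypothetical and are not taken from the study's analysis code, which was written in R.

```python
def level_sum_score(profile, n_levels):
    """Return the equally weighted sum of the five dimension levels.

    profile  -- list of five integers, one level per dimension
    n_levels -- 3 for the Y-3L, 5 for the Y-5L
    """
    if len(profile) != 5:
        raise ValueError("an EQ-5D-Y profile has exactly five dimensions")
    if any(level < 1 or level > n_levels for level in profile):
        raise ValueError("each level must lie between 1 and n_levels")
    return sum(profile)


# Hypothetical profiles: best and worst possible health states.
best_3l = level_sum_score([1, 1, 1, 1, 1], n_levels=3)
worst_3l = level_sum_score([3, 3, 3, 3, 3], n_levels=3)
worst_5l = level_sum_score([5, 5, 5, 5, 5], n_levels=5)
print(best_3l, worst_3l, worst_5l)
```

Because every dimension carries the same weight, two different profiles can share the same LSS, which is why the LSS is a descriptive summary rather than a utility.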

A Y-3L value set was applied to estimate the utility of health states described by the adapted and original Y-3L on the grounds that the dimensions and levels are the same between the two. In the absence of an Australian value set for Y-3L, the Netherlands [27] value set was used. Although available value sets for the Y-3L differ in terms of characteristics, emerging evidence suggests value sets for European countries share similar characteristics [28]. Analysis of utility-weighted data was restricted to the Y-3L as value sets for Y-5L are not available. A full analysis is provided in the ESM.

PedsQL is a generic, non-preference-weighted HRQoL instrument developed for use in children aged 2–18 years [15]. PedsQL contains 23 items across four dimensions: physical functioning, emotional functioning, social functioning, and school functioning. The total PedsQL score is calculated by taking the average score of the four dimensions, where a score of 100 is equal to perfect health and a score of 0 is equal to worst possible health.

2.4 Psychometric Performance

Feasibility, convergence (see the ESM), distribution of responses, ceiling effect, known-group validity, reliability, and responsiveness were evaluated for each instrument to assess psychometric performance. Comparisons between the adapted and original instruments are guided by prespecified methods, and interpretation of results is guided by prespecified protocols outlined in the QUOKKA P-MIC technical methods guide and psychometric guide [20].

2.4.1 Feasibility

Feasibility was measured by examining the respondents’ reported difficulty completing the instruments. Chi-square tests were used to determine the significance of differences between the proportions of reported difficulty.
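The comparison of two proportions with a chi-square test can be sketched as follows. This is an illustration only: the counts are invented, the function name is hypothetical, and the sketch uses the Pearson statistic without a continuity correction, which may differ from the settings used in the study's R analysis.

```python
import math


def chi_square_2x2(difficult_a, total_a, difficult_b, total_b):
    """Pearson chi-square test (no continuity correction) for two proportions.

    Assumes every cell has a non-zero expected count.
    """
    observed = [
        [difficult_a, total_a - difficult_a],
        [difficult_b, total_b - difficult_b],
    ]
    n = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [
        observed[0][0] + observed[1][0],
        observed[0][1] + observed[1][1],
    ]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Invented counts: 10 of 100 versus 20 of 100 respondents reporting difficulty.
statistic, p_value = chi_square_2x2(10, 100, 20, 100)
print(round(statistic, 3), round(p_value, 3))
```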

2.4.2 Distribution of Responses

The proportion of instrument responses at each dimension level was compared for the total sample and children with and without SHCNs [29]. A greater proportion of responses in the more severe levels of the dimensions (i.e., higher frequency of respondents reporting at level 3 for the Y-3L, or at levels 3/4/5 for the Y-5L) for children with health conditions was considered a positive indicator of the instrument’s ability to identify expected HRQoL decrements. A greater distribution of responses was considered desirable, given the concern regarding a new modification’s ability to detect important HRQoL decrements.

2.4.3 Ceiling Effect

The percentage of respondents reporting level 1 in all dimensions for each instrument was calculated at the individual dimension level and profile level. Ceiling effects were considered high if the percentage of respondents scoring perfect health on an instrument was ≥15% [30]. Ceiling effects are expected to be lower in the subpopulations of children with conditions or SHCNs.
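A minimal sketch of the two ceiling calculations, using invented profiles and a hypothetical function name, is given below. It applies the ≥ 15% cut-off at the profile level only.

```python
def ceiling_effects(profiles, threshold=15.0):
    """Return per-dimension and profile-level ceiling percentages.

    profiles  -- list of profiles, each a list of dimension levels
    threshold -- percentage at or above which the ceiling is flagged as high
    """
    n = len(profiles)
    n_dims = len(profiles[0])
    # Share of respondents at level 1 in each dimension separately.
    by_dimension = [
        100.0 * sum(1 for p in profiles if p[d] == 1) / n for d in range(n_dims)
    ]
    # Share of respondents at level 1 in every dimension (e.g., 11111).
    profile_level = 100.0 * sum(
        1 for p in profiles if all(level == 1 for level in p)
    ) / n
    return by_dimension, profile_level, profile_level >= threshold


# Invented Y-3L responses from four respondents.
example_profiles = [
    [1, 1, 1, 1, 1],
    [1, 2, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [2, 1, 1, 3, 1],
]
by_dimension, profile_level, is_high = ceiling_effects(example_profiles)
print(by_dimension, profile_level, is_high)
```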

2.4.4 Known-Group Validity

Known-group validity (construct validity) assesses an instrument’s ability to discriminate between groups of individuals identified as having different health characteristics [31]. Known groups were prespecified to include:

  • children with or without SHCNs;

  • caregiver perception of child’s current health being poor/fair/good or very good/excellent;

  • EQ VAS score less than or greater than 80 [32];

  • PedsQL total score less than or greater than 74.9 (one standard deviation below the proxy-reported population mean for children aged 2–4 years) [15];

  • children with or without a series of specific chronic conditions reported by the parent/caregiver (including asthma, attention-deficit hyperactivity disorder [ADHD], autism, behavioural problems, constipation, developmental delays, eczema, food allergy, hay fever, and sleeping problems).

Known-group validity was measured by examining the difference in the groups’ mean LSS. The non-parametric Mann–Whitney U test was used to test the difference in LSS between the groups. Cohen’s D effect sizes were calculated to determine the size of any between-group difference, and were categorised based on standard thresholds of 0.2–0.49, 0.50–0.79, and ≥ 0.8, indicating small, medium, and large effect sizes, respectively [33].
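The two quantities used for known-group validity can be sketched as below. The LSS values are invented, the function names are hypothetical, and the pooled standard deviation form of Cohen's D is an assumption, since the exact variant used in the study is not stated here. Only the U statistic is computed; obtaining its p-value would require a statistics library.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's D using the pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: each pair with a > b counts 1, ties count 0.5."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def effect_size_label(d):
    """Categorise the magnitude of D using the thresholds stated above."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"


# Invented LSS values for children without and with SHCNs.
lss_no_shcn = [5, 5, 6, 5, 6]
lss_shcn = [7, 8, 6, 9, 7]
d = cohens_d(lss_no_shcn, lss_shcn)
u = mann_whitney_u(lss_no_shcn, lss_shcn)
print(round(d, 2), u, effect_size_label(d))
```

A negative D here simply reflects that the group without SHCNs has the lower (better) mean LSS; the label is based on the absolute value.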

2.4.5 Test–Retest Reliability

Test–retest reliability assesses the reliability of an instrument to generate a consistent score between two time points, assuming there have been no changes in health status between the two time points. Only participants who completed the 4-week follow-up survey and reported no changes in the child’s general health were included in the analysis of test-retest reliability.

For the reliability of LSS, intraclass correlation coefficients (ICCs) were calculated using a two-way mixed effects model with absolute agreement. ICC values of < 0.4, 0.4–0.59, 0.60–0.74, and ≥ 0.75 were indicative of poor, fair, good, and excellent reliability, respectively [34].
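For readers who want to see how an absolute-agreement ICC is built from mean squares, a minimal sketch for two time points follows. It assumes the single-measure form, which is an assumption because the text does not state whether single or average measures were used; the function names and scores are hypothetical.

```python
def icc_absolute_agreement(baseline, follow_up):
    """Single-measure, two-way, absolute-agreement ICC for two time points."""
    n, k = len(baseline), 2
    rows = list(zip(baseline, follow_up))
    grand_mean = (sum(baseline) + sum(follow_up)) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(baseline) / n, sum(follow_up) / n]
    # Two-way ANOVA decomposition: subjects (rows), time points (columns), error.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for r in rows for x in r)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def reliability_label(icc):
    """Categorise an ICC using the thresholds stated above."""
    if icc < 0.4:
        return "poor"
    if icc < 0.6:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# Invented LSS values: every follow-up score is one point above baseline.
icc = icc_absolute_agreement([5, 6, 7, 8], [6, 7, 8, 9])
print(round(icc, 3), reliability_label(icc))
```

In this example the scores are perfectly consistent in rank but shifted by one point, so the absolute-agreement ICC is below 1; a consistency-type ICC would not penalise the shift.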

For reliability of individual dimensions of instruments, Cohen’s weighted kappa coefficients were calculated using linear weights (equal penalty) [35], where coefficient values of ≤ 0.20, 0.21–0.40, 0.41–0.60, 0.61–0.80, and ≥ 0.81 were indicative of poor, fair, moderate, substantial, and almost perfect agreement, respectively [34].
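A linearly weighted kappa for one dimension can be sketched as follows, where each disagreement is penalised in proportion to the number of levels separating the two responses. The function name and responses are hypothetical.

```python
def linear_weighted_kappa(baseline, follow_up, n_levels):
    """Cohen's kappa with linear disagreement weights |i - j| / (n_levels - 1).

    Undefined (division by zero) if every response at both time points
    falls on the same single level.
    """
    n = len(baseline)
    levels = range(1, n_levels + 1)
    max_distance = n_levels - 1
    # Marginal proportions of each level at each time point.
    p_base = {level: baseline.count(level) / n for level in levels}
    p_follow = {level: follow_up.count(level) / n for level in levels}
    observed = sum(
        abs(b - f) / max_distance for b, f in zip(baseline, follow_up)
    ) / n
    expected = sum(
        p_base[i] * p_follow[j] * abs(i - j) / max_distance
        for i in levels
        for j in levels
    )
    return 1.0 - observed / expected


# Invented Y-3L responses on one dimension for four respondents.
kappa = linear_weighted_kappa([1, 2, 3, 1], [1, 2, 3, 2], n_levels=3)
print(round(kappa, 3))
```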

2.4.6 Responsiveness

Responsiveness assesses an instrument’s ability to detect changes in HRQoL over time [36]. Responsiveness was measured by examining differences in mean LSS at baseline compared with follow-up according to caregiver perceived and reported changes in the child’s health. Reported changes in child health involved caregivers reporting whether their child’s health had worsened, not changed, or improved at follow-up. The expected changes in LSS were a decrease in LSS for improved health, no change in LSS for no change in health, and an increase in LSS for worsened health.

The mean LSS for the two time points were compared using a paired t-test. A standardised response mean (SRM) was calculated to determine the magnitude of the change. SRM was calculated by dividing the mean change in LSS by the standard deviation of the change between the two time points [33]. SRM values of < 0.2, 0.2–0.49, 0.5–0.79, and ≥ 0.8 were indicative of trivial, small, medium, and large effect sizes, respectively [33].
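The SRM calculation can be sketched in a few lines. The scores below are invented and the function names are hypothetical; the sample standard deviation is assumed, and the magnitude is categorised on the absolute value so that the sign only carries the direction of change.

```python
import statistics


def standardised_response_mean(baseline, follow_up):
    """Mean change (follow-up minus baseline) divided by the SD of the change."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return statistics.mean(changes) / statistics.stdev(changes)


def srm_label(srm):
    """Categorise the magnitude of an SRM using the thresholds stated above."""
    size = abs(srm)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "trivial"


# Invented LSS values for four children whose health improved.
srm = standardised_response_mean([7, 8, 6, 9], [6, 8, 5, 7])
print(round(srm, 3), srm_label(srm))
```

A negative SRM here indicates a fall in LSS at follow-up, which corresponds to improved health.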

Descriptive responsiveness was assessed using the Paretian classification of health change, alongside probability of superiority [37]. Paretian classification of health change compares the health state between baseline and follow-up. Health is considered ‘improved’ if at least one dimension is better without any other being worse, with the inverse for ‘worsened’ health. Health is considered ‘mixed’ when at least one dimension is worse while any other one is better, and considered ‘the same’ if there are no changes. Probability of superiority, a non-parametric effect size, is the probability that within a randomly sampled pair of dependent scores, the score obtained at follow-up will be smaller (i.e., better) than the score obtained at baseline. A probability of superiority of < 0.5, = 0.5 or > 0.5 was considered indicative of a higher probability of worsened health, equal probability of improvement and worsened (or no change), and a higher probability of improved health, respectively.
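Both descriptive measures can be sketched as below. The function names and data are hypothetical, and counting tied pairs as one half in the probability of superiority is an assumption made for this illustration rather than a detail confirmed by the text.

```python
def paretian_class(baseline, follow_up):
    """Classify the change between two profiles (lower level = better)."""
    better = any(f < b for b, f in zip(baseline, follow_up))
    worse = any(f > b for b, f in zip(baseline, follow_up))
    if better and worse:
        return "mixed"
    if better:
        return "improved"
    if worse:
        return "worsened"
    return "the same"


def probability_of_superiority(baseline_scores, follow_up_scores):
    """Share of paired scores where follow-up is lower (better) than baseline.

    Tied pairs are counted as one half (an assumption of this sketch).
    """
    pairs = list(zip(baseline_scores, follow_up_scores))
    wins = sum(1 for b, f in pairs if f < b)
    ties = sum(1 for b, f in pairs if f == b)
    return (wins + 0.5 * ties) / len(pairs)


# Invented profiles and LSS values.
print(paretian_class([2, 1, 1, 2, 1], [1, 1, 1, 2, 1]))
print(paretian_class([2, 1, 1, 1, 1], [1, 2, 1, 1, 1]))
ps = probability_of_superiority([7, 8, 6, 9], [6, 8, 5, 7])
print(ps)
```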

Statistical analyses were performed in RStudio version 4.0.2 (The R Foundation for Statistical Computing, Vienna, Austria) [38]. Additional R packages used for the analysis included psych [39] and effsize [40]. Statistical significance was determined by a p-value < 0.05.

3 Results

3.1 Descriptive Statistics

A total of 842 parents/caregivers of children aged 2–4 years completed the survey, with 33.1% of children having an SHCN (Table 1). Of the total sample, there were slightly more males (54%) and children aged 4 years (39%) (Table 1). A total of 572, 264, 572 and 270 parents/caregivers completed the adapted Y-3L, original Y-3L, adapted Y-5L and original Y-5L, respectively (ESM Table 1). Participants who completed the original Y-3L all completed the adapted Y-3L, and those who completed the original Y-5L all completed the adapted Y-5L (ESM Table 1). Overall, the samples were relatively similar, with the adapted Y-3L and Y-5L samples having slightly younger children with more SHCNs than the original Y-3L and Y-5L samples. The distribution of caregivers’ perception of the child’s health appeared similar between the adapted Y-3L and Y-5L and the original Y-3L and Y-5L.

Table 1 Descriptive statistics

3.2 Feasibility

Most respondents found both the adapted and original versions easy to complete, with no significant differences between the rates of respondents reporting difficulties (6%, 5%, 4% and 4% of respondents reporting difficulties for the adapted and original Y-3L and the adapted and original Y-5L, respectively) [ESM Fig. 2].

3.3 Distribution of Responses

The adapted Y-3L had a greater proportion of responses in the more severe levels (levels 2 and 3) in most of the dimensions, particularly for the usual activity (28.5% vs. 16.7%), mobility (14.2% vs. 4.2%), and pain or discomfort (21.9% vs. 16.7%) dimensions, compared with the original Y-3L (Fig. 1, ESM Table 4). The adapted Y-3L had a greater proportion of responses in the highest level for all dimensions compared with the original. When focusing on children with SHCNs, the adapted Y-3L had a greater proportion of responses in the highest level for all dimensions compared with the original.

Fig. 1

Distribution of instrument-level responses of adapted EQ-5D-Y-3L and original EQ-5D-Y-3L by total sample and children with and without SHCNs. Frequency of responses for instrument dimensions are displayed. The frequency of responses at each level is depicted as: yellow = 1, turquoise = 2, purple = 3. a Adapted EQ-5D-Y-3L total sample; b original EQ-5D-Y-3L total sample; c adapted EQ-5D-Y-3L for children with no SHCNs; d original EQ-5D-Y-3L for children with no SHCNs; e adapted EQ-5D-Y-3L for children with SHCNs; f original EQ-5D-Y-3L for children with SHCNs. SHCNs special healthcare needs

The difference in the spread of responses was more evident when comparing the adapted and original Y-5L (Fig. 2, ESM Table 5). The adapted Y-5L had a greater proportion of responses in the higher levels (levels 3, 4 and 5) in all dimensions, except for worried, sad, or unhappy. When focusing on children with SHCNs, the adapted Y-5L had a greater proportion of responses in the higher levels (levels 3, 4, and 5) for mobility, usual activities, and pain or discomfort.

Fig. 2

Distribution of instrument-level responses of adapted EQ-5D-Y-5L and original EQ-5D-Y-5L by total sample and children with and without SHCNs. Frequency of responses for instrument dimensions are displayed. The frequency of responses at each level is depicted as: yellow = 1, light green = 2, turquoise = 3, blue = 4, purple = 5. a Adapted EQ-5D-Y-5L total sample; b original EQ-5D-Y-5L total sample; c adapted EQ-5D-Y-5L for children with no SHCNs; d original EQ-5D-Y-5L for children with no SHCNs; e adapted EQ-5D-Y-5L for children with SHCNs; f original EQ-5D-Y-5L for children with SHCNs. SHCNs special healthcare needs

3.4 Ceiling Effect

At the individual dimension level, the adapted Y-3L and Y-5L had lower ceiling effects compared with the original for each dimension (except the ‘looking after self’ dimension) in the total sample and subsample of children with SHCNs. The adapted Y-5L also had lower ceiling effects for each dimension compared with the adapted Y-3L.

High ceiling effects at the profile level were observed for all instruments in the total sample; however, this was less so for the adapted and original Y-5L (33% and 27%, respectively) compared with the adapted and original Y-3L (42% and 40%, respectively) (Table 2). The proportion of individuals reporting perfect health/no issues was almost halved for the subsample of children with SHCNs for all instruments, although the instruments (except Y-5L) were still considered to have high ceiling effects (proportions > 15%) [20]. The high ceiling effect was absent for the subsample where the parents’ perception of the child’s health was less than very good (good/fair/poor) [adapted Y-3L: 11.5%; original Y-3L: 8.3%; adapted Y-5L: 9.9%; original Y-5L: 11.5%].

Table 2 Ceiling effects of the original and adapted EQ-5D-Y-3L and EQ-5D-Y-5L

3.5 Known-Group Validity

All instruments were able to discriminate mean LSS between known groups where the child has an SHCN, the parents’ perception of the child’s health was less than very good (good/fair/poor), EQ VAS scores < 80, and PedsQL total score < 74.9 (Table 3). The group differences in mean LSS were all statistically significant, with a range of moderate to large Cohen’s D effect sizes. The adapted instruments had a wider difference in LSS between the known-groups and had larger Cohen’s D effect sizes compared with the original versions.

Table 3 Known-group validity—Cohen’s D effect sizes of the original and adapted versions of the EQ-5D-Y-3L and EQ-5D-Y-5L for different known health groups

For discriminating between children with and without SHCNs, the adapted Y-5L had a larger LSS difference of −2.93 with a larger effect size of 1.01 (0.83–1.19) compared with a −2.00 LSS difference and effect size of 0.83 (0.52–1.14) for the original Y-5L. Similarly, the adapted Y-3L had a larger LSS difference of −1.81 with a similar effect size of 1.03 (0.85–1.21) compared with a −1.35 LSS difference and effect size of 1.02 (0.73–1.21) for the original Y-3L.

All instruments were able to discriminate between children with and without health conditions (e.g., asthma). The group mean differences of mean LSS were all significant (except for the asthma group in the original Y-5L), with a range of small to large Cohen’s D effect sizes. The largest effect sizes were seen for autism, behavioural problems, and developmental delay. The adapted Y-3L had larger Cohen’s D effect sizes for 8 of the 10 health conditions (including asthma, autism, and behavioural problems) compared with the original, whereas the adapted Y-5L had larger Cohen’s D effect sizes for all 10 health conditions reported compared with the original.

Comparable results were seen in the known-group analysis when using utilities to preference weight the profile data, with the adapted Y-3L also having larger Cohen’s D effect sizes compared with the original Y-3L for the majority of the known-groups (ESM Table 6). The original Y-3L had a larger Cohen’s D effect size for autism and ADHD when scored as utilities, whereas the adapted Y-3L had larger effect size for these conditions using LSS.

3.6 Test–Retest Reliability

When using LSS, the adapted Y-3L had larger ICC scores with fair to excellent reliability compared with the original version with fair reliability [ICC 0.81 (0.76–0.85) vs. 0.59 (0.38–0.47), respectively] (Table 4). Similarly, the adapted Y-5L had larger ICC scores with fair to excellent reliability compared with the original version with poor to fair reliability.

Table 4 Test-retest reliability—weighted kappa coefficients and ICC scores of the original and adapted versions of the EQ-5D-Y-3L and EQ-5D-Y-5L using level-summed scores

When using utility scores, the adapted Y-3L also performed better for test-retest reliability than the original Y-3L (ICC 0.75 vs. 0.47) [ESM Table 7].

Similarly, when looking at the dimension level, the adapted versions had larger weighted kappa coefficients compared with the original (Table 4), indicating superior test-retest reliability. The adapted Y-3L and Y-5L had mostly moderate to substantial agreement of the five dimensions compared with mostly no agreement to poor agreement for the original Y-3L and Y-5L.

3.7 Responsiveness

The majority of participants reported no changes in the child’s health after 4 weeks (Table 5). The change in LSS between baseline and follow-up appeared to change in the hypothesised direction for each instrument, with improved health having a lower LSS at follow-up, no change in health having no/very small differences in LSS between both time points, and worsened health having a higher LSS at follow-up compared with baseline. Due to small sample sizes, responsiveness results of the adapted and original instruments were not formally compared; the original Y-3L and Y-5L responsiveness results are described in ESM Table 8.

Table 5 Level sum score difference and standardised response mean of the adapted versions of the EQ-5D-Y-3L and EQ-5D-Y-5L for changes in health at follow-up

For improved health at follow-up, the adapted Y-5L had an LSS change of −0.61, with a statistically significant small SRM of 0.27, with the LSS change for the Y-3L being non-significant (Table 5). For worse health at follow-up, the adapted Y-3L had an LSS difference of 0.86, with a small non-significant SRM of 0.34.

Similar results were seen in the responsiveness analysis using changes in caregiver perception of the child’s health using LSS and utilities. The adapted Y-3L had a statistically significant SRM for better and worse health (ESM Tables 9 and 10). The adapted Y-5L had an LSS change of −1.18, with a statistically significant small SRM of 0.36 (ESM Table 9).

4 Discussion

4.1 Summary

The adapted Y-3L and Y-5L demonstrated overall good psychometric performance, with improved distribution of responses, known-group validity, and test-retest reliability compared with the original versions. Strong convergence between the adapted and original instruments was seen for both the Y-3L and Y-5L (ESM Tables 2 and 3). All instruments were able to discriminate between multiple known-groups, had no ceiling effects when focusing on the parent-reported health status of their child being poor/fair/good, and showed some responsiveness with a small effect size consistent with the hypothesised change in LSS at follow-up. Additionally, there were no differences between the instruments in terms of participant-reported difficulties.

4.2 Distribution of Responses

The adapted Y-3L and Y-5L had a greater proportion of responses in the more severe levels compared with the original Y-3L and Y-5L, more so in the subsample of children with SHCNs. These results were in line with the prespecified hypotheses, as the adapted versions had improved level wording and examples that were more appropriate and relevant in children aged 2–4 years compared with the original instruments. There was some evidence that the adapted versions showed improved sensitivity to detect issues. The shift in the distribution of responses for the adapted instruments was evident in the mobility, looking after themselves, and usual activity dimensions. This is not surprising as most changes made to the wording and examples in the adapted version were for these dimensions. The addition of ‘at an age-appropriate level’ was also only used for the examples of these dimensions [18]. Prompting the proxy reporter to consider age-appropriateness when reporting their child’s HRQoL may be related to the larger proportion of responses in the more severe levels. These changes could possibly be due to reversing response shifts. For example, if a child has an illness that has prevented them from ever playing on the playground, this activity may not be part of the child’s usual activity. As such, a parent may report ‘no problems’ in usual activity for this child. However, when asked to think about how usual activities relate to ‘age appropriate’ expectations, they might identify and report problems. Some parents may have shifted in the other direction. For example, a parent of a well child completing the original instrument may indicate issues because the child has not met a certain milestone; however, the addition of ‘age appropriate’ to the examples of the adapted version may result in parents moving from reporting issues to no issues, as the child may not yet be at the appropriate age to meet these milestones. However, the net movement was from less severe to more severe levels.

Further research is needed to qualitatively identify the extent to which each change in the adapted version leads to an increased ability to appropriately capture decrements in HRQoL.

The use and framing of examples for each dimension could have implications for the wider suite of EuroQol instruments. For example, the qualifier ‘at an age-appropriate level’ may help to improve the sensitivity of the instruments for both very young and old individuals. Additional studies are needed to explore how the framing of tasks affects proxy responses. Likewise, the change in wording to ‘movement’ rather than ‘walking about’ may apply to other populations, such as older children and the elderly, where walking is not the only valued aspect of mobility.

4.3 Ceiling Effect

Higher ceiling effects are commonly reported for short PRO measure instruments, such as the EQ-5D-Y. This is likely a consequence of their design, as with so few questions it is relatively easy to obtain the best HRQoL response for every question. It is unclear whether the ceiling effect cut-off of 15% is equally appropriate for all instruments, and whether exceeding the cut-off is an issue for general population samples given the hypothesised prevalence of good health. The high ceiling effect at the profile level in this study was more than halved when focusing on children with SHCNs, and completely removed when focusing on children with health reported as poor, fair, or good. This is one of the first studies to show that the Y-5L version has lower ceiling effects compared with the Y-3L version in a sample of children with health conditions.
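To make the ceiling effect calculation concrete, the following is a minimal sketch, assuming responses are stored as five-digit profile strings (e.g. 11111 for the best level on every dimension). The function names and the example responses are hypothetical and are not taken from the study data; only the 15% cut-off comes from the text.

```python
from typing import Sequence

CEILING_CUTOFF = 0.15   # the 15% cut-off referred to in the text
BEST_PROFILE = "11111"  # best level on all five dimensions


def profile_ceiling(profiles: Sequence[str]) -> float:
    """Share of respondents reporting the best level on every dimension."""
    return sum(p == BEST_PROFILE for p in profiles) / len(profiles)


def dimension_ceiling(profiles: Sequence[str], dim: int) -> float:
    """Share of respondents reporting level 1 on one dimension (0-indexed)."""
    return sum(p[dim] == "1" for p in profiles) / len(profiles)


# Made-up responses for illustration only (not study data).
example_profiles = ["11111", "11211", "21111", "11111", "12321"]

share = profile_ceiling(example_profiles)
print(f"profile-level ceiling: {share:.0%}, above cut-off: {share > CEILING_CUTOFF}")
```

Under these assumptions, the profile-level ceiling can never exceed the ceiling of any single dimension, since a respondent at 11111 is at level 1 on every dimension; the two measures can nonetheless move in opposite directions when two versions of an instrument are compared.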

When comparing the adapted EQ-5D-Y against the original, the adapted instruments had a slightly higher ceiling effect at the profile level, but a slightly lower ceiling effect at the individual dimension level. This is likely due to the shifts in responses described above; however, it is difficult to ascertain the size and direction of the effect of each adaptation made to the instrument. Additional research would be useful to explore the impact of each of the instrument adaptations on the reported HRQoL.

4.4 Help Looking After Themselves

The dimension of looking after themselves had the most wording changes in the adapted version. There was a larger proportion of responses at the tail ends for the adapted EQ-5D-Y compared with the original for this dimension (i.e., a larger proportion of responses at level 1 and level 3 for the adapted Y-3L, or level 4 and 5 for the adapted Y-5L). Looking after themselves was the only dimension for which the adapted versions had a larger ceiling effect than the original versions. Although higher ceiling effects would typically be seen as problematic, this increase highlights the likely improvement in the descriptive system achieved by the adaptations. Between the ages of 2 and 4 years, children are developing rapidly and are transitioning from being fully dependent to becoming increasingly able to do things for themselves. The original wording of the ‘looking after self’ dimension in the EQ-5D-Y implies independence; therefore, parents completing the instrument for their 2- to 4-year-old may report issues in this dimension because the wording is inappropriate for this age. Including a dimension example, adding ‘at an age-appropriate level’ to the dimension example, and changing the level wording from ‘washing or dressing’ to ‘with helping look after themselves’ likely led parents/caregivers not to report issues that are due to age rather than health.

4.5 Limitations

As a result of the block design in which not all instruments were offered to all participants, most respondents for the adapted versions were recruited from a hospital sample, whereas most respondents for the non-adapted versions were from an online sample [20]. As expected, this resulted in differences seen in the proportion of children with SHCNs in each group, which could impact the interpretation of results, particularly for the distribution of responses and ceiling effects. To minimise this possible effect, the data were subsampled by children with SHCNs where possible.

Sample sizes were small for some of the condition-specific known-group and responsiveness analyses [36]. Despite the sample sizes being below what is recommended, there was consistency between the known health condition groups and the more general known groups (i.e., children with and without SHCNs), with both showing larger Cohen’s d effect sizes (with 95% confidence intervals displayed) in the adapted versions compared with the original. For the responsiveness analyses, despite the small sample sizes for the worse health groups, changes in HRQoL were consistent with hypothesised changes in mean LSS. Caution should still be taken in interpreting these results.
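For readers less familiar with the effect size used in the known-group comparisons, the following is a minimal sketch of Cohen’s d with an approximate 95% confidence interval. It assumes the pooled-standard-deviation form of d and a common large-sample approximation for its standard error; the exact variant used in this study is not restated here, and the function names and LSS values are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def cohens_d(x: Sequence[float], y: Sequence[float]) -> float:
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * statistics.variance(x)
                  + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var)


def cohens_d_ci(x: Sequence[float], y: Sequence[float],
                z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% CI for d (large-sample approximation, an assumption here)."""
    n1, n2 = len(x), len(y)
    d = cohens_d(x, y)
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se


# Made-up LSS values for illustration only (higher LSS = worse health).
lss_with_condition = [7, 8, 9]
lss_without_condition = [5, 6, 7]

print(cohens_d(lss_with_condition, lss_without_condition))
print(cohens_d_ci(lss_with_condition, lss_without_condition))
```

With groups this small the interval is very wide, which illustrates why effect sizes from the smaller condition-specific groups warrant cautious interpretation.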

LSSs were used rather than utility scores as no value sets currently exist for the Y-5L and adapted instruments. There are some limitations to using LSS [41]. As the LSS is calculated by simply summing the level scores of the five EQ-5D-Y dimensions, multiple profiles result in the same LSS value. For example, the Y-3L profiles 11321 and 22112 both generate an LSS of 8, yet their conversion into utility scores would likely result in two very different values. In addition, although a profile that gives an LSS of 8 nominally indicates worse health than a profile that gives an LSS of 7, the utility value of the profile with an LSS of 7 may be worse because the ranges of utility values for adjacent LSS values overlap widely. These issues are less problematic at the tail ends of the LSS range, as only a small number of profiles generate the same LSS there (e.g., only one profile has an LSS of 5), and are more of an issue for the middle LSSs. For example, there are 381 Y-5L profiles that each generate an LSS of 15. It is encouraging that the supplementary analysis using utilities (for the adapted and original Y-3L) largely confirmed the results produced from the LSS analysis. Further analyses will prove valuable once more value sets become available. As the adapted Y-3L does not have its own value set, the Dutch EQ-5D-Y-3L value set was used to undertake these supplementary analyses. There are a few limitations to this. Although the adapted Y-3L is a modification of the original Y-3L, the two instruments are not identical (despite sharing the same dimensions and levels); therefore, preferences could in principle differ between the original and adapted versions. Furthermore, it is unknown whether preferences for health states will differ for the younger children covered by the adapted version of the Y-3L. The valuation protocol to create a value set for the Y-3L asks participants to consider a 10-year-old child [42]. It is plausible that participants may respond differently for a child of a different age. Further research is required to test whether preferences differ for the health states generated by the adapted Y-3L and Y-5L instruments compared with the original versions, and if so, how to address that.
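The many-to-one relationship between profiles and the LSS can be checked by enumeration. The following is a minimal sketch, assuming profiles are written as five-digit strings of level scores; the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def lss(profile: str) -> int:
    """Level sum score: the sum of the five dimension levels of a profile."""
    return sum(int(level) for level in profile)


def profiles_per_lss(n_levels: int, n_dims: int = 5) -> Counter:
    """Count how many distinct profiles produce each possible LSS value."""
    levels = range(1, n_levels + 1)
    return Counter(sum(combo) for combo in product(levels, repeat=n_dims))


# The two Y-3L profiles from the text share the same LSS.
print(lss("11321"), lss("22112"))  # 8 8

# For a five-level instrument: one profile at LSS 5, 381 profiles at LSS 15.
y5l_counts = profiles_per_lss(5)
print(y5l_counts[5], y5l_counts[15])  # 1 381
```

The enumeration covers all 3125 possible five-level profiles and shows how the number of profiles sharing an LSS peaks in the middle of the range and falls to one at either extreme.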

There is a range of limitations and biases with proxy reporting in young children. Although studies have shown conflicting evidence around parent-child agreement, it is not feasible to assess this in a very young age group. Although researchers have concerns about the validity of the proxy reporter to appropriately measure unobservable concepts of HRQoL, it is possible that the new co-designed framing of the examples may assist the parent proxy reporter in the absence of a self-report alternative. Qualitative research is needed to explore how the modifications of the adapted Y-3L and Y-5L are interpreted by parents and caregivers.

4.6 Additional Research

As the adapted instrument is newly developed, additional research is needed to assess whether the instrument is equally well accepted in other countries, settings, and samples. The adapted EQ-5D-Y does not cover children below 2 years of age, as further changes to the instrument would likely be needed for this unique infant/toddler age group, as demonstrated by the wider range of dimensions qualitatively developed as part of the EQ-TIPS [43]. Research comparing the psychometric performance of the EQ-5D-Y alongside other existing preference-weighted measures in children aged 2–4 years, and against the well-validated PedsQL, would be useful to distinguish relative acceptability, feasibility, and validity for different groups of young children. Additional research validating the adapted EQ-5D-Y in various disease groups is also needed. A comparison of the adapted EQ-5D-Y against other instruments available for this age range, such as the CHU9D, the newly developed Health Utilities Pre-School (HuPS), and the HUI2/3, is also needed to assess relative psychometric performance. The psychometric evidence for the HuPS has shown excellent inter-rater reliability between parent and clinician proxy reporting, as well as good feasibility and respondent burden [44]. Psychometric evidence for the HUI2/3 and CHU9D exists; however, no evidence is currently available for children aged 2–4 years. The benefit of the adapted EQ-5D-Y is that it is part of a suite of instruments that are widely recommended by HTA bodies globally, with potential for high levels of consistency across ages. Additionally, the family of instruments could be used to measure HRQoL over a full age range, allowing for cost-effectiveness analyses from different age groups to be directly compared or incorporated within a single economic model. How utilities transition when switching between each age-specific instrument is still unknown.

Given the good psychometric performance associated with the changes made to the descriptive system of the adapted EQ-5D-Y, the question arises as to how best to value this instrument, which will be required to support its practical use. It is unclear whether the existing value sets of the Y-3L (for older children) could be applied to preference-weight profile data collected using the adapted EQ-5D-Y, and if not, what the unique considerations may be in generating values for this adapted version. In part, this links to wider methodological questions: should age-specific instruments be accompanied by age-specific value sets, and if not, what alternatives exist? [45] On the basis of significant differences in development and life stage, one might expect the stated preferences of adults for the health of 2- to 4-year-olds to differ from those for the health of a 10-year-old (the age currently used in the Y-3L value-set protocol). However, emerging research has shown minimal impact on values when health states are framed by different child ages. In the study by Ramos-Goñi et al., the lowest age considered was a 5- to 7-year-old perspective [46]. In a similar study by Retra et al., although the authors note that the EQ-5D-Y health states were not valued significantly differently when the description of age differed, their findings do show that half the health states (three of six) were valued statistically differently (one borderline) between a 4-year-old and a 16-year-old [47]. Furthermore, this study did not include an own-self adult perspective, which would likely show larger differences [47]. Whether or not values are age-dependent may depend on the methodology used. Further research is required to test whether preferences differ for the health states generated through the adapted Y-3L and Y-5L instruments, and if so, to generate or map those preferences.

5 Conclusion

The adapted Y-3L and Y-5L demonstrated good psychometric performance in children aged 2–4 years. Both adapted versions had large Cohen’s d effect sizes for almost all health condition known groups, with the largest effect sizes seen for autism, developmental delays, and behavioural problems. The adapted instruments also had strong test-retest reliability and responsiveness.

The adapted versions appear to have improved reliability, validity, and responsiveness, and lower ceiling effects at the individual dimension level, compared with the original Y-3L and Y-5L for use in children aged 2–4 years. This study also suggests that the adapted Y-5L may be superior in the 2- to 4-year age group, as the instrument showed an improved distribution of responses (greater sensitivity to poor health states) and smaller ceiling effects (at the dimension and profile level) in both the total and SHCN samples when compared with the adapted Y-3L.