Key Points for Decision Makers

There is currently a lack of evidence on which instruments should be used to measure health-related quality of life (HRQoL) in children; this study produces substantial evidence to support decision makers in interpreting and using HRQoL evidence from these instruments.

This study compared the performance of the PedsQL, EQ-5D-Y-3L, EQ-5D-Y-5L, CHU9D, AQoL-6D, and HUI3 descriptive systems in a large, diverse sample of Australian children using an exploratory level sum score approach, finding that the EQ-5D-Y-5L and CHU9D performed well on all psychometric attributes evaluated in the total sample. Furthermore, the PedsQL, EQ-5D-Y-3L, AQoL-6D, and HUI3 performed well on most psychometric attributes evaluated in the total sample, with all instruments demonstrating evidence of known-group, convergent, and divergent validity.

Because results for ceiling effects, test–retest reliability, and responsiveness varied by child age and report type, future instrument users should consider the decision-making context, target population, and study design when deciding which instrument will be most appropriate.

1 Introduction

Paediatric health-related quality of life (HRQoL) instruments aim to aid in understanding how a child’s health impacts different aspects of their life, such as physical, emotional, and social wellbeing [1, 2]. Generic paediatric HRQoL instruments focus on aspects of health that are relevant to most children and can be used in any child population, enabling comparisons across conditions and settings [3]. Generic HRQoL instruments may or may not be accompanied by preference weights: a weighting algorithm, derived from the preferences of the general public, that scores the instrument responses into a single value anchored at 1 (full health) and 0 (equivalent to dead). Where accompanied by preference weights, generic HRQoL instruments can be used to generate quality-adjusted life-year (QALY) estimates, which are then used in economic evaluation [4]. This evidence is formally used in many countries to inform healthcare resource allocation [5, 6]. HRQoL instruments with or without accompanying preference weights can be used in population studies, clinical registries, or routine clinical care as patient-reported outcome measures (PROMs) [7, 8]. Evidence from these uses of HRQoL instruments informs public funding decisions, access to treatments, and clinical care outcomes. It is therefore crucial to have good evidence on the measurement accuracy of generic HRQoL instruments.
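To make the QALY logic concrete, a minimal sketch in Python follows; the utility values and durations are hypothetical, and in practice the utilities would come from preference-weighted instrument scores.

```python
# Minimal sketch of QALY estimation: each period of time lived is
# weighted by a utility value anchored at 1 (full health) and
# 0 (equivalent to dead). All values below are hypothetical.

def qalys(periods):
    """periods: iterable of (utility, years) tuples."""
    return sum(utility * years for utility, years in periods)

# e.g., 2 years at utility 0.9 plus 1 year at utility 0.6 = 2.4 QALYs
print(qalys([(0.9, 2.0), (0.6, 1.0)]))  # 2.4
```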

An instrument’s measurement accuracy can be informed by assessment of its psychometric properties. Although there is some evidence available regarding the individual psychometric performance of generic paediatric HRQoL instruments [9], there is a paucity of evidence on the comparative psychometric performance of these instruments (i.e., how these instruments perform compared with one another in the same sample) in terms of validity, reliability, and responsiveness [10, 11]. There are challenges in measuring HRQoL in children compared with adults, which may be why such evidence is still lacking [12]. Unlike adults, children have a broad developmental range, precluding the use of a single approach for measuring HRQoL in all children [13]. There have been five previous reviews of generic paediatric HRQoL instruments [9,10,11, 14, 15]. A 2015 systematic review examined HRQoL instruments accompanied by preference weights that had previously been used in paediatric populations: the Adolescent Health Utility Measure, Assessment of Quality of Life (AQoL-6D), Child Health Utility (CHU9D), the EuroQol Group’s EQ-5D-Y-3L, Health Utilities Index Mark 2 (HUI2) and Mark 3 (HUI3), 16D, 17D, and the Quality of Wellbeing Scale (QWB) [11]. The review concluded that more evidence on the comparative validity of these instruments is required [11]. Another review was conducted in 2015 to assess the psychometric performance of a wider range of paediatric HRQoL instruments [15], including 35 generic paediatric HRQoL instruments with and without accompanying preference weights identified in a previous review [16]. The 2015 review highlighted the lack of evidence regarding the test–retest reliability and responsiveness of instruments, with no included studies having evaluated instrument responsiveness [15]. A 2021 systematic review examined the psychometric performance of key generic paediatric HRQoL instruments: the CHU9D, EQ-5D-Y-3L, HUI2, and HUI3 [10]. That review identified key limitations in the current evidence that preclude clear conclusions on the comparative performance of instruments [10]. Key limitations included the small sample sizes of some studies, and differences in study design, statistical methods, and samples (age and condition) across studies, which limit accurate assessment of comparative performance [10]. A 2023 systematic review updated and expanded the 2021 review by including a wider set of generic instruments and studies [9]. The 2023 review explored the level of evidence available for each psychometric attribute, highlighting test–retest reliability, among others, as a psychometric attribute with the least evidence and where further research is required [9]. Except for the Janssens et al. 2015 review [15], a limitation of these reviews is their focus on HRQoL instruments that have accompanying preference weights, which misses the comparative performance of other generic paediatric HRQoL instruments that are commonly used in clinical care, population studies, and clinical registries, such as the Pediatric Quality of Life Inventory (PedsQL), which is commonly used in Australia.

The Australian Paediatric Multi-Instrument Comparison Study (P-MIC) was undertaken to address the limitations and gaps in the current evidence base for paediatric HRQoL instruments, with a focus on improving decision making in the Australian context [17]. The P-MIC study is the first of its kind to compare common paediatric HRQoL instruments head-to-head across a wide range of child ages and health statuses [17]. Results published in a separate manuscript outline the quality of the P-MIC study data and the acceptability and feasibility of the included HRQoL instruments to children and their caregivers [18]. Using survey data from the P-MIC study, this paper aims to compare the psychometric performance of the PedsQL, EQ-5D-Y-3L, EQ-5D-Y-5L, CHU9D, AQoL-6D, and HUI3 descriptive systems by child age, report type (self- vs proxy report), and child health status.

2 Methods

P-MIC study data from children aged 5–18 years (inclusive) were used [17, 19]. P-MIC participants (children and their caregivers) were recruited between June 2021 and August 2022 into three samples: Sample 1, children with or without health conditions recruited via a large tertiary paediatric hospital based in Victoria, Australia; Sample 2, general population children recruited via an online panel available nationally (Pureprofile Australia); and Sample 3, children from nine condition-specific groups (attention-deficit/hyperactivity disorder (ADHD), anxiety and/or depression, autism spectrum disorder (ASD), asthma, eating disorder, epilepsy, recurrent abdominal pain, sleep problems, and tooth problems) recruited via the same online panel as above or, for rarer conditions, via patient organisations associated with the condition. P-MIC study data were from Data Cut 2, dated 10 August 2022, which includes approximately 94% of the total planned P-MIC participants.

2.1 Data Collection

All participants consented and completed an initial survey online via REDCap. Participants were then asked to complete a second online follow-up survey at 4 weeks. A small subset of participants from the online panel general population sample (Sample 2) were asked to complete the follow-up survey at 2 days to enable assessment of test–retest reliability.

All instruments were completed directly by the respondent (i.e., no instruments were interviewer administered) and were either proxy reported by the caregiver or self-reported by the child. Children aged 7 years or older who were deemed by their caregiver as currently able to complete questions about their health completed the HRQoL instruments themselves (child self-report); otherwise, the instruments were completed by the caregiver (proxy report). Where an instrument was proxy reported, the proxy was asked to rate the child’s health from their own (i.e., the caregiver’s) perspective.

For further information on P-MIC study methodology, including details of participant recruitment (i.e., quotas), survey structure, instruments, survey questions, and statistical analysis plans, please see the technical methods paper [19].

2.2 Instruments

The PedsQL core generic version 4.0, EQ-5D-Y-3L, EQ-5D-Y-5L, CHU9D, AQoL-6D adolescent, and HUI3 were included in both the initial and follow-up surveys. As per the prespecified protocol [17], the PedsQL, EQ-5D-Y-3L, EQ-5D-Y-5L, and CHU9D were included in the core set of instruments received by all participants, and the HUI3 and AQoL-6D were included as additional instruments that only some participants were randomised to receive. Although the study team wanted to include all instruments for all participants, the consumer group expressed concern about responder burden during the design phase of the study; hence, the study team made efforts to reduce responder burden where possible. The HUI3 and AQoL-6D were not included in the sample recruited via hospital (Sample 1) to minimise responder burden (following patient feedback), and in the online panel samples (Samples 2 and 3), participants were randomised to receive the HUI3, the AQoL-6D, or another generic instrument not included in this analysis. A summary of participants who received each instrument is available in Table 1, and characteristics of the instruments included in the analysis are available in Supplementary Table 1 (see electronic supplementary material [ESM]). The order of the core set of instruments (PedsQL, EQ-5D-Y-3L, EQ-5D-Y-5L, and CHU9D) was randomised to minimise order effects, and there was always another instrument between the EQ-5D-Y-3L and EQ-5D-Y-5L given their similarity. The AQoL-6D or HUI3 was completed after the other generic HRQoL instruments. Where participants were allocated to an instrument, they were required to answer all instrument questions; hence, there are no missing HRQoL instrument data.

Table 1 Participant characteristics by child age, report type, and child health status

The priority for which instruments to include in the study, and whether to include them in the core set (received by all participants) or the additional set (received only by participants randomised to them), was determined by the study team following a review of key literature available at the time of study design [10, 11, 15] and consultation with experts (including clinical, health technology assessment, health economics, government, and consumer experts). The decision was guided by the following factors: (1) instruments commonly used to measure HRQoL in children (instruments were prioritised if they had evidence of strong psychometric performance from single studies), (2) recently developed instruments for measuring HRQoL in children that were likely to be commonly used in future, and (3) instruments that would be useful in informing policy and healthcare decision making in Australia. It was not a requirement that instruments had preference weights available, although the study team did consider which instruments had preference weights available at the time of study design and which were likely to have preference weights available in future. For example, although the EQ-5D-Y-5L and PedsQL did not have preference weights available at the time of study design, they were considered likely to have preference weights available in future. Further details on the justification for the inclusion of each instrument are available in the published study protocol [17].

2.3 Instrument Scoring Used for Analysis

The PedsQL total score was calculated by reverse scoring and linearly transforming raw item responses (0 = 100, 1 = 75, 2 = 50, 3 = 25, 4 = 0), then dividing the sum of all item scores by the number of items [20]. PedsQL domain scores were calculated using a similar approach, where raw items were linearly transformed and the sum of all item scores in a domain was divided by the number of items in that domain [20]. An exploratory level sum score (LSS) approach was used to obtain an overall instrument total score for all other instruments. LSSs were calculated by summing the numerical value attached to each item response (e.g., 1 for ‘no problems’ and 5 for ‘extreme problems’ in the EQ-5D-Y-5L) for all items in the instrument. The total score range possible for each instrument varies and is described in Supplementary Table 1 (see ESM). The LSS approach is considered exploratory: it has the advantage of providing an equally weighted score for comparison, but also disadvantages, such as a non-normal distribution and an inability to distinguish between health states that may be quite different from one another [21]. In addition, preference weights were not available for all instruments included in this study, and the aim of this analysis was to understand the descriptive systems of each instrument.
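As an illustration of the two scoring rules described above, a minimal sketch in Python follows; the item responses are hypothetical, and actual item counts and level ranges per instrument are in Supplementary Table 1.

```python
# Sketch of the PedsQL transform-and-average rule and the exploratory
# level sum score (LSS). Item responses below are hypothetical.

# PedsQL: reverse score and linearly transform raw responses (0-4 scale)
PEDSQL_TRANSFORM = {0: 100, 1: 75, 2: 50, 3: 25, 4: 0}

def pedsql_score(raw_items):
    """Mean of transformed items; the same rule yields total and domain
    scores (domain scores use only that domain's items)."""
    transformed = [PEDSQL_TRANSFORM[r] for r in raw_items]
    return sum(transformed) / len(transformed)

def level_sum_score(raw_items):
    """Exploratory LSS: sum the numerical level of each response,
    e.g., 1 = 'no problems' ... 5 = 'extreme problems' for EQ-5D-Y-5L."""
    return sum(raw_items)

print(pedsql_score([0, 1, 2, 0]))        # 81.25 on the 0-100 scale
print(level_sum_score([1, 1, 2, 3, 1]))  # 8 (EQ-5D-Y-5L LSS range: 5-25)
```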

2.4 Statistical Analysis

Analyses were completed in Stata Version 17 (StataCorp, Texas, US). Statistical tests, hypotheses, and thresholds were based on the statistical analysis protocol set a priori by the study team, which is available in the technical methods paper [19]. Where appropriate, subgroup analyses were completed using the following prespecified subgroups: child age (5–12 years vs 13–18 years), report type (proxy vs self-report), and health status (children without a special healthcare need vs children with a special healthcare need) [19, 22]. The child age subgroups (5–12 years and 13–18 years) reflect key child development stages (pre-adolescence and adolescence), and this age cut point is consistent with the age versions of the PedsQL, one of the most well-validated paediatric HRQoL instruments [23]. Adjusting for multiple comparisons was not required in the primary analyses, as all statistical tests were hypothesis driven or included different samples. Adjusting for multiple comparisons may have been applicable for the subgroup analyses; however, given this is not commonly performed in psychometric research, subgroup analyses were not adjusted for multiple comparisons.

2.4.1 Distribution of Responses

Distribution of responses was evaluated by descriptively assessing participant responses to each instrument item; the distributions were visually inspected. Additionally, ceiling and floor effects for each instrument’s total score were assessed. As this study includes general population children and children with health conditions, ceiling effects were assessed only in children with a special healthcare need [22], as these children were expected to report health problems on HRQoL instruments. An instrument was considered to have a ceiling or floor effect if > 15% of participants with a special healthcare need reported the lowest severity category (e.g., ‘no problems’) or the highest severity category, respectively, across all items. This 15% threshold is based on previous thresholds used in the literature [24, 25].
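A minimal sketch of this check in Python, assuming an array of item-level responses coded so that the lowest level is the best health state (the data and 5-level coding below are hypothetical):

```python
import numpy as np

def ceiling_floor(responses, best_level=1, worst_level=5, threshold=0.15):
    """Flag ceiling/floor effects: the proportion of respondents at the
    best (ceiling) or worst (floor) level on ALL items, compared with
    the 15% threshold. `responses` has shape (n_children, n_items)."""
    r = np.asarray(responses)
    ceiling = np.mean((r == best_level).all(axis=1))
    floor = np.mean((r == worst_level).all(axis=1))
    return {"ceiling": ceiling, "ceiling_effect": ceiling > threshold,
            "floor": floor, "floor_effect": floor > threshold}

# Hypothetical data: 4 children x 3 items, 5-level items
print(ceiling_floor([[1, 1, 1], [1, 2, 1], [1, 1, 1], [3, 4, 5]]))
# ceiling = 0.5 -> ceiling effect flagged; floor = 0.0
```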

2.4.2 Test–Retest Reliability

Test–retest reliability was assessed by comparing instrument total scores between initial and follow-up measurements for participants who reported no change in health and were allocated to receive their first reminder for the follow-up survey at 2 days. Only participants in the online panel general population sample were allocated to receive the follow-up survey at 2 days. Test–retest reliability was assessed using intraclass correlation coefficient (ICC) estimates and corresponding 95% confidence intervals. ICC estimates were calculated based on an absolute-agreement, two-way mixed-effects model [26]. As per Koo and Li (2016), an ICC of < 0.5 indicates poor reliability, 0.50–0.74 moderate reliability, 0.75–0.90 good reliability, and > 0.90 excellent reliability [26]. An ICC ≥ 0.5 (moderate reliability) was considered acceptable test–retest reliability. The primary analysis used the Koo and Li (2016) thresholds [26]; however, it is acknowledged that other thresholds for interpreting ICC results exist. Cicchetti (1994) thresholds were applied in a sensitivity analysis; these state that an ICC of < 0.4 indicates poor agreement, 0.40–0.59 fair agreement, 0.60–0.74 good agreement, and ≥ 0.75 excellent agreement [27].
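The study’s analysis was run in Stata; as an illustrative sketch only, the following Python code computes an absolute-agreement ICC for single scores from two-way ANOVA mean squares (ICC(2,1) in Shrout and Fleiss’s notation), on synthetic data. The exact estimator options used in Stata may differ.

```python
import numpy as np

def icc_absolute_agreement(scores):
    """Absolute-agreement ICC for single scores from a two-way model.
    `scores` has shape (n_subjects, k_measurements), e.g., an LSS at
    the initial survey and the 2-day retest."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    msr = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)  # subjects
    msc = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)  # occasions
    sse = np.sum((x - grand) ** 2) - msr * (n - 1) - msc * (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))

def interpret_icc(icc):
    """Koo and Li (2016) categories as used in the paper."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"

# Hypothetical test-retest LSS for 5 children
scores = [[10, 11], [14, 13], [20, 19], [8, 9], [15, 15]]
icc = icc_absolute_agreement(scores)
print(round(icc, 2), interpret_icc(icc))  # ~0.98, 'excellent'
```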

2.4.3 Known-Group Validity

Known-group validity was assessed by comparing groups with expected differences in HRQoL, which were set a priori by the study team [19]. Group differences were assessed by comparing the mean instrument total score for each group, and effect sizes were estimated using Cohen’s d [28]. Effect sizes of 0.2–0.49 were considered small, 0.5–0.79 moderate, and ≥ 0.8 large [28, 29]. A mean difference with a p value of < 0.05 and a large effect size (≥ 0.8) was considered acceptable. Children with a special healthcare need were considered a known group hypothesised to have differences in HRQoL compared with children without special healthcare needs [30]. Additionally, sensitivity analyses were conducted on other known groups: children with a chronic health condition, EQ VAS score ≤ 80 [31], PedsQL total score ≤ 69.7 (one standard deviation below the child self-reported population mean for children aged 5–18 years), and PedsQL total score ≤ 74.2 (child self-reported mean from a sample of children with chronic conditions) [20]. PedsQL known-group cut points were not used to assess the known-group validity of the PedsQL itself and were only used to assess the known-group validity of the other instruments.
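A minimal sketch of the effect-size calculation, assuming Cohen’s d with a pooled standard deviation (the group scores below are hypothetical):

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d using a pooled standard deviation."""
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def interpret_d(d):
    """Thresholds used in the paper: 0.2 small, 0.5 moderate, 0.8 large."""
    d = abs(d)
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "moderate"
    if d >= 0.2:
        return "small"
    return "negligible"

# Hypothetical LSS for children with vs without a special healthcare need
shcn = [11, 12, 9, 13, 10, 12]
no_shcn = [6, 7, 5, 6, 8, 7]
d = cohens_d(shcn, no_shcn)
print(round(d, 2), interpret_d(d))  # a large effect in this toy example
```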

2.4.4 Convergent and Divergent Validity

The assessment of an instrument’s convergent or divergent validity usually requires a ‘gold standard’ against which to compare, to see how much another instrument converges with or diverges from it. Although there is currently no gold standard instrument for measuring quality of life in children, the PedsQL is a very commonly used instrument that has undergone extensive content validity testing [20, 23]; hence, the PedsQL was chosen as the comparator instrument for assessing convergent and divergent validity. Convergent and divergent validity were assessed by correlating each item in the EQ-5D-Y-3L, EQ-5D-Y-5L, CHU9D, AQoL-6D, and HUI3 with each item and domain in the PedsQL. Correlations were calculated using Spearman’s correlation, as data were not normally distributed. Correlations of 0.1–0.29 were considered weak, 0.3–0.49 moderate, and ≥ 0.5 strong [28]. Through an a priori consensus approach, members of the study team reviewed each instrument item combination and hypothesised whether the item of one instrument would be at least moderately correlated with a PedsQL item (to assess convergence) or not correlated at all with a PedsQL item (to assess divergence) [19]. Hypotheses were based on similarity (convergence) or dissimilarity (divergence) of item wording [19]. Convergent validity was evaluated as the proportion of an instrument’s items hypothesised to be at least moderately correlated with PedsQL items that showed at least a moderate correlation. Divergent validity was evaluated as the proportion of an instrument’s items hypothesised not to be correlated with PedsQL items that showed at most a weak correlation.
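A minimal sketch of the hypothesis check, assuming scipy’s Spearman correlation and the paper’s 0.3 cut point; the item response vectors, item names, and hypothesis labels are hypothetical.

```python
from scipy.stats import spearmanr

def test_item_hypothesis(instrument_item, pedsql_item, hypothesis):
    """hypothesis: 'convergent' (expect |rho| >= 0.3, at least moderate)
    or 'divergent' (expect |rho| < 0.3, at most weak)."""
    rho, p_value = spearmanr(instrument_item, pedsql_item)
    if hypothesis == "convergent":
        supported = abs(rho) >= 0.3
    else:
        supported = abs(rho) < 0.3
    return rho, p_value, supported

# Hypothetical item-level responses for 8 children
eq5dy_mobility = [1, 2, 1, 3, 2, 1, 4, 2]
pedsql_walking = [0, 1, 0, 2, 2, 0, 3, 1]  # similar wording -> convergent
pedsql_worry = [2, 1, 0, 2, 0, 3, 1, 3]    # dissimilar wording -> divergent
print(test_item_hypothesis(eq5dy_mobility, pedsql_walking, "convergent"))
print(test_item_hypothesis(eq5dy_mobility, pedsql_worry, "divergent"))
```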

2.4.5 Responsiveness

Responsiveness was assessed by comparing the mean difference in total instrument score between the initial and follow-up surveys for children whose caregiver reported a change in the child’s health between the two surveys. Analysis focused on participants allocated to receive the follow-up survey at 4 weeks. Responsiveness was assessed by comparing the mean total score at the initial and follow-up surveys using a paired t-test. A mean difference in the expected direction with a p value of < 0.05 was considered acceptable and was used as the main indicator of responsiveness. Responsiveness was also assessed by calculating the standardised response mean (SRM), the ratio of the mean change to the standard deviation of that change, to provide a more detailed picture of instrument responsiveness [32] (a computational sketch is provided after the list below). An SRM of 0.2–0.49 was considered small, 0.5–0.79 moderate, and ≥ 0.8 large [28, 29, 32]. Caregivers were asked to report their child’s change in health in the follow-up survey. Change in health was calculated as follows:

  1. Change in general health was reported as (1) much better, (2) somewhat better, (3) about the same, (4) somewhat worse, or (5) much worse. Responses were split into two categories for analysis: ‘much better’ and ‘somewhat worse and much worse’.

  2. For participants who reported a health condition in the initial survey, caregivers were asked to report their child’s change in ‘main health condition’. The same categorisation used for change in general health was applied.

Responsiveness was only assessed in those who reported a change in health as ‘much better’, rather than ‘somewhat or much better’, as it was felt that a more stringent classification would provide a clearer indication that a change in health had occurred. Due to the small number of children whose health worsened, the same stringent classification was not possible for worsening health, and ‘somewhat worse’ and ‘much worse’ were pooled together.
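As referenced above, a minimal sketch of the paired t-test and SRM calculation in Python (the paper’s analysis was run in Stata; the scores below are hypothetical):

```python
import numpy as np
from scipy.stats import ttest_rel

def responsiveness(initial, follow_up):
    """Paired t-test on total scores plus the standardised response
    mean (SRM = mean change / SD of change)."""
    initial = np.asarray(initial, float)
    follow_up = np.asarray(follow_up, float)
    change = follow_up - initial
    t_stat, p_value = ttest_rel(follow_up, initial)
    srm = change.mean() / change.std(ddof=1)
    return {"mean_change": change.mean(), "p": p_value, "srm": srm}

# Hypothetical LSS for children reported as 'much better' at follow-up
# (a lower LSS means fewer problems, so improvement lowers the score)
initial = [14, 12, 16, 13, 15, 17, 12, 14]
follow_up = [12, 11, 13, 12, 13, 14, 12, 12]
print(responsiveness(initial, follow_up))
```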

A sensitivity analysis was conducted whereby the responsiveness analysis described above was repeated in only participants recruited via hospital (Sample 1), as this sample had a higher follow-up survey response rate compared with other samples.

2.4.6 Summary of Psychometric Performance

The psychometric performance of all instruments was summarised by categorising each instrument as (1) having significant evidence of good performance (tick), (2) having significant evidence of poor performance (cross), or (3) having inconclusive evidence of performance (question mark) for each psychometric attribute assessed. Significant evidence of good performance (tick) for each psychometric attribute was based on the following thresholds:

  • Response distribution (no ceiling effect): < 15% of participants with a special healthcare need report the lowest severity or frequency level (e.g., ‘no problems’) across all instrument items.

  • Test–retest reliability: moderate, good, or excellent agreement (ICC ≥ 0.5).

  • Known-group validity: mean difference with a p value of < 0.05 and large effect size (Cohen’s d ≥ 0.8).

  • Convergent and divergent validity: items at least moderately correlated (Spearman’s correlation ≥ 0.3) with other instrument items where hypothesised to be correlated (convergent validity), and weakly correlated (Spearman’s correlation < 0.3) where hypothesised not to be correlated (divergent validity).

  • Responsiveness: significant mean difference (p value < 0.05).

An instrument was considered to have inconclusive evidence for a psychometric attribute if the sample size used to assess the psychometric attribute was too small (i.e., inadequate or doubtful according to the 2019 Consensus-based Standards for the selection of health Measurement Instruments [COSMIN] guidelines) [33], or the direction of evidence was unclear.
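A small sketch of how this three-way classification could be encoded; the function and argument names are illustrative and not from the study’s code.

```python
def classify(meets_threshold, sample_adequate, direction_clear=True):
    """Summarise one psychometric attribute as tick / cross / question
    mark, mirroring the rules above. Inputs are illustrative booleans:
    e.g., sample_adequate=False when the COSMIN sample-size rating is
    inadequate or doubtful."""
    if not sample_adequate or not direction_clear:
        return "?"  # inconclusive evidence
    return "tick" if meets_threshold else "cross"

# e.g., an ICC of 0.64 (moderate, >= 0.5) in an adequate sample -> tick
print(classify(meets_threshold=0.64 >= 0.5, sample_adequate=True))
```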

3 Results

3.1 Participant Characteristics

Supplementary Fig. 1 summarises the participant flow (see ESM). The sample characteristics of the 5945 children aged 5–18 years and their caregivers who completed the P-MIC initial survey are presented in Table 1.

3.2 Response Distribution

Supplementary Figs. 1a–7d (see ESM) summarise the item response distribution for each instrument by child age, report type, and health status. Table 2 summarises instrument ceiling effects. Only the EQ-5D-Y-3L had a ceiling effect in the overall sample, with more than 15% of children with a special healthcare need reporting the lowest severity level (i.e., ‘no problems’) across all items. No instruments had a floor effect.

Table 2 Ceiling effects of each instrument, by child age and report type

Minimal variation in ceiling effects was observed by subgroup (child age and report type) for all instruments except the HUI3, which had a higher ceiling effect when proxy reported (12.7%) than when self-reported by the child (5.0%).

3.3 Test–Retest Reliability

The mean interval between initial and follow-up survey completion for participants included in the test–retest analysis was 4.6 days. Table 3 summarises the results of the test–retest analyses, showing the ICC for each instrument total score (calculated using an exploratory LSS approach) by child age and report type for participants who were allocated to complete the follow-up survey at 2 days, completed the follow-up survey, and reported no change in health. In the total sample, the PedsQL, EQ-5D-Y-3L, and EQ-5D-Y-5L demonstrated ‘good’ agreement (ICC 0.75–0.90), with 95% confidence intervals within the ‘moderate’, ‘good’, and ‘excellent’ agreement range (ICC ≥ 0.5). The CHU9D demonstrated ‘moderate’ agreement (ICC 0.64), with a 95% confidence interval within the ‘moderate’ range (ICC 0.52–0.74). Sample sizes for the AQoL-6D and HUI3 may be too small to robustly assess test–retest reliability.

Table 3 Test–retest reliability of each instrument, by child age and report type

Minor variation in performance was observed by subgroup. Except for the EQ-5D-Y-5L when proxy reported (ICC 0.48) and the CHU9D in children aged 13–18 years (ICC 0.31), all instruments demonstrated at least ‘moderate’ agreement in all subgroups assessed (ICC ≥ 0.5).

Additionally, when alternate ICC thresholds from Cicchetti (1994) were applied in a sensitivity analysis, interpretation of results varied slightly, with instruments interpreted as having better agreement using the Cicchetti (1994) thresholds [27]. In the total sample, the CHU9D was considered to have ‘good’ agreement under the Cicchetti (1994) thresholds [27], compared with ‘moderate’ under the Koo and Li (2016) thresholds [26]. Again, in the total sample, the PedsQL, EQ-5D-Y-3L, and EQ-5D-Y-5L were considered to have ‘excellent’ agreement under the Cicchetti (1994) thresholds [27], compared with ‘good’ under the Koo and Li (2016) thresholds [26].

3.4 Known-Group Validity

Table 4 summarises the known-group validity for children with and without special healthcare needs using the total score of each instrument. All instrument total scores (calculated using an exploratory LSS approach) demonstrated statistically significant differences with large effect sizes across all known groups. This finding was consistent across child age groups and report types (see Supplementary Table 2 in the ESM), as well as the additional known groups tested (children with a chronic health condition, VAS score ≤ 80, PedsQL total score ≤ 69.7, and PedsQL total score ≤ 74.2) (see Supplementary Table 3 in the ESM).

Table 4 Known-group validity of each instrument

3.5 Convergent and Divergent Validity

Supplementary Tables 4–8 provide the Spearman correlations for all instruments compared with the PedsQL at the item level. These tables demonstrate that most instruments, when correlated with the PedsQL at the item level, had > 50% of items moderately or strongly correlated, except for the HUI3, which had the fewest items correlated with the PedsQL. Almost all instrument items hypothesised to have at least a moderate correlation with a PedsQL item did, indicating convergent validity of all instruments. Additionally, almost all instrument items hypothesised not to be correlated with a PedsQL item were only weakly correlated, indicating divergent validity of all instruments.

3.6 Responsiveness

Table 5 summarises the responsiveness of each instrument’s total score (calculated using an exploratory LSS approach) for children whose caregiver reported a change in their general health or main health condition between the initial and follow-up surveys. Where caregivers reported that their child’s general health or main health condition was ‘much better’ at the follow-up survey compared with the initial survey, the EQ-5D-Y-3L, EQ-5D-Y-5L, and CHU9D total instrument scores all demonstrated a significant mean difference between the initial and follow-up surveys in the expected direction, albeit with small effect sizes. Where caregivers reported that their child’s general health or main health condition was ‘somewhat worse or much worse’ at the follow-up survey compared with the initial survey, the PedsQL, EQ-5D-Y-3L, EQ-5D-Y-5L, and CHU9D total instrument scores all demonstrated a significant mean difference between the initial and follow-up surveys in the expected direction, with small effect sizes. Sample sizes for the AQoL-6D and HUI3 may be too small to robustly assess responsiveness.

Table 5 Responsiveness of each instrument, by change in general health and change in main health condition

Sensitivity analyses were conducted using only the sample of children recruited via hospital (Sample 1), due to the lower follow-up response rates in the online panel samples (Samples 2 and 3) (see Supplementary Table 9 in the ESM for baseline characteristic differences between 4-week completers and non-completers); these results are presented in Supplementary Table 10 (see ESM). Results from the sensitivity analysis are consistent with the main analysis, although effect sizes were larger with respect to responsiveness to improved health.

Responsiveness results varied by subgroup; these results are presented in Supplementary Table 11 (see ESM). The EQ-5D-Y-5L and CHU9D were the only instruments to demonstrate responsiveness, in terms of a significant mean difference, across all subgroups for improving health (including main health condition and general health). The EQ-5D-Y-5L was the only instrument to demonstrate responsiveness, in terms of a significant mean difference, across all subgroups for worsening health (including main health condition and general health). For the subgroup of children aged 13–18 years with worsening general health, and the proxy-report subgroup for both worsening main health condition and general health, sample sizes may be too small to robustly assess responsiveness.

3.7 Summary of Psychometric Performance

Table 6 summarises the performance of all instruments for all psychometric analyses completed. In the total sample, the EQ-5D-Y-5L and CHU9D demonstrated evidence of good performance, meeting the prespecified cut-offs, for all psychometric analyses completed. The PedsQL also demonstrated evidence of good performance, with the only exception being a lack of responsiveness to improving health; likewise, the EQ-5D-Y-3L demonstrated evidence of good performance, with the only exception being ceiling effects. The AQoL-6D and HUI3 demonstrated evidence of good performance for all psychometric attributes able to be assessed; however, due to sample size limitations, conclusions regarding test–retest reliability and responsiveness could not be drawn.

Table 6 Summary of instrument psychometric performance, by child age and report type

Additionally, results for ceiling effects, test–retest reliability, and responsiveness varied by child age and report type.

4 Discussion

We have generated new evidence on the comparative psychometric performance of the PedsQL, EQ-5D-Y-3L, EQ-5D-Y-5L, CHU9D, AQoL-6D, and HUI3 descriptive systems by child age (5–12 years vs 13–18 years), report type (self- vs proxy report), and child health status. The EQ-5D-Y-5L and CHU9D demonstrated evidence supporting their psychometric performance in the total sample, based on prespecified criteria, for all assessments conducted: ceiling and floor effects, test–retest reliability, known-group validity, convergent and divergent validity, and responsiveness. Only the EQ-5D-Y-3L had evidence of a ceiling effect among children with special healthcare needs; however, it is important to note this ceiling effect was mild, with the proportion only just over the 15% threshold (17.8%). No instruments had floor effects. The PedsQL, EQ-5D-Y-3L, EQ-5D-Y-5L, and CHU9D demonstrated moderate to excellent test–retest reliability in the total sample. All instruments demonstrated known-group validity with large effect sizes. All instrument items were correlated with PedsQL items in the hypothesised directions, demonstrating convergent and divergent validity. The EQ-5D-Y-3L, EQ-5D-Y-5L, and CHU9D demonstrated some responsiveness to improvements in health, with small effect sizes; the PedsQL, EQ-5D-Y-3L, EQ-5D-Y-5L, and CHU9D demonstrated some responsiveness to worsening health, with small effect sizes. Importantly, ceiling effect, test–retest reliability, and responsiveness results varied by child age and report type.

The P-MIC study builds on previous research comparing adult HRQoL instruments in a Multi-Instrument Comparison (MIC) study [34], and is the first of its kind worldwide to collect common paediatric HRQoL instruments head-to-head concurrently across a large and diverse sample. It allows for a direct comparison of generic instruments across a range of child ages and health statuses within the same dataset, which is currently missing from the literature [10, 11]. The P-MIC study was designed with strong sample quality assessment procedures, resulting in high-quality data from both hospital and online samples to enable robust conclusions [18, 35].

Ceiling effects are statistically more likely with shorter instruments and shorter recall periods, which likely explains the higher ceiling effects observed for the EQ-5D-Y-3L in this study [36]. These findings are consistent with previous studies [37,38,39,40]. In adults, the EQ-5D-5L is preferred over the EQ-5D-3L given its reduction in overall ceiling effects [41]. Although a previous study noted no difference in ceiling effects between the EQ-5D-Y-3L and EQ-5D-Y-5L [38], this study demonstrated that the EQ-5D-Y-5L had slightly lower overall ceiling effects than the 3L (14.8% vs 17.8%) in the total sample, which is consistent with the finding of a recent systematic review of EQ-5D-Y performance [42]. Information on ceiling effects can be useful when informing the choice of instrument for specific samples; for instance, future instrument users may wish to avoid instruments with stronger ceiling effects, such as the EQ-5D-Y-3L, in relatively well populations of children, where there is a higher risk of very large ceiling effects.

This study found evidence of good test–retest reliability for the PedsQL, EQ-5D-Y-3L, and EQ-5D-Y-5L, and evidence of moderate test–retest reliability for the CHU9D, in the total sample. A previous review found no significant evidence of test–retest reliability for the CHU9D and mixed evidence of test–retest reliability for the EQ-5D-Y-3L and HUI3 [10]. That review included only one study assessing test–retest reliability of the CHU9D, which assessed the test–retest reliability of the dimensions rather than the whole instrument; hence, its results may not be comparable to this study [43]. However, other studies have reported significant evidence of test–retest reliability for the CHU9D in China and Sweden [44, 45]. This study also found some instruments demonstrated better test–retest reliability than others in different subgroups, with only the PedsQL and EQ-5D-Y-3L demonstrating evidence of good or excellent test–retest reliability across all subgroups. These results highlight the importance of assessing the psychometric properties of instruments across a wide range of participants, as instrument performance can vary depending on the population in which it is assessed. Additionally, given the lack of evidence previously available regarding test–retest reliability [9, 14, 15], the newly generated evidence from this study helps to fill an important gap in the literature.

Although almost all instrument items hypothesised to be at least moderately correlated with PedsQL items were at least moderately correlated, indicating convergent validity of instruments, not all were strongly correlated. Our a priori hypotheses were driven by the wording of each item; thus, moderate rather than strong correlations might be observed due to other differences between the items, for example, the response options (severity vs frequency), the item format (statements vs Likert scale items), and the recall period (today vs 1 week vs 1 month). Additionally, instrument items hypothesised not to be at least moderately correlated with PedsQL items were found largely not to be correlated, indicating divergent validity of instruments. This may be due to instruments measuring different aspects of HRQoL to the PedsQL. For example, the HUI3 had the fewest items correlated with PedsQL items, and this finding is consistent with expectations, as the HUI3 had the fewest items hypothesised to be correlated with PedsQL items. It is likely that the HUI3 measures different aspects of HRQoL to the PedsQL, such as vision, hearing, speech, and dexterity. This has been explored in more detail in a companion paper assessing the dimensionality of instruments.

In the total sample, the EQ-5D-Y-3L, EQ-5D-Y-5L, and CHU9D were responsive to improvements in health, and the PedsQL, EQ-5D-Y-3L, EQ-5D-Y-5L, and CHU9D were responsive to worsening health; however, despite demonstrating significant mean differences, effect sizes were small. When the responsiveness analysis was restricted to participants recruited via hospital (Sample 1), results were consistent with those from the main analysis; however, effect sizes were slightly larger with respect to responsiveness to improved health, although still in the small to moderate range. This may be because participants recruited via hospital (Sample 1) reported poorer HRQoL in the initial survey than participants recruited from the online panel (Samples 2 and 3), and hence may have had more capacity for improved HRQoL at the follow-up survey. A previous review of common paediatric HRQoL instruments found evidence of responsiveness for the EQ-5D-Y-3L and mixed evidence of responsiveness for the CHU9D and HUI3 [10]. This finding is consistent with that of a more recent review that also found mixed evidence of responsiveness for the CHU9D and HUI3 [9]. The more recent review also found evidence of responsiveness for the EQ-5D-Y-5L [9]; however, this was from only one study. In the P-MIC study, researchers were reliant on a change in health happening naturally between the two surveys, as well as on caregivers being able to accurately report this change. This is a less clear-cut definition of change than one based on a clinical intervention or external clinical information, which may be why the responsiveness effect sizes here are smaller than those previously published. Previous literature assessing responsiveness is limited [14, 15], meaning that the new knowledge generated through this study is a significant addition.

This study has several limitations. Firstly, sample sizes for the AQoL-6D and HUI3 were too small to robustly assess test–retest reliability and responsiveness; hence, these results have not been presented. This limitation arises because these instruments were presented in randomised blocks to reduce responder burden. Secondly, the reliance on online report means that we do not have any non-participant-reported information, such as clinical tests or clinical reports; hence, we have no external indicator against which to compare the responses given by participants. We dealt with this using strong data quality control procedures [18], but acknowledge the uncertainties of online data collection. Thirdly, the use of the LSS to assess known-group validity, test–retest reliability, and responsiveness is a limitation. Although the total score on the PedsQL is an accepted way to score responses [23], using the LSS for the EQ-5D-Y-3L, EQ-5D-Y-5L, HUI3, and CHU9D is more challenging. The LSS is a crude measure and may miss the complexity and relative importance of the different dimensions and levels that are captured by utility scores [46]. Future work on the P-MIC data will explore the psychometric performance of instruments using preference weights; the LSS was used here due to the lack of established preference weights for all instruments included in this study. Further to this point, a limitation for the implementation of these results is that not all instruments have value sets available to enable scoring for use in economic evaluations, or these are not available across many countries, precluding their use in country-specific contexts; this is an area for future research. There is a mismatch between the strongest psychometric performance and the availability of value sets, which will practically limit those wishing to use instruments to produce utilities. For example, the EQ-5D-Y-5L was considered a top-performing instrument in this study; however, there are currently no value sets available for this instrument, which limits its applicability in economic evaluation. Fourthly, the P-MIC study includes a select range of commonly used generic paediatric HRQoL instruments; although as many instruments were included as was feasible, it is important to acknowledge that other instruments exist that are not captured in this study. Finally, although this study provides evidence of performance in an Australian population, it is unknown whether these results are generalisable across countries.

This study addresses major gaps in the current evidence for child HRQoL instruments and provides an overview of the performance of commonly used child HRQoL instruments; however, several gaps remain and should be the focus of future research: assessment of performance by more granular child ages, assessment of performance using utility scores (once available), assessment of performance by different child health conditions, test–retest reliability and responsiveness of the AQoL-6D and HUI3, and assessment of performance in different countries.

5 Conclusion

The results presented here have implications for the choice and use of instruments in the design of studies, and for the use of HRQoL data in decision making. In the total sample, the EQ-5D-Y-5L and CHU9D descriptive systems demonstrated significant evidence of good performance, based on prespecified criteria, across all psychometric attributes tested: ceiling and floor effects, test–retest reliability, known-group validity, convergent and divergent validity, and responsiveness (improving and worsening health). Additionally, the PedsQL, EQ-5D-Y-3L, AQoL-6D, and HUI3 descriptive systems performed well on most psychometric attributes evaluated in the total sample. Importantly, instrument performance varied by child age and report type for ceiling effect, test–retest reliability, and responsiveness results; hence, future instrument users should consider their intended population and study design when selecting an instrument. For example, comparing the CHU9D and EQ-5D-Y-5L: the EQ-5D-Y-5L did not demonstrate acceptable test–retest reliability when proxy reported, the CHU9D did not demonstrate acceptable test–retest reliability in children aged 13–18 years, and the CHU9D did not demonstrate significant responsiveness to worsening general health in children aged 5–12 years. Furthermore, different instruments may capture different aspects of HRQoL, and including only one instrument may leave important aspects of HRQoL unmeasured. The choice of instrument in studies may be guided by the population being assessed and the aspects of HRQoL expected to be impacted by a condition or treatment. Those designing studies may consider including multiple instruments to ensure a more holistic assessment of HRQoL and to enable sensitivity analysis of HRQoL inputs in economic evaluation.