Key Points for Decision Makers

Our results indicate that the CHU9D, PedsQL, EQ-5D-Y-3L and EQ-5D-Y-5L perform equally well in children and adolescents with ADHD or anxiety and/or depression, regarding acceptability/feasibility, known-group and convergent validity. However, relative strengths of the CHU9D and PedsQL were observed regarding their lack of ceiling effects, greater test–retest reliability, and consistently good performance across all psychometric properties. The CHU9D and EQ-5D-Y-3L were the most responsive to improvements in health status.

Careful consideration of the choice of instrument is advised, as the top performing instrument varied across the psychometric properties examined, and within some subgroups. The choice of which instrument is best to use may differ depending on the intended use of the data, and the age, gender, report type and type of mental health condition of the population in which the instrument is being used.

1 Introduction

Mental health and substance use disorders are the leading cause of disability in children and adolescents globally, accounting for a quarter of all years lived with a disability [1]. In Australia, attention-deficit/hyperactivity disorder (ADHD), conduct disorders, anxiety and depression are the most common conditions affecting children aged 4–11 years [2]. It is well recognised that mental health problems early in life have a substantial impact on health-related quality of life (HRQoL) in childhood and adolescence [3, 4]. HRQoL is a multi-dimensional construct that captures the impact of health status on different aspects of physical, social and psychological functioning, either through self- or proxy report [5]. Information on children’s HRQoL can be used in research and clinical settings to compare the relative impact of various health conditions and treatments on children’s lives; identify groups of children with the greatest need; and aid in clinical decision making and treatment planning [5]. HRQoL has also been endorsed as important to include in an international overall paediatric health standard set (OPH-SS) of outcome measures in children and young people of all ages, and across all health conditions [6]. In addition to its clinical importance, HRQoL information holds significant value in health policy decision making. Most prominently, it is fundamental to the calculation of quality-adjusted life-years (QALYs), which combine measures of the quality and quantity of life into a single metric. Whilst a vast range of instruments exist to measure HRQoL [7], generic preference-weighted [8, 9] measures are preferred in economic analyses due to their ability to generate a single, weighted, summary metric, and facilitate comparisons across different types of conditions.
The incorporation of valid and reliable HRQoL information within economic evaluations is a priority to ensure informed decisions are made regarding the comparative value of proposed or existing interventions. Ultimately, this would contribute to a health system that maximises outcomes for children, and maximises the value of children’s mental health care. Problematically, very little evidence exists of the validity and reliability of the generic HRQoL instruments that would enable this kind of evaluation in child and adolescent mental health settings.

Perhaps contributing to this dearth of information are the myriad challenges that arise when measuring HRQoL in children and adolescents and specifically in those with mental health challenges. Comprehensive discussions of these issues have been published previously [10,11,12] and include the reliance on proxy reports of HRQoL in very young children or those unable to self-report; the inconsistency between child self-report and proxy-reported HRQoL; and the vast differences in developmental stage in the ages 0–18 years. Additionally, adult evidence has indicated that commonly used HRQoL instruments such as the EQ-5D and the SF-6D can be valid and reliable in general populations, yet show variable performance across different mental health populations [13, 14], where better performance has been observed in anxiety and depression than schizophrenia or bipolar disorder. In children and adolescents with mental health problems, these measurement complexities are compounded, requiring specific research efforts to determine the validity and reliability of HRQoL instruments in children of different ages, and across different mental health conditions.

A review by Mierau et al. [15] compared 22 generic HRQoL instruments using existing published literature, with the aim of determining suitable instruments for use in economic evaluation of child and adolescent mental health care. The authors concluded that none of the included instruments were ‘perfect’ for this use, based on each instrument’s level of psychometric research/evidence; availability of a proxy version; suitability for young children (<8 years); availability of an age-specific value set for children under 18 years; and degree of focus on mental-health-related domains. Of the existing instruments, the highest rated on the authors’ scale was the CHU9D, followed by the EQ-5D-Y-3L and the PedsQL. However, as the authors noted, there is limited evidence of the validity and reliability of these instruments, and other commonly used HRQoL instruments, in children and adolescents with mental health challenges.

The limitations of existing evidence are made clear by three recent systematic reviews [16,17,18] of the psychometric performance of generic preference-weighted HRQoL instruments for children. A review by Kwon et al. [18] provides a summary of existing psychometric evidence; however, performance of the instruments in mental health populations was not reported separately from general populations, leaving performance in children with mental health conditions unclear. Sequential reviews by Rowen et al. [16] and Tan et al. [17] found that only three, and subsequently four, studies had examined instrument performance in children with mental health difficulties. Since publication of these reviews, two further studies have been published in this area. Of the six existing studies, three examined only a single instrument. One study [19] examined convergent validity of the CHU9D in Australian children aged 5–17 years (n = 200) receiving mental health services, and another [20] examined construct validity and responsiveness of the CHU9D in Danish children aged 6–15 years (n = 396) with emotional or behavioural disturbances. Neither of these studies examined the test–retest reliability of the CHU9D. Another [21] examined acceptability and construct validity of the EQ-5D-Y-5L in a small sample (n = 52) of Swedish children aged 13–17 years with mixed mental health conditions. Responsiveness and test–retest reliability were not assessed. From these studies, it remains unclear how the CHU9D and EQ-5D-Y-5L perform comparatively with other generic HRQoL instruments.

Three of the six previous studies conducted a multi-instrument comparison, though each has limitations. Two of these studies were conducted in the United States [22, 23] using an overlapping sample of adolescents aged 13–17 years (n = 392) with and without depression. These studies examined the known-group validity [22] and responsiveness [23] of the HUI2/3, the PedsQL, the adult EQ-5D-3L, SF-6D and Quality of Wellbeing Scale (QWB). These studies only examined a limited number of psychometric properties of the measures, and within a single clinical population, leaving much of their psychometric performance unknown in teens with depression, and equally unknown whether instrument performance differs for younger children, or those with different mental health challenges.

Most recently, Mihalopoulos et al. [24] conducted a multi-instrument comparison of paediatric HRQoL instruments in children with mental health challenges in a clinical sample (n = 426) of children aged 7–18 years. The sample largely comprised children with internalising disorders (i.e. anxiety or depression, 75.8%), though with some externalising disorders (i.e. ADHD, conduct disorder, oppositional defiant disorder, 15.7%) or trauma/stress disorders (8.5%). This study examined the convergent validity and known-group differences based on severity of mental health of a range of generic paediatric HRQoL instruments (EQ-5D-Y-3L, HUI2/3, CHU9D, AQoL6D, and PedsQL). However, as the study was cross-sectional, the comparative performance of the instruments regarding responsiveness to change in health status and test–retest reliability could not be assessed, and these areas remain a significant gap in this literature [18]. None of the existing studies examined whether performance of the HRQoL instruments varied by child sex.

There remains a significant gap in our understanding of the comparative psychometric performance of available generic HRQoL measures that may be suitable for use in children and adolescents with mental health challenges. Assessing the validity and reliability of these instruments in the most common mental health conditions is a priority to ensure the health resources currently being directed to these health conditions are appropriate and are producing the best possible outcomes for these children given the available healthcare funding and resources.

The aim of this study was to address this evidence gap using data from the Australian Paediatric Multi-Instrument Comparison (P-MIC) study—the largest of its kind, internationally. Specifically, we aimed to examine the relative psychometric performance (acceptability, validity, reliability and responsiveness) of a range of commonly used generic paediatric HRQoL instruments in a large sample of children with mental health challenges. We aimed to examine the comparative performance of the PedsQL, EQ-5D-Y-3L, EQ-5D-Y-5L and CHU9D overall, and across subgroups of (i) child age (4–6 years; 7–12 years; 13–18 years); (ii) gender; (iii) type of mental health condition (ADHD; anxiety and/or depression); and (iv) self- or proxy report. Where sample size allows, a secondary aim was to describe the performance of additional instruments (AQoL-6D and HUI3) across the same subgroups.

2 Methods

2.1 Study Design

Data were obtained from the Quality of Life in Kids: Key evidence to strengthen decisions in Australia (QUOKKA) Australian P-MIC study (data cut 2, dated 10 August 2022, see published technical methods [25]). The protocol and methods for the P-MIC study have been published in detail elsewhere [26, 27]. Briefly, the study involved the prospective and concurrent collection of a range of generic paediatric HRQoL instruments and condition-specific measures via an online survey. This was followed by a shorter follow-up survey at 4 weeks [25]. The P-MIC study received ethics approval via The Royal Children’s Hospital (RCH) Human Research Ethics Committee (HREC/71872/RCHM-2021) and was prospectively registered with the Australia New Zealand Clinical Trials Registry (ANZCTR) (ACTRN12621000657820). The study findings are reported in line with COSMIN guidelines [28].

2.2 Participants

Participants were a subset of the P-MIC study [26], recruited via an online panel survey company (PureProfile). The current study included the ADHD sample (aged 4–18 years) and the anxiety and/or depression sample (7–18 years). Different age ranges were used for each mental health sample due to the recommended age range for the condition-specific symptom measures relevant for each sample (see Sect. 2.3.2). Participants were included in these condition-specific samples if the caregiver answered ‘yes’ to the following questions: “Do you have a child aged 7–18 with anxiety or depression as diagnosed by a health professional?” or “Do you have a child aged 4–18 with [ADHD] as diagnosed by a health professional?”

Participants were either the child or their caregiver, as follows. Caregivers completed all sociodemographic measures. Children self-reported the HRQoL instruments if they were aged ≥ 7 years and were deemed by their caregiver to be capable of completing the survey. Alternatively, the caregiver completed all measures on behalf of the child (proxy report) if the child was (i) aged < 7 years, or (ii) aged ≥ 7 years but the caregiver deemed the child not able to complete the survey. Mental-health-specific measures were completed by the caregiver or child in the same way, with the exception of the ADHD-symptom measure, which only has a proxy report version available in this age group. Caregivers provided consent for their own participation and for their child, with additional consent provided by the child in instances of child self-report.
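The respondent rule described above can be written as a short function. This is a minimal sketch only: the function name and arguments are illustrative assumptions and are not taken from the P-MIC survey platform.

```python
def report_type(child_age: int, caregiver_deems_capable: bool) -> str:
    """Return who completes the HRQoL instruments for a given child.

    Hypothetical helper mirroring the rule in the text: self-report
    requires age >= 7 years AND the caregiver judging the child capable;
    every other case is completed by the caregiver as a proxy report.
    """
    if child_age >= 7 and caregiver_deems_capable:
        return "self"
    return "proxy"


# Example calls covering the three cases in the text.
young_child = report_type(5, True)          # aged < 7 years
older_not_capable = report_type(9, False)   # aged >= 7, deemed not able
older_capable = report_type(9, True)        # aged >= 7, deemed able
```

Note that the ADHD symptom measure is an exception to this rule, as it is proxy report only in this age group.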

2.3 Instruments and Measures

2.3.1 Sociodemographic Measures

Child age (in years) was used to form three age bands for subgroup analyses: 4–6 years; 7–12 years; and 13–18 years. These age bands were chosen to distinguish between the different age groups available for each mental health sample (ADHD: 4–18 years; anxiety/depression: 7–18 years); to align with the first instance of child self-report in this study (≥ 7 years); and to broadly align with the major developmental stages of childhood and adolescence (i.e. pre-school; primary school; and high school).


Child gender was used to describe the sample and to determine individual symptom severity cut points on mental-health-specific instruments as per published norms (see Sect. 2.3.2). Available published norms do not specify appropriate cut points for children and adolescents who identify as transgender, non-binary, gender fluid or those of undisclosed gender, and we did not feel it was appropriate to apply gendered norms in these instances. While these children (n = 16) are included in the wider sample, they are not included in the known-group analyses that require the use of these symptom severity cut points.


Special health care needs (SHCN) was indicated based on the parents’ response to a SHCN screening question: “Child has a condition which has lasted or is expected to last for at least 12 months which causes them to use medicine prescribed by a doctor (other than vitamins) or more medical care, mental health or educational services. Yes/no” [29]. Additional sociodemographic data were used solely to describe the sample, as shown in Table 1.

Table 1 Sample characteristics

2.3.2 Mental Health and HRQoL Instruments and Cut Points

Details of mental health and HRQoL instruments are available in Table 2. Symptom severity cut points were calculated for the mental health symptom measures, as follows, to facilitate known-group validity testing.

Table 2 Mental health and health-related quality of life instruments

Strengths and Difficulties Questionnaire (SDQ) The SDQ is a validated screening questionnaire used to assess a child’s emotional and behavioural wellbeing [30, 31]. All participants completed the SDQ at baseline. Australian norms exist for 4–6 year olds [32] and 7–18 year olds [33] to classify children based on severity. The total score was used to determine symptom severity cut points individually for each child based on their age, gender, and self/proxy report; the cut points were generally as follows: low level (< 12/40); borderline/query (12–16/40); or abnormal/of concern (17+/40).


Revised Children’s Anxiety and Depression Scale, Short form (RCADS-25) All participants within (only) the anxiety/depression sample completed the RCADS-25 at baseline. From the raw sum scores for the instrument, total T scores were calculated, adjusted for child gender, age and respondent using available syntax based on United States population norms [34]. These T scores were then used to classify children into three symptom levels: ‘normal’ (< 65); ‘borderline’ (65–69); and ‘clinical’ (≥ 70) [34].
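The conversion from a raw sum score to a T score, and the three symptom levels above, can be illustrated as follows. This is a sketch under stated assumptions: it uses the standard T score definition (mean 50, SD 10 in the reference population), the normative mean and SD in the example are placeholder values rather than the published norms [34], and treating non-integer T scores between 69 and 70 as ‘borderline’ is our assumption. The published scoring syntax should be used in practice.

```python
def t_score(raw_sum: float, norm_mean: float, norm_sd: float) -> float:
    """Standard T score: 50 + 10 * z, relative to a reference group."""
    return 50.0 + 10.0 * (raw_sum - norm_mean) / norm_sd


def rcads_level(t: float) -> str:
    """Classify a T score using the three levels reported in the text."""
    if t < 65:
        return "normal"
    if t < 70:  # assumption: values between 69 and 70 count as borderline
        return "borderline"
    return "clinical"


# Placeholder norms (NOT the published values): mean 20, SD 10 for the
# child's gender, age and respondent group.
example_t = t_score(raw_sum=30, norm_mean=20, norm_sd=10)
example_level = rcads_level(example_t)
```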


Strengths and Weaknesses of Attention-Deficit/Hyperactivity Disorder Symptoms and Normal Behavior Scale (SWAN) All participants within (only) the ADHD sample completed the SWAN at baseline. Z scores were calculated from the instrument raw sum scores, adjusted for child gender and age. We used a known-group cut point, based on a previously published threshold, of z > 0.74 [35]. This cut point has previously shown good sensitivity and specificity (AUC = 0.85–0.88) in a large general population sample of children aged 6–17 years (n = 15,560) [35]. This cut point was subsequently validated in a clinical sample [35], and found to be a more accurate classification than the cut point of 1.65 SD recommended by Swanson et al. [36]. As the sample of children used in this analysis only includes children with ADHD, the mean score will be much higher than that of a general population or mixed clinical sample [35]. Hence this standardised scoring method is used as a symptom severity cut point to identify the top 25% most severe cases from our sample.
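The z-score cut point can be sketched in the same way. The normative mean and SD arguments below are hypothetical inputs (the age- and gender-specific norms are not reproduced here). As a point of reference, under a standard normal distribution a threshold of z > 0.74 corresponds to roughly the upper 23% of scores, which the sketch computes directly.

```python
from statistics import NormalDist

SWAN_Z_CUT = 0.74  # previously published threshold cited in the text [35]


def swan_z(raw_sum: float, norm_mean: float, norm_sd: float) -> float:
    """Standardise a raw SWAN sum score against (hypothetical) norms."""
    return (raw_sum - norm_mean) / norm_sd


def above_cut(z: float) -> bool:
    """True when the z score exceeds the known-group cut point."""
    return z > SWAN_Z_CUT


# Share of a standard normal reference distribution above the cut point.
upper_tail = 1.0 - NormalDist().cdf(SWAN_Z_CUT)
```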

2.4 Instrument Completion

In response to patient feedback aiming to reduce response burden, not all respondents were offered all instruments. For detail on how instruments were chosen as the ‘main’ or ‘additional’ instruments for assessment within the QUOKKA study, please see Jones et al. (within this Special Supplement – reference to be updated in editing stages) and the QUOKKA Study Technical Methods Guide [25]. All participants aged 5–18 years completed all four main HRQoL instruments (PedsQL, EQ-5D-Y-3L, EQ-5D-Y-5L and CHU9D). Participants aged 4 years completed the PedsQL and the CHU9D, but were randomised to complete either the EQ-5D-Y-3L or the EQ-5D-Y-5L. All participants were then randomised to complete either the AQoL-6D or the HUI3 in addition to the four main instruments. All participants completed the SDQ and the mental health symptom measure relevant to their sample (i.e. either the RCADS-25 or the SWAN). See published technical methods [25] for further details on instrument randomisation.
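The allocation rules in this section can be summarised in a short sketch. The function name is hypothetical and the equal allocation probability is an assumption; the actual randomisation procedure is described in the technical methods [25].

```python
import random
from typing import List

MAIN_INSTRUMENTS = ["PedsQL", "EQ-5D-Y-3L", "EQ-5D-Y-5L", "CHU9D"]


def allocate_instruments(age: int, rng: random.Random) -> List[str]:
    """Return the HRQoL instruments offered to one participant."""
    if age == 4:
        # 4-year-olds: PedsQL and CHU9D plus ONE randomly chosen EQ-5D-Y.
        instruments = ["PedsQL", "CHU9D",
                       rng.choice(["EQ-5D-Y-3L", "EQ-5D-Y-5L"])]
    else:
        # Ages 5-18 years: all four main instruments.
        instruments = list(MAIN_INSTRUMENTS)
    # One additional instrument, assumed here to be chosen 50:50.
    instruments.append(rng.choice(["AQoL-6D", "HUI3"]))
    return instruments


rng = random.Random(0)  # seeded only to make the illustration repeatable
older_child = allocate_instruments(10, rng)
four_year_old = allocate_instruments(4, rng)
```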

2.5 Psychometric Analyses

See Table 3 for a description of each of the psychometric analyses performed; relevant thresholds for interpreting results; and a priori hypotheses. Analyses encompassed feasibility and acceptability; floor and ceiling effects; known-group validity; convergent validity; responsiveness; and test–retest reliability. All statistical analyses were conducted using Stata SE 16 (StataCorp, Texas, US). Statistical methods, subgroups and thresholds for interpretation were prespecified and are reported in a statistical analysis plan which is available in the technical methods paper [25].

Table 3 Description of psychometric analyses

2.6 Subgroup and Sensitivity Analyses

All validity, reliability and responsiveness analyses described in Table 3 were assessed using the combined sample, and within the following subgroups: (i) mental health condition (ADHD; anxiety and/or depression); (ii) age band (4–6 years; 7–12 years; 13–18 years); (iii) gender (male; female); and (iv) report type (self/proxy report). Note that subgroup analyses for gender are male/female only due to low sample size (n = 16) for transgender/non-binary/gender fluid/undisclosed children.

The preference-weighted HRQoL instruments were designed to be scored using preference weights to give a ‘utility score’ (ranging from 0 to 1) and are predominantly used in this way. However, in the absence of utility weights, for the purpose of preliminary assessment of psychometric properties, the instruments can be scored by summing the response score for each item to give a ‘level sum score’ (LSS) [9]. The LSS is the total score with equal weight for each item (e.g. for a child reporting no problems on the EQ-5D-Y-3L, this would be 1 + 1 + 1 + 1 + 1 = 5; see Table 2 for possible total sum score range for each instrument). Given the development of preference weights is still underway for the PedsQL and EQ-5D-Y-5L, to allow a comparison across all HRQoL instruments, our analyses use individual items or the instrument LSS. While this LSS may be useful in clinical settings, for descriptive systems it has limitations linked to the interpretation of equal sum scores that can be derived from quite different combinations of responses. Furthermore, preference elicitation studies have shown that in practice respondents place different importance on each item within HRQoL instruments. Despite these limitations, the LSS approach has recently been shown to form a strong Mokken scale [37] and was deemed by Feng et al. to be a meaningful measurement, particularly in samples with health conditions. Given the different interpretations arising from these two scoring approaches, a sensitivity analysis was performed that repeated the psychometric analyses—where appropriate—using utility scores as the outcome variable. This analysis was performed for instruments that have a currently available value set (i.e. the EQ-5D-Y-3L, CHU9D, AQoL-6D and HUI3). Further sensitivity analysis methods and results, including the value sets used for these analyses, are available in Supplementary Table S10 (see electronic supplementary material [ESM]).
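The LSS and its main limitation can be shown with a brief example. The first profile is the EQ-5D-Y-3L example given in the text; the other two response profiles are invented for illustration.

```python
def level_sum_score(item_levels):
    """Level sum score: unweighted sum of the response level per item."""
    return sum(item_levels)


# EQ-5D-Y-3L has five items, each scored 1 (no problems) to 3.
no_problems = [1, 1, 1, 1, 1]

# Two invented profiles: severe problems on the first item only,
# versus severe problems on the last item only.
profile_a = [3, 1, 1, 1, 1]
profile_b = [1, 1, 1, 1, 3]

lss_best = level_sum_score(no_problems)
lss_a = level_sum_score(profile_a)
lss_b = level_sum_score(profile_b)
```

Here `lss_best` is 5, matching the worked example in the text, while `lss_a` and `lss_b` are both 7 despite describing different health states. A preference-weighted utility score could assign these two profiles different values, which is the motivation for the sensitivity analysis.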

2.7 Instrument Comparison

A summary of results was undertaken to compare the relative performance of the PedsQL, EQ-5D-Y-3L, EQ-5D-Y-5L and CHU9D across the psychometric properties, within the combined sample and all subgroups. Data for the AQoL-6D and the HUI3 are presented in the ESM only due to a lower total sample size for these instruments which precludes a direct comparison of results. For details on how to interpret each cell, see footnotes for Table 4. A total performance score was calculated for each instrument, which weights each psychometric property equally; however, individual section scores can be referred to when a combined equally weighted summary is not considered helpful.

Table 4 Summary of psychometric performance of each health-related quality of life (HRQoL) instrument for the combined sample and by subgroup

3 Results

3.1 Sample Characteristics

Survey data were collected for a total of n = 1013 children and adolescents aged 4–18 years (mean age 11.5 years, SD 4.1). Surveys were completed largely by child self-report (n = 689, 68.0%), and approximately one-third by parent-proxy report (n = 324, 32.0%). A total of n = 533 children were included in the ADHD sample and n = 480 in the anxiety/depression sample. The sample included slightly more boys (n = 566, 55.9%) than girls (n = 431, 42.5%). The response rate for the follow-up survey at approximately 4 weeks was 28.0% (n = 284 completed surveys). A subset of participants completed the AQoL-6D (n = 330) or the HUI3 (n = 370). Sample characteristics for these additional instruments were similar to the total combined sample who completed all other instruments: age (m = 11.8, SD = 3.9, m = 10.8, SD = 4.4, respectively); self-report (69.7%, 62.1%, respectively); sample (ADHD 51.9%, anx/dep 48.1%; ADHD 56.0%, anx/dep 44.0%, respectively); gender (male 59.3%, male 54.0%, respectively). Sample characteristics for the total sample are described in Table 1.

The number of respondents who completed each instrument at baseline and follow-up, for the combined sample, and within each subgroup, is described in ESM Table S1. Partial completion was not permitted by the survey platform, such that no submitted surveys had missing data. Baseline clinical and demographic characteristics showed no significant differences between those who did and did not complete the follow-up, with the exception that a greater proportion of follow-up completers were from the anxiety/depression sample (n = 156, 54.9%) compared with non-completers (n = 324, 44.4%; p = 0.003). See ESM Table S2.
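The comparison of sample membership by follow-up status can be checked with a two-proportion z-test. This is a sketch under assumptions: the test used by the authors is not stated in this section, and the denominators are inferred from the reported figures (284 follow-up surveys, and 1013 − 284 = 729 children without one, of whom 324 were in the anxiety/depression sample).

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Two-sided z-test for a difference in proportions (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Anxiety/depression sample membership: with vs without follow-up.
z, p = two_proportion_z_test(156, 284, 324, 729)
```

Under these assumptions the test gives z of about 3.0 and a two-sided p value that rounds to 0.003, in line with the reported result.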

3.2 Psychometric Performance

3.2.1 Acceptability/Feasibility

There were no major differences observed in the acceptability/feasibility of the instruments. The majority of respondents (~ 70%) found all instruments ‘somewhat’ or ‘very’ easy to complete. In addition, ~ 20% of participants across all groups rated each instrument as ‘neither easy nor difficult’ to complete. The EQ-5D-Y-5L was most consistently rated as ‘somewhat’ or ‘very’ easy to complete (68.1–76.5%), followed by the CHU9D (67.0–75.2%). Ease of completion was similar between self- and proxy report for the EQ-5D-Y-3L, EQ-5D-Y-5L and CHU9D (all within 1.5%); however, the PedsQL was considered slightly easier to complete by proxy compared with self-report (70.4% vs 66.9%). For full results, see ESM Tables S3.1–S3.10.

3.2.2 Floor and Ceiling Effects

No floor effects were detected for any instrument in the combined sample or any subgroups; moreover, no respondents reported being in the worst possible health state on any instrument.

No ceiling effects were detected in the combined sample or within any subgroups for the PedsQL (proportion of respondents in best possible health state, all < 1%) or CHU9D (all < 4%). In contrast, for the EQ-5D-Y-5L, ceiling effects were detected in the ADHD sample (18.5%), 4- to 6-year-olds (16.0%), boys (15.9%), and by proxy report (17.3%). Ceiling effects were also detected in these subgroups using the EQ-5D-Y-3L, with further ceiling effects identified for the combined sample (17.9%), 7- to 12-year-olds (17.1%), 13- to 18-year-olds (18.7%) and by self-report (16.9%). Where ceiling effects were apparent, they were more likely to occur in the following subgroups: ADHD, aged 4–6 years, boys, and by proxy report. For full results, see ESM Table S4.
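The ceiling effect check reduces to the proportion of respondents reporting the best possible health state. In the sketch below the 15% threshold is an assumption: the threshold applied in the study is defined in Table 3, which is not reproduced here, although 15% is consistent with the percentages reported above. The scores are invented.

```python
def proportion_at_best(scores, best_score):
    """Share of respondents whose score equals the best possible state."""
    return sum(1 for s in scores if s == best_score) / len(scores)


def has_ceiling_effect(scores, best_score, threshold=0.15):
    """Flag a ceiling effect when the share at best exceeds the threshold.

    threshold=0.15 is an assumed value, not taken from Table 3.
    """
    return proportion_at_best(scores, best_score) > threshold


# Invented EQ-5D-Y-3L level sum scores; 5 is the best possible score.
sample_with_ceiling = [5, 5, 6, 6, 7, 7, 8, 8, 9, 10]     # 20% at best
sample_without_ceiling = [5, 6, 6, 7, 7, 8, 8, 9, 9, 10]  # 10% at best

flag_a = has_ceiling_effect(sample_with_ceiling, best_score=5)
flag_b = has_ceiling_effect(sample_without_ceiling, best_score=5)
```

Floor effects can be checked with the same function by passing the worst possible score instead of the best.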

Of note, and as expected, ceiling effects were less likely to occur in longer instruments (statistically less likely); as well as less likely in instruments that included more items expected to be of concern for children in our sample, for example, school problems, paying attention, or cognitive domains not included in shorter instruments. This is particularly highlighted through no ceiling effects being detected for any instrument in the anxiety/depression sample, as all instruments included at least one item related to sadness, worry, or emotional problems. See ESM Table S5 for figures displaying the distribution of responses for each item on each HRQoL instrument for the combined sample, and for all subgroups.

3.2.3 Construct Validity—Known-group Validity

The four instruments in the main comparison—the PedsQL, CHU9D, EQ-5D-Y-3L and EQ-5D-Y-5L—performed almost equally well across all known-group analyses, with total scores only differing by 1 of 28 known-group comparisons. In the combined sample, the largest effect sizes were observed for differences in severity on the RCADS-25 (range: large ES = 1.08–1.49), followed by severity on the SDQ (moderate–large ES = 0.69–1.16); severity on the SWAN (small–moderate ES = 0.42–0.63) and presence of SHCN (small ES = 0.27–0.39). This same pattern of effect sizes was observed in all subgroup analyses. The instruments were equally able to identify known groups across each subgroup, with the exception that known groups were better identified via self-report than proxy report for all instruments. For full results, see ESM Table S6.
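For readers less familiar with effect sizes, a standardised mean difference between two known groups can be computed as sketched below. This assumes Cohen's d with a pooled standard deviation and the conventional 0.2/0.5/0.8 boundaries for small, moderate and large effects; the exact formula and thresholds used in the study are specified in Table 3 and the statistical analysis plan [25]. The scores are invented.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d (group_b minus group_a) using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_b) - mean(group_a)) / sqrt(pooled_var)


def effect_label(d):
    """Conventional labels for the absolute size of d (assumed cut-offs)."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "moderate"
    if size >= 0.2:
        return "small"
    return "negligible"


# Invented level sum scores for a low-severity and a high-severity group.
low_severity = [1, 2, 3, 4, 5]
high_severity = [3, 4, 5, 6, 7]
d = cohens_d(low_severity, high_severity)
```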

3.2.4 Construct Validity—Convergent Validity

In the combined sample, the intercorrelations between the four main HRQoL instruments were all ‘strong’ (Spearman ρ, range: 0.62–0.73, all p < 0.001), and in the expected direction. Next, examining correlations between generic HRQoL instruments and the mental health symptom measures revealed the expected pattern of ‘moderate/strong’ correlations between generic HRQoL instruments and the SDQ (ρ = 0.42–0.60; p < 0.001). However, relationships were stronger than hypothesised between generic HRQoL instruments and the RCADS-25, all ‘strong’ correlations (ρ = 0.53–0.67; p < 0.001); and weaker than hypothesised between generic HRQoL instruments and the SWAN, all ‘weak’ correlations (ρ = 0.20–0.25; p < 0.001). Subgroup analyses were largely in line with the combined sample, with a notable exception regarding the SWAN, where correlations were strengthened to ‘moderate’ in subgroups of 13- to 18-year-olds and boys. For full results, see ESM Table S7.
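Spearman's ρ is the Pearson correlation of the ranked scores, with tied values given their average rank. The sketch below implements this directly. The strength labels use assumed boundaries of 0.1, 0.3 and 0.5, which are consistent with the labels reported in this section but are not taken from Table 3.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def strength_label(rho):
    """Assumed boundaries for the absolute value of rho."""
    size = abs(rho)
    if size >= 0.5:
        return "strong"
    if size >= 0.3:
        return "moderate"
    if size >= 0.1:
        return "weak"
    return "negligible"
```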

3.2.5 Responsiveness

Tests of responsiveness were hindered due to sample size more than any other psychometric property; notably, we were unable to examine responsiveness of the instruments to deterioration in health, therefore reporting is limited to improvements in general health and/or the mental health condition.

In the combined sample, the four main instruments all performed well and were able to detect improvements in general and mental health status, though with small effect sizes (SRM range: 0.26–0.39), with the exception of the EQ-5D-Y-5L, which did not detect improvements in general health. The CHU9D was the only instrument of the four to detect improvements related to the child’s mental health condition in girls (SRM = 0.40, p = 0.006), though the EQ-5D-Y-3L was better able to detect improvements in boys’ general and mental health (SRM = 0.46, SRM = 0.38; p < 0.01, respectively). Sample sizes were inadequate (n < 30) or doubtful (n < 50) to examine responsiveness in subgroups of ADHD, all age bands, and proxy report. For full results, see ESM Table S8.
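The standardised response mean (SRM) is the mean change between baseline and follow-up divided by the standard deviation of that change. A minimal sketch with invented scores follows. The sign convention (follow-up minus baseline) is an assumption; for level sum scores, where lower values indicate better health, an improvement would appear as a negative change unless the direction is reversed.

```python
from statistics import mean, stdev


def standardised_response_mean(baseline, follow_up):
    """SRM: mean of paired changes divided by their sample SD."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


# Invented paired scores for three children reporting improved health.
baseline_scores = [10, 10, 10]
follow_up_scores = [11, 12, 13]
srm = standardised_response_mean(baseline_scores, follow_up_scores)
```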

3.2.6 Test–Retest Reliability

In the combined sample, the 95% confidence interval (CI) ranged from ‘fair’ to ‘good’ for all instruments (intraclass correlation coefficient [ICC] 95% CI range 0.41–0.70). Using the ICC thresholds recommended by Cicchetti [38], test–retest reliability in the combined sample was good for the CHU9D (ICC 0.60, 95% CI 0.47–0.70), and fair for the PedsQL, EQ-5D-Y-3L and EQ-5D-Y-5L (ICC 0.59, 0.57, 0.55, respectively), though all estimates were within 0.05 of one another. Overall, the CHU9D and PedsQL most consistently showed good test–retest reliability across each subgroup, though subgroup analyses revealed relative strengths for each instrument. For example, the CHU9D was more reliable in the anxiety/depression than the ADHD sample, though conversely the PedsQL was more reliable in the ADHD compared with the anxiety/depression sample. In the anxiety/depression sample (ICC 0.73, 95% CI 0.56–0.83) and the 13- to 18-year-old sample (ICC 0.76, 95% CI 0.63–0.85), the EQ-5D-Y-3L outperformed the CHU9D and the PedsQL. Both instruments were more reliable in boys than girls. Sample sizes were inadequate in 4- to 6-year-olds and by proxy report for each instrument.

Using the more restrictive thresholds recommended by Koo and Li [39], only the EQ-5D-Y-3L reached ‘good’ reliability, and this was only within the 13- to 18-year-old subgroup (ICC 0.76, 95% CI 0.63–0.85). No instrument reached the ‘excellent’ reliability threshold of ICC >0.90. For full results, see ESM Table S9.
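The two sets of ICC thresholds lead to different labels for the same estimate, which the following sketch makes explicit. The bands are those commonly cited for Cicchetti [38] and Koo and Li [39]; the handling of values falling exactly on a boundary is our assumption, and Table 3 remains the authoritative source for the thresholds applied in this study.

```python
def cicchetti_label(icc):
    """Cicchetti bands: <0.40 poor, 0.40-0.59 fair, 0.60-0.74 good."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


def koo_li_label(icc):
    """Koo and Li bands: <0.50 poor, 0.50-0.75 moderate, 0.75-0.90 good."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"


# Combined-sample point estimates reported in the text.
combined_icc = {"CHU9D": 0.60, "PedsQL": 0.59,
                "EQ-5D-Y-3L": 0.57, "EQ-5D-Y-5L": 0.55}
labels = {name: (cicchetti_label(v), koo_li_label(v))
          for name, v in combined_icc.items()}
```

For example, the CHU9D estimate of 0.60 is labelled ‘good’ under the Cicchetti bands but ‘moderate’ under the Koo and Li bands.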

3.3 Sensitivity Analyses

In sensitivity analyses using utility scores in place of instrument total sum scores for the EQ-5D-Y-3L and CHU9D, results were unchanged from the main analysis with the following exceptions. In convergent validity, correlations between mental health measures and the EQ-5D-Y-3L and CHU9D were weakened, becoming only moderate for the CHU9D and SDQ (ρ = −0.48; p < 0.001). In known-group testing, effect sizes weakened for the EQ-5D-Y-3L and CHU9D for known-group differences on the SWAN (both weak effects, d = 0.43; p < 0.001). For full results, see ESM Table S10.

3.4 Psychometric Performance of the HUI3 and AQoL-6D

Full results for the HUI3 and AQoL-6D are available in the ESM, Tables S1–S10, and are not included in the comparison with other instruments due to the lower sample size for these instruments. Briefly, approximately 70% of participants rated both instruments as ‘somewhat’ or ‘very easy’ to complete, though the AQoL-6D was rated easier to complete via proxy report compared with self-report (72.0% vs 63.5%). No floor or ceiling effects were detected for either instrument. Where sample size allowed, both instruments were able to detect differences in severity on the SDQ with moderate to large effect sizes, but were less consistently able to detect differences in severity on the SWAN, or children with SHCN. Correlations with other HRQoL instruments and mental health symptom measures were largely as hypothesised. Sample sizes were inadequate to assess responsiveness or test–retest reliability of these instruments.

In sensitivity analyses using utility scores for the AQoL-6D and HUI3, results were unchanged from the main analysis with the following exceptions. In convergent validity, correlations between mental health instruments and the AQoL-6D and HUI3 were strengthened, including a change to moderate correlations with the SWAN (ρ = −0.31 [for both]; p < 0.001). In known-group testing, improvements were seen for the HUI3, which was able to detect known-group differences with moderate effect sizes for severity of ADHD using the SWAN (Cohen’s d = 0.67; p < 0.001) and for children with SHCN (d = 0.51; p < 0.001).
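
The known-group effect sizes reported above are Cohen’s d values. As a hedged sketch of how such a statistic is commonly computed, the following uses the pooled standard deviation form; this section does not state which variant was used in the study, and the function name `cohens_d` and the example scores are hypothetical and purely illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Standardised mean difference using the pooled sample SD (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Made-up scores for two hypothetical severity groups (not study data).
lower_severity = [1, 2, 3]
higher_severity = [2, 3, 4]
print(abs(cohens_d(lower_severity, higher_severity)))
```

The absolute value is taken in the usage line because the sign only reflects which group is listed first; how a given magnitude is labelled (weak, moderate, large) depends on the cut-offs adopted in the study's methods.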

3.5 Instrument Comparison

Table 4 provides a high-level summary of the psychometric performance of the PedsQL, EQ-5D-Y-3L, EQ-5D-Y-5L and CHU9D in the combined sample and for each subgroup. All results and summaries reported in the text and in Table 4 are based on analyses with ‘adequate’ or ‘very good’ sample sizes according to the COSMIN guidelines [40]. Results based on ‘doubtful’ or ‘inadequate’ sample sizes are shown in the ESM tables for completeness only, and are not interpreted in the text or in Table 4.

Strong overall performance was observed across psychometric properties for the CHU9D (85.5%) and the PedsQL (81.9%), with some differences observed at the subgroup level favouring different instruments. As expected, the shorter instruments (the EQ-5D-Y-3L and EQ-5D-Y-5L) showed ceiling effects; however, these instruments showed strong performance for feasibility/acceptability, and convergent and known-group validity. In addition, the EQ-5D-Y-3L showed good responsiveness to improvements in health.

4 Discussion

In this study, we examined the psychometric performance of a range of generic paediatric HRQoL instruments in a large sample of children with anxiety and/or depression or ADHD. Overall, we found strong performance by the CHU9D, followed closely by the PedsQL, and more variable performance by the EQ-5D-Y-3L and EQ-5D-Y-5L. The PedsQL, CHU9D, EQ-5D-Y-3L and EQ-5D-Y-5L showed similarly good performance for acceptability/feasibility, known-group validity and convergent validity. The CHU9D and PedsQL showed no floor or ceiling effects and fair–good test–retest reliability. Test–retest reliability was lower for the EQ-5D-Y-3L and EQ-5D-Y-5L. The EQ-5D-Y-3L showed the highest ceiling effects, but was the top performing instrument alongside the CHU9D on responsiveness to improvements in health status, followed by the PedsQL. In the smaller subsample, the AQoL-6D and HUI3 showed good acceptability/feasibility, no floor or ceiling effects, and good convergent validity, yet poorer performance on known-group validity. Responsiveness and test–retest reliability could not be assessed for these two instruments. In subgroup analyses, performance was similar for all instruments for acceptability/feasibility, known-group and convergent validity; however, relative strengths and weaknesses for each instrument were noted for ceiling effects, responsiveness and test–retest reliability. In sensitivity analyses using utility scores, performance regarding known-group and convergent validity worsened slightly for the EQ-5D-Y-3L and CHU9D, but improved slightly for the HUI3 and AQoL-6D.

Our finding of validity and reliability of the CHU9D and PedsQL is in line with the literature review by Mierau et al. [15]. Together, our findings suggest these instruments may be the most suitable of the existing HRQoL instruments for use in economic evaluation of child and adolescent mental healthcare. However, an advantage of the CHU9D over the PedsQL is the existence of an adolescent- and Australian-specific value set for the instrument, whereas the PedsQL (and equally the EQ-5D-Y-3L and EQ-5D-Y-5L) has no validated value set for Australia, which currently limits its usefulness in child and adolescent mental healthcare evaluations [15].

One of the most useful properties of HRQoL instruments is their ability to detect differences between known subgroups within a patient sample, which can be useful in developing economic models. In line with the previous study by Mihalopoulos et al. [24], we found the EQ-5D-Y instruments were able to detect known-group differences based on severity of mental health conditions. However, in our study, performance of the CHU9D and PedsQL was still very high. In line with Mihalopoulos et al., we observed poorer known-group validity using the HUI3. The similarity of our findings with those of Mihalopoulos et al. is notable given the use of different mental health symptom measures in our study, and the use of utility scores instead of sum scores, which leads to results with a different interpretation.

Arguably, also amongst the most important psychometric properties of HRQoL instruments for health economic analyses and description of health profiles are the ability of the instrument to detect a change in health status and the reliability of scores across repeated measurement. Our findings are novel in this regard for children with anxiety and/or depression and ADHD, and suggest the CHU9D is the most responsive of the instruments to an improvement in the child’s general health and their mental health; it also had the highest test–retest reliability estimate in the combined sample. The PedsQL also performed well, though it showed lower responsiveness to changes in health status. Notably, however, the follow-up survey was completed at 4 weeks, which may have affected responsiveness and reliability estimates. We did not have adequate sample size to examine responsiveness to deteriorations in health, and this will be a crucial area of future research, particularly given the variable performance observed for the instruments in detecting improvements in health.

Our findings are in line with others that have found the CHU9D performs well in children and adolescents with mental health challenges [15, 19, 20]. However, with regard to the implications this has for clinical and health policy decision making, the choice of HRQoL instrument should also consider the intended use and subgroup in which the instrument would be used; the best performing instrument in each instance may differ. Others have equally noted varying strengths of different HRQoL instruments across different psychometric properties [18]. Furthermore, in addition to the psychometric properties explored here, other practical factors should be weighed in the choice of instrument, such as the length of the instrument (i.e. the PedsQL has 23 items, whereas the CHU9D and EQ-5D-Y instruments are much shorter at 9 items and 5 items, respectively); licensing fees; technology and resources available; and the availability of a country-specific value set, etc. [41]. The balance of these considerations may vary for commercially sponsored drug trials, routine clinical use, research purposes or health economic analyses. As we and others have noted, these instruments have different properties and are accompanied by utility algorithms that are fundamentally different. This, combined with the potential need to choose different instruments for different populations or research questions, has profound implications for the ability to compare QALY estimates generated in each scenario.

Tests of known-group and convergent validity revealed a consistent pattern for all HRQoL instruments, where each was more closely aligned with measures of internalising disorders (i.e. anxiety or depression) than with measures of externalising disorders (i.e. ADHD). This pattern has been noted previously [15] and was apparent in our results through larger effect sizes and correlations being observed between HRQoL instruments and the RCADS-25 (measuring anxiety/depression symptoms); followed by the SDQ (measuring a combination of internalising and externalising symptoms); and lastly the SWAN (measuring ADHD symptoms). This pattern of results appears to be consistent for all HRQoL instruments, regardless of the number of mental-health-related items included in the instrument. This pattern could arise for a number of reasons: (i) the instruments are simply more valid and reliable in internalising conditions [13, 14]; (ii) children and adolescents with internalising conditions have poorer HRQoL than those with externalising conditions [4, 24], making the differences between groups larger and easier to detect; or (iii) the HRQoL instruments themselves are measuring internalising symptoms more so than externalising symptoms, making a relationship between poorer HRQoL and greater internalising symptoms a tautology [12, 42].

Further to this point, in our sensitivity analyses using utility scores, the performance of some HRQoL instruments changed in relation to the ADHD-symptom measure. Specifically, using utility scores improved the functioning of the HUI3, which saw larger differences between known groups of ADHD symptoms, yet performance worsened for the EQ-5D-Y-3L and CHU9D, which saw smaller differences between these groups. This suggests that the preference weightings (derived in different countries) used to generate the utility scores for each instrument differentially weight problems that are impacted by ADHD. This highlights that the differences in performance between the instruments are ultimately a product of the measurement properties of both the descriptive systems and value sets, and can vary between instruments depending on the constructs being measured, and the characteristics of the value sets, including the valuation approaches used. This overlap of mental health and HRQoL instruments, and the types of mental health symptoms captured by HRQoL instruments when scored either by total sum scores or utility scores, warrants further attention in future research.

Strengths of the study include that it is the largest of its kind internationally and that close control of data quality was maintained (see technical methods [25]). This provides the best evidence to date on the comparative acceptability, validity, reliability and responsiveness of paediatric generic HRQoL instruments for use in child and adolescent mental healthcare, and the first multi-instrument comparison examining responsiveness and test–retest reliability in both internalising and externalising mental health populations. An additional key strength of the study is the use of validated mental health symptom measures to assess symptom severity for use in known-group validity testing. The study also has limitations. Children’s mental health diagnosis and changes related to this condition at follow-up were reported by their caregivers. While we were able to measure children’s anxiety/depression and ADHD symptoms with valid instruments, mental health diagnoses were not confirmed through an independent clinical diagnosis at either time point. As is common in longitudinal research, we found it difficult to identify large numbers of children with declining health at follow-up. Many children are in treatment, or have a natural history of disease that leads to improvements over time. This is an ongoing issue for validation of responsiveness for these instruments. Our findings may not be generalisable to other mental health conditions, given that variable performance of HRQoL instruments has been observed across different mental health conditions in the adult literature [13, 14], and now also in our study. While we made careful efforts to maximise the quality of the online sample [25], limitations arise due to the use of online panel recruitment, including the potential for sampling bias, self-selection into the survey by participants, and the inability to verify whether self-report occurred or whether there was parental influence in children’s self-report.
It is also important to note that the ‘total performance score’ for each instrument is constructed by the authors, which may mean that overall instrument performance could be calculated and interpreted in other ways.

In summary, our results indicate that the CHU9D, PedsQL, EQ-5D-Y-3L and EQ-5D-Y-5L perform equally well on acceptability/feasibility, known-group and convergent validity. However, relative strengths of the CHU9D and PedsQL were observed regarding their lack of ceiling effects, and greater test–retest reliability. Relative strengths were also observed for the CHU9D and EQ-5D-Y-3L regarding responsiveness to improvements in health. While each instrument showed strong performance in some areas, the CHU9D and PedsQL showed the most consistent performance across all psychometric properties. Instrument performance varied across subgroups, particularly for ceiling effects, responsiveness and test–retest reliability; thus, careful consideration of the choice of instrument is advised, as this may differ depending on the intended use of the data, and the age, sex, report type and type of mental health condition of the population in which the instrument is being used. In addition, the closer relationship of these HRQoL instruments with internalising symptoms compared with externalising symptoms warrants targeted attention in future research.