Co-occurring mental health (MH) disorders are more frequent in the intellectual disability (ID) population than in the general population (Einfeld et al., 2011; Munir, 2016). MH disorders result in reduced functioning and an increased need for help in everyday life at home, at school, or at work, in addition to the difficulties arising from ID itself (Einfeld et al., 2011; Halvorsen et al., 2019). These difficulties are associated with reduced quality of life for the person and the family (Hastings et al., 2001; Lin et al., 2009). Accordingly, careful assessment of mental health should be an essential component of care for all people with ID and should be integrated into clinical practice. The identification of MH disorders is, however, considered difficult due to the considerable symptom overlap between ID and MH disorders and the problems of distinguishing between the conditions (Einfeld et al., 2011). Additionally, the communication difficulties and atypical symptom presentations that accompany more severe ID make assessment challenging (Stratis & Lecavalier, 2015).

The use of relatively broadband standardized instruments is generally recommended in the initial assessment of MH disorders. Few currently available instruments have been specifically developed for children and adolescents with ID (e.g., Aberrant Behavior Checklist [ABC; Aman & Singh, 1986]; Developmental Behavior Checklist [DBC; Einfeld & Tonge, 1992]); accordingly, instruments not originally developed for this population are commonly used (e.g., Achenbach System of Empirically Based Assessment [ASEBA]; Strengths and Difficulties Questionnaire [SDQ; Goodman, 1997]). However, more knowledge is needed about valid and standardized measures of MH problems among children and adolescents with ID.

A previous systematic review evaluated the suitability, in terms of psychometric properties (i.e., reliability and validity), of MH measures commonly used for people of all age groups (i.e., children, adolescents, and adults) with severe and profound ID (Flynn et al., 2017). Flynn et al. (2017) found that very few measures were available and recommendable (i.e., with sound psychometric properties) for adults. Furthermore, they found no eligible studies reporting psychometric properties of instruments for children and adolescents with severe and profound ID. Accordingly, there is an urgent need for more knowledge about the reliability and validity of MH instruments used among children and adolescents across the whole ID spectrum. Such knowledge of measurement properties will inform the clinical and research fields about the strengths and weaknesses of these instruments and point to further developmental needs in this field.

Objective

This systematic review aimed to provide an overview of relevant general measures for assessing MH problems among children and youths with ID. More specifically, the research question was the following: What are the psychometric properties of measurement tools used to assess general MH problems in children and adolescents with ID aged 4–20 years? We set this age range because (i) very few MH measurement tools have been developed for children under four years of age, particularly in the ID population, and (ii) we wanted predominantly child/youth samples, as these were the focus of this review, and including adults could yield findings that are not necessarily transferable to children.

Methods

The protocol for this systematic review was registered in PROSPERO, an international register for systematic reviews with health-related outcomes (CRD42020172186). PRISMA guidelines were used for the reporting process (Moher et al., 2009). The PRISMA checklist is available in Appendix I.

Inclusion and Exclusion Criteria

We included papers if they met the following criteria: (a) at least 70% of the sample in the study were reported as having intellectual functioning equivalent to a full-scale intelligence quotient (FSIQ) ≤ 80, established either by means of a standardized intelligence test or a diagnosis of ID, or indirectly by parent report or attendance at a special school for children and youths with ID; (b) the study was based on a sample of children and youths with a mean age between 4 and 20 years (samples including participants above 25 years of age were excluded, as the focus of this review was on children and adolescents); (c) the paper reported original data on quantitative or psychometric outcomes for general MH measures and was published in a peer-reviewed journal or as a PhD dissertation; and (d) the paper focused on the development, adaptation, or evaluation of a measure of MH. The inclusion criteria for MH problems were derived from the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (World Health Organization, 2010). Eligible MH problems and their key diagnostic symptoms, with onset usually occurring during childhood and adolescence, were classified as follows: (a) F20–29: schizophrenia, schizotypal, and delusional disorders; (b) F30–39: mood (affective) disorders; (c) F40–48: neurotic, stress-related, and somatoform disorders; and (d) F91–94: behavioral and emotional disorders. Accordingly, we did not include disorders of adult personality and behavior (F60–69), organic mental disorders, disorders due to psychoactive substance abuse, behavioral syndromes associated with physiological disturbances and physical factors, neurodevelopmental disorders (ID, attention-deficit/hyperactivity disorder, autism spectrum disorders, or specific developmental disorders), motor disorders (Tourette syndrome), or other behavioral and emotional disorders with onset usually occurring in childhood and adolescence that fall outside F91–94 (e.g., pica or stereotyped movement disorder).

We excluded papers that: (a) were published before 1980, in accordance with Flynn et al. (2017); (b) used specific MH measures with fewer than two symptom domains, as the focus of this review was on broadband/general measurement tools; (c) focused on evaluating psychotropic drug interventions; or (d) reported only descriptive mean scores for ID samples (e.g., genetic syndromes) with no other psychometric information.

Search Methods for Identification of Studies

We searched Medline (Ovid), Embase (Ovid), PsycINFO (Ovid), Health and Psychosocial Instruments (Ovid), CINAHL (EBSCO), ERIC (EBSCO), and Web of Science from 1980 through February 21st, 2020. The trial registers ClinicalTrials.gov and WHO International Clinical Trials Registry Platform (ICTRP) were also searched for ongoing and unpublished trials on May 16th, 2021. An updated search for each included measurement tool was performed on March 13th, 2021.

The search strategy was developed by an information librarian (BA) using a wide range of search terms for intellectual and developmental disabilities, MH issues, children and adolescents, and psychometric properties. No limits were applied to the study design, language, or publication type. The search strategy was adapted to each database (see complete search strategies in Appendix II).

The bibliographies of all included studies and previous systematic reviews were also searched for relevant studies. We contacted experts in the field to identify additional unknown studies; four additional papers were identified in this manner, but none met the inclusion criteria (Brinkley et al., 2007; Kaat et al., 2014; Ono et al., 1996; Siegfrid, 2000).

Study Selection

All titles and abstracts were independently screened by at least two reviewers (MBH, who screened all references, together with BA, SBH, and MM) in Covidence. All full-text papers were independently screened by two reviewers (always MBH, in addition to SBH, BA, MM, or PHB). Disagreements were resolved by discussion, and if needed, a third author (MM) was consulted to reach a final decision.

Data Extraction (and Synthesis)

Data were extracted into a table format by one reviewer (MBH or BA) and were checked for accuracy by a second reviewer (SBH or PHB). The extracted data included study design, country, participant demographics (age and sex) and clinical characteristics (i.e., ID severity, adaptive level, and comorbid diagnoses), rater characteristics (i.e., parent/caregiver, teacher, or other), and information about the data analyses/psychometric properties.

The data were summarized with a narrative synthesis across all studies reporting on each measurement tool.

Methodological Quality of MH Measures

As the objective was to assess the psychometric properties of the MH measures as they appeared in the studies we identified, we did not assess the methodological quality of the included studies themselves. Originally, we planned to use the COSMIN Risk of Bias Checklist (Mokkink et al., 2018) to evaluate the psychometric properties of the identified MH measures. However, we found that this tool was more suitable for assessing outcome studies. Accordingly, we chose the EFPA review model for the description and evaluation of psychological and educational tests (European Federation of Psychologists' Associations [EFPA], 2013) to guide the assessment of the psychometric properties (Table 1). More specifically, reliability (i.e., internal consistency, test–retest reliability, and interrater reliability) and validity (i.e., criterion validity, content validity, and construct validity) were evaluated by means of the interpretation guidelines from the EFPA review model (EFPA, 2013) using a four-point scale (0 = not reported/not applicable; 1 = inadequate; 2 = adequate; 3 = excellent/good). We did not evaluate the measures' reported norms. See Table 1 for more information. This quality assessment was conducted independently by MBH and SBH for 20 randomly chosen studies reporting psychometric statistics. The interrater reliability of these assessments showed an excellent degree of correspondence (r = 0.92) for the sum scores. Given this high degree of correspondence, the remaining articles/studies (n = 29) were randomly distributed between MBH and SBH. If uncertainty in scoring arose, it was discussed between MBH and SBH; if needed, a third author (MM) was consulted before agreement was reached.
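To make the interrater check concrete, the following is a minimal sketch (our illustration with hypothetical scores, not the authors' code) of how the correspondence between the two raters' EFPA sum scores could be computed as a Pearson correlation.

```python
# Minimal sketch: Pearson's r between two raters' EFPA quality sum scores
# (0-35 scale) across 20 double-rated studies. All scores are hypothetical.
import numpy as np
from scipy.stats import pearsonr

rater_a = np.array([12, 18, 7, 25, 14, 9, 20, 16, 11, 23,
                    5, 17, 13, 22, 8, 19, 15, 10, 24, 6])
rater_b = np.array([13, 17, 8, 26, 13, 10, 21, 15, 12, 22,
                    6, 18, 12, 23, 9, 18, 16, 9, 25, 7])

r, p = pearsonr(rater_a, rater_b)
print(f"Interrater correlation of sum scores: r = {r:.2f} (p = {p:.3f})")
```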

Table 1 Interpretation guidance from the EFPA Review Model (2013) to evaluate the psychometric quality of included measures

All studies pertaining to each individual measurement tool were then included in the overall assessment of each measure, allowing the authors to establish the weight of evidence for each measure in turn.

Results

Literature Selection

The literature searches resulted in 22,692 unique references. We excluded 20,069 after screening titles and abstracts, and we assessed 774 full-text articles, of which 725 were excluded (see Appendix III for excluded studies with exclusion reasons). A total of 49 papers were ultimately included. Details of the study selection process and reasons for exclusion are provided in Fig. 1. There were very few cases in which a third reviewer (MM) was required to resolve disagreements. We focused on MH assessment instruments for children and adolescents with chronological ages of 4–20 years. Some assessment tools had additional supporting data for older age groups, but this information was not included in the current review.

Fig. 1 PRISMA Flow Diagram

MH Instruments

A total of 49 papers reporting on 10 instruments for assessing MH problems among children and adolescents with ID were identified and included (Aman et al., 1996; Baraldi et al., 2013; Borthwick-Duffy et al., 1997; Bostrom et al., 2016; Braga et al., 2018; Brereton et al., 2006; Brown et al., 2002; Chadwick et al., 2000; Clarke et al., 2003; Coe et al., 1999; Dekker et al., 2002a, 2002b, 2002c; Dieleman et al., 2018; Douma et al., 2006; Einfeld & Tonge, 1995; El-Keshky & Emam, 2015; Embregts et al., 2010; Emerson, 2005; Esbensen et al., 2018; Freund & Reiss, 1991; Hassiotis & Turk, 2012; Hastings et al., 2001; Haynes et al., 2013; Jacola et al., 2014; Kaptein et al., 2008; Koskentausta & Almqvist, 2004; Koskentausta et al., 2004; Marshburn & Aman, 1992; Masi et al., 2002; Matson et al., 1984; Mircea et al., 2010; Murray et al., 2020; Norris & Lecavalier, 2011; Oliver et al., 2003; Oubrahim & Combalbert, 2019; Reiss & Valenti-Hein, 1994; Rice et al., 2018; Rojahn & Helsel, 1991; Rojahn et al., 2010; Sansone et al., 2012; Taffe et al., 2007; Tasse & Lecavalier, 2000; Tasse et al., 1996; Tonge et al., 1996; van Lieshout et al., 1998; Wallander et al., 2006; Wolf, 1981; Wright, 2010) (see Tables 2 and 3). Of these instruments, seven were developed for the intellectual and developmental disability (IDD) population (i.e., ID instruments), while three were not originally designed for use in this population (i.e., non-ID instruments) (Table 3).

Table 2 Overview of studies: study characteristics and psychometric data
Table 3 Description of included instruments from all studies

The included assessment instruments were intended to screen for a relatively broad spectrum of problems, so-called broadband assessment instruments. The frequency, severity, and duration of target behaviors were most often used to measure MH problems. Five papers gave a first report of the development or adaptation of a new instrument (Challenging Behavior Interview [CBI; Oliver et al., 2003]; Developmental Behavior Checklist [DBC; Einfeld & Tonge, 1995]; Nisonger Child Behavior Rating Form [NCBRF; Aman et al., 1996]; Reiss Scales for Children’s Dual Diagnosis [Reiss; Reiss & Valenti-Hein, 1994]; Well-Being in Special Education Questionnaire [WellSEQ; Bostrom et al., 2016]).

Overall, the identified instruments reported their development/framework through a broadly defined bottom-up approach (i.e., a descriptive-empirical approach) based on specific descriptors of children’s functioning. The individual symptoms (i.e., items) were either based on other existing questionnaires (e.g., the NCBRF was adapted from the Child Behavior Rating Form, and the SDQ was adapted from the Rutter Questionnaire) and/or on a literature review of the field, expert consultation, or case files from IDD services (i.e., the ID instruments). The latest version of the ASEBA (e.g., the Child Behavior Checklist [CBCL]) additionally reported six subscales based on the Diagnostic and Statistical Manual of Mental Disorders (DSM) (Achenbach & Rescorla, 2001). The ASEBA (i.e., the CBCL, Teacher Report Form [TRF], and Youth Self Report [YSR]) was the most comprehensive measure identified in terms of the number of items (i.e., 120 items, compared to a mean of 53 items for the other measures). The SDQ, on the other hand, lacked DSM-oriented subscales.

It is noteworthy that the majority of the instruments were proxy- or informant-based measures; the exceptions were the WellSEQ (Bostrom et al., 2016), an ID instrument, and the ASEBA and SDQ, both non-ID instruments, which also offered a youth self-report form. Five papers reported using the youth self-report form (ASEBA: Douma et al., 2006; SDQ: Embregts et al., 2010; Emerson, 2005; Haynes et al., 2013; WellSEQ: Bostrom et al., 2016). The other papers reported using parents/primary caregivers, teachers, and (health) care staff as informants (Table 2). All identified papers reporting on the instruments were published in English. Moreover, all instruments were originally developed in English, with the exception of the WellSEQ (Bostrom et al., 2016), which was developed in Swedish. However, the majority of the identified measures had one or more studies reporting psychometric properties for non-English versions, with the exceptions of the ABC (Aman & Singh, 1986) and the Behavior Problem Checklist (BPC; Quay & Peterson, 1975, 1983) (see Table 2).

We found that most papers used the ASEBA (11 papers), followed by the DBC (10 papers) and then, in descending order, the SDQ (7 papers), the ABC and NCBRF (6 papers each), the Behavior Problems Inventory-01 (BPI-01; 4 papers), the BPC (3 papers), and the CBI, Reiss, and WellSEQ (1 paper each). For seven of the measures, at least one of the original developers was involved in the evaluation (ABC: Brown et al., 2002; Marshburn & Aman, 1992; BPI-01: Baraldi et al., 2013; Mircea et al., 2010; Rojahn et al., 2010; CBI: Oliver et al., 2003; DBC: Brereton et al., 2006; Clarke et al., 2003; Dekker et al., 2002a, b, c; Einfeld & Tonge, 1995; Taffe et al., 2007; Tonge et al., 1996; NCBRF: Aman et al., 1996; Tasse et al., 1996; Rojahn et al., 2010; Tasse & Lecavalier, 2000; Reiss: Reiss & Valenti-Hein, 1994; WellSEQ: Bostrom et al., 2016).

In relation to participant samples, the vast majority of studies included mixed samples of people with ID; the exceptions were pure syndrome-specific samples, in alphabetical order: (i) Down syndrome (Coe et al., 1999; Dieleman et al., 2018; Esbensen et al., 2018; Jacola et al., 2014); (ii) Fragile X syndrome (Sansone et al., 2012); (iii) Prader-Willi syndrome (van Lieshout et al., 1998); and (iv) Williams syndrome (Braga et al., 2018). In most studies, the major proportion of the sample was reported to have up to a moderate ID level, with the exception of a few studies reporting a high proportion of likely more severe ID (Chadwick et al., 2000; Hastings et al., 2001; Oliver et al., 2003). It should be noted, as shown in Table 2, that in general, very few studies reported formal IQ data and/or data concerning participants’ adaptive functioning level. FSIQ was reported only with the ABC (two papers), the ASEBA (five papers), and the BPI-01 and NCBRF (one paper each). In relation to sex, the papers overall reported on samples with a higher proportion of boys, with the exception of five studies (ASEBA: Braga et al., 2018; Jacola et al., 2014; BPI-01: Mircea et al., 2010; Oubrahim & Combalbert, 2019; NCBRF: Mircea et al., 2010). Moreover, population-based samples were reported in six papers (ASEBA: Wallander et al., 2006; DBC: Dekker et al., 2002a, b, c; Einfeld & Tonge, 1995; Taffe et al., 2007; Tonge et al., 1996; SDQ: Emerson, 2005), and special education/school samples were reported in 19 papers (ABC: Brown et al., 2002; Chadwick et al., 2000; Marshburn & Aman, 1992; ASEBA: Dekker et al., 2002a, b, c; Douma et al., 2006; Wright, 2010; BPC: Matson et al., 1984; Wolf, 1981; BPI-01: Rojahn et al., 2010; CBI: Dekker et al., 2002a, b, c; Oliver et al., 2003; DBC: Hastings et al., 2001; NCBRF: Norris & Lecavalier, 2011; Rojahn et al., 2010; Tasse & Lecavalier, 2000; SDQ: El-Keshky & Emam, 2015; Haynes et al., 2013; Kaptein et al., 2008; WellSEQ: Bostrom et al., 2016). The remaining papers reported on some form of community sample, specific syndrome sample (as noted above), or patient sample (Table 2). Regarding sample sizes, 30 papers reported a sample size of 100 participants or more, as shown in Table 2.

Methodological Quality of MH Measures

The quality assessment of the psychometric properties (i.e., reliability and validity) of the MH measures as they appeared in the papers/studies indicated overall summary scores ranging from 0% (i.e., the relevant properties were not reported; Tasse & Lecavalier, 2000) to 89% (i.e., the majority of the properties were documented in a large sample and found to be good/excellent; Einfeld & Tonge, 1995) (Table 4).

Table 4 Quality assessment of instruments

All measures except the BPC and CBI were supported by evidence regarding internal consistency, and in general, the internal consistency of the scales across instruments was adequate (i.e., 0.70–0.79) to good/excellent (i.e., ≥ 0.80) (see Table 1 and the Method section for more details). Evidence of interrater reliability was found for all measures except the CBI and Reiss; however, with few exceptions (i.e., the BPI-01 and DBC), the evidence indicated inadequate agreement (i.e., < 0.60) in most instances. In terms of consistency over time, although no evidence was found (i.e., it was not examined/reported) for the CBI, Reiss, and SDQ, all other measures were supported by adequate (i.e., 0.60–0.69) or good/excellent (i.e., ≥ 0.70) test–retest reliability, with the exception of the BPC (i.e., inadequate reliability: ≤ 0.60). However, most studies examining test–retest reliability had inadequately small sample sizes (N < 100), with some exceptions (ASEBA: Dekker et al., 2002a, b, c; Wallander et al., 2006; DBC: Einfeld & Tonge, 1995).
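For readers less familiar with these reliability indices, the sketch below (our illustration, not drawn from the reviewed studies; all data are hypothetical) shows how internal consistency is typically quantified with Cronbach's alpha and compared against the EFPA benchmarks cited above.

```python
# Minimal sketch: Cronbach's alpha for an (n_respondents, k_items) score
# matrix, judged against the EFPA benchmarks used in this review
# (0.70-0.79 adequate; >= 0.80 good/excellent). Data are hypothetical.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum(item variances) / total variance)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the sum score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical ratings: 200 respondents x 10 items, scored 0-3,
# correlated through a shared latent trait
rng = np.random.default_rng(1)
trait = rng.normal(size=(200, 1))
ratings = np.clip(np.round(1.5 + trait + rng.normal(0, 0.8, (200, 10))), 0, 3)

alpha = cronbach_alpha(ratings)
band = "good/excellent" if alpha >= 0.80 else "adequate" if alpha >= 0.70 else "inadequate"
print(f"Cronbach's alpha = {alpha:.2f} ({band} by EFPA benchmarks)")
```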

Regarding the validity of the measures, little evidence of criterion-related validity and content validity was found. Most of the studies used clinician-rated diagnosis/caseness as a criterion or examined meaningful/hypothesized group differences in subscale scores across diagnostic groups (ABC: Rojahn & Helsel, 1991; ASEBA: Douma et al., 2006; Koskentausta et al., 2004; DBC: Brereton et al., 2006; Clarke et al., 2003; Dekker et al., 2002a, b, c; Einfeld & Tonge, 1995; Hassiotis & Turk, 2012; Koskentausta & Almqvist, 2004; NCBRF: Norris & Lecavalier, 2011; Reiss: Reiss & Valenti-Hein, 1994). Regarding the SDQ, Emerson (2005) reported correspondence between subscale scores and diagnoses from a diagnostic interview (Development and Well-Being Assessment; Goodman et al., 2000) that had not been validated for persons with ID. We identified more reports of criterion-related validity (i.e., clinician-rated diagnosis/caseness or hypothesized group differences in subscale scores across diagnostic groups) rated as good/excellent (i.e., ≥ 0.35) for the DBC than for the other measures, and no evidence on this aspect for the BPC, BPI-01, CBI, and WellSEQ (Table 4). Evidence of content validity was reported for most of the ID measures (CBI, DBC, NCBRF, Reiss, and WellSEQ) and for one non-ID measure (SDQ).
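To illustrate the kind of statistic behind these criterion-validity ratings, the following sketch (ours, with hypothetical data, not taken from any reviewed study) computes a point-biserial correlation between a subscale score and clinician-rated caseness and checks it against the EFPA benchmark of ≥ 0.35 for good/excellent.

```python
# Illustrative only (hypothetical data): criterion-related validity as the
# point-biserial correlation between a subscale score and clinician-rated
# caseness (0 = no diagnosis, 1 = diagnosis).
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(42)
caseness = rng.integers(0, 2, size=150)                    # hypothetical diagnoses
scores = 20 + 8 * caseness + rng.normal(0, 6, size=150)    # hypothetical subscale scores

r, p = pointbiserialr(caseness, scores)
verdict = "good/excellent" if r >= 0.35 else "below the EFPA benchmark"
print(f"Point-biserial r = {r:.2f} ({verdict})")
```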

The majority of measures were supported by evidence of construct validity in terms of correlations between instruments assessing similar constructs, with the exception of the BPC and Reiss, for which no evidence was found. Regarding the non-ID instruments, evidence of construct validity was reported for the ASEBA and SDQ, with ID instruments as the most commonly used benchmarks (in alphabetical order: ABC, BPI-01, DBC, NCBRF, and Psychopathology Instrument for Mentally Retarded Adults) (see Table 2). Likewise, evidence of construct validity was reported for the ID instruments, for which the other ID measures were most often used as benchmarks (in alphabetical order: ABC, BPI-01, DBC, and NCBRF) (Tables 2 and 4). In relation to sample sizes, the NCBRF had evidence of construct validity from the most adequately sized studies (N > 200 in four studies), followed by the BPI-01 and DBC (both with N > 200 in two studies and N = 100–200 in one study), the ASEBA (N > 200 in two studies), and the SDQ (N > 200 in one study). Moreover, the papers/studies examining the CBI and WellSEQ both reported evidence of construct validity based on inadequate sample sizes (N < 100).

Exploratory factor analysis (EFA) was used with all measures except the CBI and WellSEQ, and these studies most often used principal component analysis (Table 2). The measure whose factor structure (FS) was examined in the most papers/studies was the ABC (five studies), followed by the DBC (four studies), the NCBRF (three studies), the BPI-01 (two studies), and the ASEBA, BPC, and Reiss (one study each). In regard to the ABC, all studies except one reported an adequate to good/excellent FS (Tables 2 and 4). The exception, Sansone et al. (2012), reported an inadequate FS in a syndrome-specific sample (Fragile X) and suggested an alternative FS, with one factor unchanged (inappropriate speech), four modified (irritability, hyperactivity, lethargy/withdrawal, and stereotypy), and a new social avoidance factor. Borthwick-Duffy et al. (1997) reported an adequate FS for the ASEBA–CBCL only for the broadband internalizing and externalizing factors, although the analysis was based on an inadequate sample size (N < 100). An inadequate FS was also reported for the BPC in a large study (Matson et al., 1984). The FS of the BPI-01, examined using confirmatory factor analysis (CFA), was found to be good/excellent in one large study in a specialized ID institution in France and inadequate in a large special education sample in the US (Table 4). In regard to the DBC, all studies were large; most reported an adequate FS, and one reported a good/excellent FS, all by means of EFA (Table 4). Regarding the NCBRF, a good/excellent FS was reported in one large EFA study among outpatients, an adequate FS in a large CFA study among special education students/outpatients, and an inadequate FS in another large CFA study among special education students (Table 4). The only study that examined the Reiss was large and reported a good/excellent FS (Reiss & Valenti-Hein, 1994). The FS of the SDQ in terms of the broader internalizing and externalizing subscales (alongside the fifth prosocial subscale) was found to be adequate in one large CFA study among students from Saudi Arabia and Oman (El-Keshky & Emam, 2015) and inadequate in a smaller sample (N = 128) of students from Australia examining the original five-factor structure (Table 4).
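As a rough illustration of the principal component analyses most of these EFA studies relied on, the sketch below (ours; the item matrix and retention rule are hypothetical, not drawn from any reviewed study) extracts components from standardized item-level checklist data and applies the common eigenvalue-greater-than-one retention criterion.

```python
# Illustrative only: principal component analysis of hypothetical
# item-level checklist data (n_children x n_items, items rated 0-3).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
responses = rng.integers(0, 4, size=(300, 40)).astype(float)  # hypothetical ratings

X = StandardScaler().fit_transform(responses)  # standardize items first
pca = PCA().fit(X)

# Kaiser criterion: retain components with eigenvalue > 1
eigenvalues = pca.explained_variance_
n_retained = int((eigenvalues > 1).sum())
print(f"Components with eigenvalue > 1: {n_retained}")
print("Variance explained by first 5 components:",
      np.round(pca.explained_variance_ratio_[:5], 3))
```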

In relation to self-report, the ASEBA–YSR and the WellSEQ were the only measures with reported evidence of adequate reliability and at least one adequate aspect of validity (see Table 4). However, this evidence was not confirmed by supporting studies.

Based on the EFPA review model (see Method), all studies examining each individual measurement tool were then included in the overall assessment of each measure (Table 5), allowing us to establish the weight of evidence for each measure.

Table 5 Summary of overall quality of the psychometric properties of each assessment

As seen in Table 5, the DBC was the only measure with at least two aspects of reliability (i.e., test–retest and interrater) assessed as good/excellent by two studies, in addition to all validity aspects assessed as good/excellent, with more than one supporting study for criterion validity and construct (convergent) validity. The ABC, NCBRF, and BPI-01 had two or more aspects of reliability and validity assessed as good/excellent, but at least two aspects of reliability and validity, each in the good/excellent range, were not confirmed by a supporting study. The non-ID measure ASEBA had two aspects of reliability assessed as good/excellent by two studies, but only convergent validity was reported as good/excellent by supporting studies, and no other validity aspect was assessed in the good/excellent range. The remaining four measures (Reiss, CBI, SDQ, and WellSEQ) had no aspects of reliability assessed as good/excellent with supporting studies; the Reiss had two aspects of validity assessed as good/excellent with no supporting study, and the CBI, SDQ, and WellSEQ each had one aspect of validity assessed as good/excellent. The BPC had no aspect of reliability or validity assessed as good/excellent or adequate.

Furthermore, the average psychometric quality, based on the sum score for each measurement tool from the quality assessment (maximum possible score = 35; Table 4), indicated relatively large differences. In general, the quality of the ID measures (M = 12.03, SD = 7.30) was better than that of the non-ID measures (M = 6.64, SD = 5.16) (Table 5). Moreover, the average psychometric quality (Table 5) based on the quality assessment sum score was quite similar among the different ID measures, although the number of studies reporting psychometric properties for each measure varied greatly (e.g., the DBC in 10 papers versus the WellSEQ in 1 paper). Therefore, when examining, for instance, the ID measures Reiss and WellSEQ, which had relatively high average psychometric quality scores, it is important to be aware that the documentation was very limited, as reflected in the associated standard deviation values in Table 5.

Discussion

Careful assessment of MH is recommended for all people with ID due to this population's high vulnerability to developing MH disorders. Our systematic review of the measurement properties of general MH instruments used with children and adolescents with ID identified documentation for ten instruments. The instruments can be divided into two main groups: instruments specifically developed or adapted for the ID population (ID instruments: the Aberrant Behavior Checklist [ABC], Behavior Problems Inventory [BPI-01], Challenging Behavior Interview [CBI], Developmental Behavior Checklist [DBC], Nisonger Child Behavior Rating Form [NCBRF], Reiss Scales for Children’s Dual Diagnosis [Reiss], and Well-Being in Special Education Questionnaire [WellSEQ]) and instruments developed for the general child population (non-ID instruments: the Achenbach System of Empirically Based Assessment [ASEBA], Behavior Problem Checklist [BPC], and Strengths and Difficulties Questionnaire [SDQ]). All identified instruments were screening measures to be used in an initial assessment of MH problems. Of the identified instruments, only the ASEBA had DSM-oriented subscales. The other instruments, as well as the remaining (non-DSM) ASEBA subscales, were based on specific descriptors of children’s functioning (e.g., from a literature review, case files, expert consultations, or other existing measures), which were then refined through empirical results, most often from principal component analysis.

The main finding from the present systematic review was consistently better documentation of reliability and validity, in terms of higher overall average quality assessment (sum) scores, for the ID instruments than for the non-ID instruments. Overall, the average quality assessment sum scores were comparable among the ID instruments with the most papers reporting psychometric properties (i.e., the ABC, BPI-01, DBC, and NCBRF). For the ID instruments CBI, Reiss, and WellSEQ, the findings were more limited due to very little documentation (i.e., one paper each reporting psychometric properties). Regarding the non-ID instruments, the ASEBA attained a higher overall quality score than the other non-ID instruments (i.e., the BPC and SDQ). Nevertheless, the average overall quality score for the ASEBA was lower than those for the ID instruments ABC, BPI-01, DBC, and NCBRF.

When examining the overall assessment of each measure in more detail, the DBC was the only measure with most aspects of reliability (test–retest and interrater) and all aspects of validity (criterion, content, factor structure, and convergent validity) assessed as good/excellent, with more than one supporting study for at least two aspects of reliability and validity. The other ID instruments, the ABC, BPI-01, and NCBRF, had several aspects of reliability and validity assessed as good/excellent but fewer supporting studies than the DBC. Regarding the non-ID instruments, the ASEBA had two aspects of reliability (internal consistency and test–retest) assessed as good/excellent by two studies; however, with the exception of convergent validity, no validity aspects were assessed as good/excellent. There was less evidence for the reliability and validity of the SDQ than for those of the ASEBA. Based on the documentation identified for the BPC (i.e., no aspects assessed as good/excellent or adequate), we would not recommend the continued use of this instrument in its current form for this population; this is probably already reflected in the fact that the most recent identified study using the BPC is 22 years old (Coe et al., 1999). Regarding documentation of construct validity in terms of correlations between instruments measuring similar constructs (convergent validity), most of the studies using non-ID measures (e.g., the ASEBA) used ID instruments as benchmarks. Moreover, documentation of construct validity in terms of factor structure was limited for both the non-ID measures ASEBA and SDQ, and these analyses favored the use of the broadband scales (i.e., the internalizing and externalizing scales) over the more specific subscales.

It is important to emphasize that the vast majority of studies reporting psychometric properties in the present systematic review involved samples primarily consisting of children and adolescents with a borderline to moderate ID level. This finding is consistent with the findings from a relatively recent systematic review among people of all age groups with severe or profound ID, which found no eligible studies (i.e., at least 70% of the sample within a severe/profound ID level) reporting psychometric properties of measures for children and adolescents (Flynn et al., 2017). Whether the various instruments are suitable for children and adolescents with severe and profound ID is therefore largely unknown and should be investigated in future studies. Furthermore, regarding the ID status of the participants, the majority of the studies in the present systematic review used an administrative operationalization of ID status (e.g., school placement). Accordingly, with few exceptions, a formal IQ assessment or adaptive assessment was not conducted or reported. An implication of this is that the ID concept/condition was loosely defined; therefore, we cannot rule out that the studies included children and adolescents who would not qualify for a formal ID diagnosis.

The perspective of the child or adolescent, in terms of self-report measures, was very limited in the studies identified in the present review. The vast majority of studies used informant-based measures completed by parents/caregivers, teachers, and (health) care staff. We identified three self-report measures (i.e., the ASEBA, SDQ, and WellSEQ), of which the non-ID measure ASEBA–YSR and the ID measure WellSEQ reported very limited data indicating adequate reliability and validity, mainly in samples with mild ID (Bostrom et al., 2016; Douma et al., 2006). However, this evidence was not confirmed by supporting studies. Further development and refinement of self-reporting, where possible, will be an important area for the field. The use of multiple informants, including the youths themselves, is recommended, as individuals who have difficulties conveying information on symptoms verbally may display them in varying ways, and no single informant is likely to have a complete overview of another person’s life (Stratis & Lecavalier, 2015). The heterogeneity of ID suggests that a single measure able to identify MH problems across the ID population is unlikely to be constructed in the near future, underscoring the importance of individualized, multimodal, and multi-informant approaches to assessment (Halvorsen et al., 2022). MH assessment should rely on standardized measures, with the clinician also considering the strengths and weaknesses of the instrument; these strengths and weaknesses have been the focus of the current systematic review.

Our findings should be interpreted in the context of the strengths and limitations of the study. To our knowledge, this review is the first recent systematic review to examine the psychometric properties of measurement tools used to assess general MH problems in children and adolescents across the whole ID spectrum. We did not limit the review to ID instruments only, as the field (clinical and research communities) is characterized by the use of both ID and non-ID instruments. We limited the review to studies using mainly child/adolescent samples and did not include findings from studies that used mixed-age samples including adults above 25 years of age. We chose to do so because a mixed age range that includes adults can yield findings that are not necessarily transferable to children. Additionally, studies that reported only prevalence rates (or instrument mean scores of MH problems/disorders) in children and adolescents with ID were not eligible because they did not report psychometric properties. We did not evaluate the measures' norms, as norming data were reported by only two of the studies (ABC: Marshburn & Aman, 1992; NCBRF: Tasse et al., 1996), and norms for measures published in manuals through publishers were not included. Another limitation of the study is that we did not calculate interrater reliability for the full-text review and data extraction. We did, however, calculate interrater reliability for the quality assessment of 20 randomly chosen studies reporting psychometric properties, and these assessments showed an excellent degree of correspondence (r = 0.92) for the sum scores. Finally, the vast majority of the identified measures had one or more studies reporting psychometric properties for non-English versions. It is important for future studies to establish linguistic equivalence and to determine the consistency of the measures' psychometric properties across languages.

Conclusion

This systematic review contributes to the field of MH assessment among children and adolescents with ID by examining the psychometric properties of measurement tools used to assess general MH problems in children and adolescents with a borderline to moderate level of ID. Our findings support the use of standardized ID instruments as the first choice in an initial assessment. Very few self-report measures have been developed for children and adolescents with ID, and very few studies have examined their suitability. How to integrate youths’ own perspectives in the assessment of MH problems will be an important focus area in the future.