Overview of Ten Child Mental Health Clinical Outcome Measures: Testing of Psychometric Properties with Diverse Client Populations in the U.S.

While many standardized assessment measures exist to track child mental health treatment outcomes, the degree to which such tools have been adequately tested for reliability and validity across race, ethnicity, and class is uneven. This paper examines the corpus of published tests of psychometric properties for the ten standardized measures used in U.S. child outpatient care, with focus on breadth of testing across these domains. Our goal is to assist care providers, researchers, and legislators in understanding how cultural mismatch impacts measurement accuracy and how to select tools appropriate to the characteristics of their client populations. We also highlight avenues of needed research for measures that are in common use. The list of measures was compiled from (1) U.S. state Department of Mental Health websites; (2) a survey of California county behavioral health agency directors; and (3) exploratory literature scans of published research. Ten measures met inclusion criteria; for each one a systematic review of psychometrics literature was conducted. Diversity of participant research samples was examined as well as differences in reliability and validity by gender, race or ethnicity, and socio-economic class. All measures showed adequate reliability and validity; however, half lacked diverse testing across all three domains and all lacked testing with Asian American/Pacific Islander and Native American children. ASEBA, PSC, and SDQ had the broadest testing.
Supplementary Information: The online version contains supplementary material available at 10.1007/s10488-021-01157-z.

While many standardized assessment tools have been developed to track child mental health treatment outcomes on the individual and aggregate level, the degree of testing for reliability and validity across race, ethnicity and class is uneven. Nearly two thirds (62.1%) of children who received treatment through the national Children's Mental Health Initiative (CMHI) from 2013 to 2017 were racial or ethnic minorities or biracial, and 71.3% came from families at or below the federal poverty threshold (SAMHSA, 2020). For this reason, it is crucial that clinicians have the information to select measures with proven reliability and validity for their clients.
This paper reviews the published psychometrics literature on the standardized outcome measures most commonly used in U.S. outpatient community child mental health care, with two aims:
1. To assist clinicians and policy makers in making informed decisions when selecting a standardized outcome measurement system.
2. To highlight needed avenues of additional testing for commonly used measures, to ensure suitability for diverse client populations.
To these ends, this paper presents a summary of the psychometrics tests for each of the investigated measures, focusing on disparities across gender, class, and race and ethnicity, followed by a comparison of the entire corpus to examine which populations are systematically overlooked across studies. We also briefly discuss the importance of taking into account differences in community background (whether these be race or ethnic group, class, or immigrant culture) and highlight specific ways such differences can impact measurement accuracy, as well as recommended readings for clinicians on the topic of culturally congruent care.
The literature scans for this review were conducted as part of a contracted project with the California Department of Health Care Services (DHCS).

Background
Reducing disparities in the access to, and quality of, child mental health care has long been identified as a national priority area (Perou et al., 2013; US Department of Health and Human Services et al., 2000). The U.S. Department of Health and Human Services' National Quality Strategy envisions that quality improvement is driven by linking recommended care processes to meaningful clinical outcomes, as well as aligning financial incentives to promote effective care (AHRQ, 2016). The Patient Protection and Affordable Care Act was passed in 2010 (42 U.S.C. § 18001 et seq.). Although California legislated a mandate in 2011 to develop a performance outcome system for children (SB 1009; California Legislative Analyst's Office, 2011), the state's development of data infrastructures to monitor quality and detect disparities has considerably lagged behind national recommendations (Gardner & Kelleher, 2017; Glied et al., 2015; Patel et al., 2015; Pincus, 2012; Zima et al., 2013).
In 2016, the California Department of Health Care Services (DHCS) contracted our university to help them address the question, "What is the best statewide approach to evaluate functional status for children/youth that are served by the California public specialty mental health service system?" (DHCS, 2015, p. 6). To assist the CA DHCS in developing their outcome monitoring infrastructure, we created a list of all standardized measures in common use for tracking overall child mental health outcomes in the U.S., and ranked them on nine minimum criteria (Pourat et al., 2017; Zima et al., 2019). In this paper, we will lay out the findings from two rounds of systematic literature scans, and examine the breadth and diversity of psychometric testing on each candidate measure, namely: (1) published evidence for its use as a clinical outcome measure; (2) its psychometric properties and variation among diverse study populations; and (3) whether the existing published evidence adequately included children of diverse genders, classes, ethnicities and races. Details about demographics, sample size, etc., for each of the cited studies can be found in the tables. Results are divided into two sections: first the analyses of individual measures, then an aggregate analysis of the entire corpus for systematic patterns and gaps.
The Discussion section examines the significance of these results, particularly the systematic under-representation of Native Americans and Asian Americans. We also discuss the importance of clinicians' having an understanding of cultural and linguistic differences across class, race, and ethnicity (in addition to cultural differences for immigrant families), and guide clinicians in recognizing specific types of misreporting that can occur if a measure is not properly developed for a particular population. Finally, we examine the pros and cons of three possible solutions: creating adjusted scoring guides; using internationally developed questionnaires for immigrant children; and developing measures specifically for under-represented populations.

Compiling a List of Candidate Measures
The first step in this project was to create a pool of all eligible measures in common use in the U.S. that might meet the DHCS's needs, after which we would research each individual measure more closely. We conducted three initial investigations: (1) an exploratory 5-year literature scan; (2) an environmental scan of U.S. state Department of Mental Health agency websites, and (3) two statewide California surveys. Additionally, (4) DHCS requested information on four measures that had been recommended to them by other sources. Table 1 lists which criteria were met for each measure.

Exploratory Literature Scan
Systematic searches of PubMed, PsycInfo, and Scopus were conducted for peer-reviewed journal articles published between January 2010 and December 2015, with English language abstracts, that examined children ages 0-18 years in U.S. community-based outpatient care. 1 (Further details on article inclusion and exclusion criteria are in Supplemental Table 1; the 5-year cutoff was deliberately chosen to capture measures in current use.) A list was compiled of all the standardized measures that were used in these studies as data collection tools to track outcomes, resulting in approximately 225 child clinical outcome measures in 127 unique articles. Thirty-four of these measures appeared in three or more articles. 2 Of these, twenty-one were specific to a single diagnosis or condition (e.g. ADHD), three were general health or quality of life measures, and three were excluded for other reasons, 3 leaving seven candidate measures.

Measures in Common Use (Nationally or in California) or of Interest to the DHCS
An environmental scan of state Department of Mental Health agency websites was conducted to determine which measures were in common use across the U.S. Thirty-five states listed at least one standardized assessment measure, for a total of 36 different measures (Pourat et al., 2016b).
To determine which measures were in common use in California, we conducted a statewide survey of county behavioral health agency directors (56 counties) and a second survey of a convenience sample of outpatient clinic staff (21 responses), which yielded seven eligible measures that were used in more than one county (Pourat et al., 2016a). Finally, the DHCS Subject Matter Experts team requested we also investigate four measures that had been recommended to them (three of which also appeared in the county and state lists).

Required Scope of Candidate Measures
To align with the priorities of our DHCS agency partners, the list was narrowed to measures that had the following characteristics: (1) track overall behavioral or emotional health (i.e. not specific to a single diagnosis such as depression); (2) are designed for children 5-16 years; (3) have been normed or tested with children in the U.S.; and (4) produce quantifiable scores that can be used to compare treatment outcomes of different patients, or can be aggregated to compare the quality of care of different service providers. The final list consisted of all measures meeting those criteria that also met at least one of the following use criteria: (1) appeared in at least 3 studies in the exploratory literature scan (7 measures); (2) was reported by at least 2 California counties (4 measures); or (3) was recommended on at least 2 state DMH webpages (6 measures). One measure on the DHCS interest list did not meet any of the other criteria (TOP); it was also included. This yielded a total of 10 measures flagged for further investigation, listed in Table 1.
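The two-stage screen described above (all four scope characteristics, plus at least one use criterion or DHCS interest) can be sketched as a simple filter. This is only an illustrative sketch: the `Measure` record, its field names, and the example rows ("Measure A" to "Measure D") are hypothetical and do not come from the study data, and the sketch assumes that DHCS interest waives the use criteria but not the scope characteristics.

```python
from dataclasses import dataclass


@dataclass
class Measure:
    """One row of a hypothetical screening table (field names are illustrative)."""
    name: str
    overall_health: bool      # tracks overall behavioral or emotional health
    ages_5_to_16: bool        # designed for children 5-16 years
    tested_in_us: bool        # normed or tested with children in the U.S.
    quantifiable: bool        # produces comparable, aggregable scores
    lit_scan_studies: int = 0
    ca_counties: int = 0
    state_dmh_pages: int = 0
    dhcs_interest: bool = False


def meets_scope(m: Measure) -> bool:
    # All four scope characteristics are required.
    return all([m.overall_health, m.ages_5_to_16, m.tested_in_us, m.quantifiable])


def meets_use(m: Measure) -> bool:
    # At least one of the three use thresholds must be reached.
    return m.lit_scan_studies >= 3 or m.ca_counties >= 2 or m.state_dmh_pages >= 2


def is_candidate(m: Measure) -> bool:
    # Assumption: DHCS interest substitutes for the use criteria only.
    return meets_scope(m) and (meets_use(m) or m.dhcs_interest)


# Hypothetical example rows, not the measures reviewed in this paper.
pool = [
    Measure("Measure A", True, True, True, True, lit_scan_studies=5),
    Measure("Measure B", False, True, True, True, lit_scan_studies=9),
    Measure("Measure C", True, True, True, True, dhcs_interest=True),
    Measure("Measure D", True, True, True, True, ca_counties=1),
]
candidates = [m.name for m in pool if is_candidate(m)]
print(candidates)
```

In this toy pool, "Measure B" is dropped for being diagnosis-specific despite heavy use, and "Measure D" is dropped for falling below every use threshold.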

Examination of Psychometric Properties and Capacity to Identify Disparities
After the candidate list was compiled, ten systematic literature reviews (one for each measure) were conducted to examine psychometric properties and suitability for diverse communities. For each measure, a Web of Science search was made of published articles with English language abstracts from the measure's initial development through March 2019, that tested reliability and/or validity with children under 18 years of age for either symptoms or functioning. 4 (Further details on article inclusion and exclusion criteria are in Supplemental Table 2.) Literature reviews and meta-analyses were manually examined for additional citations. Further citations were obtained from measure developers' or vendors' webpages (if they existed) as well as from articles recommended for inclusion by a DHCS-selected team of Subject Matter Experts.
Studies that focused on specific demographics (e.g. by ethnicity or socioeconomic status) were included, as well as studies focusing on populations that might be recipients of outpatient care in California (e.g. immigrants or adopted children). Studies that used non-English versions of the measure were included if they used an independent metric to test reliability and validity (i.e. were not simply comparing a translation to the English language original).
We examined the characteristics of the research participants in each study and across the entire corpus, as well as noting reported differences in psychometric properties by race or ethnicity, gender, and class or SES. For ethnicity and gender, the researchers' own categories were used. To determine class diversity, we looked for: explicit mention of SES or household income, more general class labels (e.g. "working-class," "upper middle class," "poor"), eligibility for financial services or other aid (e.g. Medicaid or free school lunches), or enrollment in programs specifically designed for low-income families (e.g. Head Start).

Results for Individual Measures
The final list of candidate measures by selection criteria is summarized in Table 1. The candidate measures were Achenbach System of Empirically Based Assessment (ASEBA); Child and Adolescent Functional Assessment Scale (CAFAS); Child and Adolescent Needs and Strengths (CANS); Children's Global Assessment Scale (CGAS); Clinical Global Impressions Scale (CGI); Ohio Youth Problem, Functioning and Satisfaction Scales (Ohio); Pediatric Symptom Checklist (PSC); Strengths and Difficulties Questionnaire (SDQ); Treatment Outcome Package (TOP); and Youth Outcome Questionnaire (Y-OQ).

Use as a Clinical Outcome Measure in Community-based Mental Health Programs
Findings from the exploratory literature scan are summarized in Table 2. The three measures most frequently used to track clinical outcomes among children receiving community-based mental health care were ASEBA (21 studies), 5 CGAS (14 studies), and SDQ (13 studies). Five measures were only used in the U.S. (CAFAS, CANS, Ohio, PSC, and Y-OQ), while four were also used internationally (ASEBA, CGAS, CGI, and SDQ). TOP was added to the list of measures to investigate at the request of DHCS; however, it did not appear in the literature scan. One fifth (13 of 57) of the studies combined multiple candidate measures, most frequently CGAS with either SDQ (4 studies) or CGI (3 studies). Other measures used in combination were: CGI (6 of 8 studies), SDQ (6 of 13), PSC (2 of 2), ASEBA (3 of 21), CAFAS (1 of 2), and Ohio (1 of 4). All measures were applied to children with a diverse range of mental health conditions, including general use across psychiatric conditions (18 studies); broad categories such as behavioral or emotional problems (6 studies) or trauma (7 studies); or specific diagnoses such as anxiety (7 studies) or ADHD (5 studies).
All ten measures were designed for wide age ranges and covered at minimum 5-18 years. Despite this, over one quarter of the studies (15 of 57) used a measure for children outside the recommended age range. Follow-up intervals varied extensively and did not consistently correspond to the measure's recommended use. One fifth (12) of the studies used archival data from existing clinic records, illustrating the feasibility of using the measure in clinical practice, but also indicating that clinics do not always track their clients' outcomes at consistent, regular intervals. Similarly, one third (21 studies) administered the measure only at intake and end of treatment or patient discharge, which led to variation in episode of care across their data set.
Table notes: The "Other measures" column only includes other measures examined in this paper. Follow-up intervals are between consecutive uses (e.g. a study applying a measure at 6, 12, and 18 months has a 6 month interval).
* BPC = Brief Problem Checklist, a 3rd party measure adapted from CBCL/YSR.
** Measure was administered at each session; however, the published article only compared the scores at baseline and 25 weeks.
*** Dura-Vila et al. (2013) (labeled as reference "e") did not specify whether they used a translated SDQ or had an interpreter verbally translate the English measure for the family. Since SDQ is available in all of the listed languages, we will assume the former.

Sample Diversity and Psychometric Properties of Candidate Measures
The following sections summarize the diversity of study samples and overall psychometrics for each individual measure. Sample characteristics for each study are summarized in Table 3. Measures are listed in order of number of published studies. Most had fewer than five studies testing reliability and validity with U.S. children; however, PSC (23 studies), ASEBA (22) and SDQ (12) were more extensively tested.

Sample Diversity
Studies included diversity across ethnicity, culture (including immigrant children), class and gender. One third (7 studies) were predominately (over 60%) White, while half were either mixed (5) or predominately African American (3) or Latino (4). Four studies did not list ethnicity. Two thirds of the studies recruited low-income or Medicaid-receiving participants (9) or used a mixed-class sample (6). Three were mostly middle class, and four did not provide information. Two studies focused on foster youth and five focused on Spanish-dominant parents. Three quarters of the community samples (14 of 18) were recruited via pediatric primary care.
Results suggest using a lower clinical cutoff for disadvantaged families (Simonian & Tarnowski, 2001) and children of Latino immigrants (Jutte et al., 2003). Murphy et al. (1996) found that Mexican immigrant parents scored their children slightly higher when answering the PSC orally than when filling out the written form, suggesting that they are more likely to describe problems verbally. Pagano et al. (1996) and Jutte et al. (2003) also validated PSC for Spanish speaking parents.
Validity results for low-income and minority children were also mixed, as discussed below. Gender results were mixed: Leiner et al. (2007) found no significant gender differences for Mexican families while Boothroyd and Armstrong (2010) found a small to moderate gender effect in a mixed-ethnicity sample.
For children in foster care, PSC showed slightly lower test-retest reliability, moderate convergent validity, and mixed results for discriminant validity (Parker et al., 2019).
Validity for low-income and minority children is mixed: Earlier studies supported the validity of PSC with African American and low-income children (Murphy et al., 1992) and showed comparable validity and reliability compared to middle class children (Jellinek et al., 1986;. However, Kostanecka et al. (2008) found PSC-17's externalizing and attention subscales to have low discriminant validity with their low-income, predominately African American sample.
PSC showed high specificity compared to pediatrician ratings: Jellinek et al. (1988, 1995) found that children who had experienced high stress might meet clinical criteria on PSC even when rated as functional by pediatricians. However, they also found an overall trend of pediatricians underdetecting problems when compared to child psychologists, particularly for low-income families (Jellinek et al., 1995).

Sample Diversity
Studies included diversity across class and gender although half of the samples (12 out of 25) had larger proportions [...] Half the studies (13 studies, 14 samples) included class information. Of these, the community samples were either middle class (3 from 2 studies) or mixed (4 from 5 studies), while the samples of children receiving mental health care were predominately lower income (5 of 7). ASEBA was also tested with families with special needs adopted children (Tharinger et al., 1986) and military families (Jensen et al., 1993, 1996). Two studies validated CBCL for Spanish-speaking Puerto Rican parents: Rubio Stipec et al. (1990), who used their own translation, and Bird et al. (1987a). Konold et al. (2004) found no effect of child/parent gender, but others found that the Attention Problems subscale correlated with internalizing conditions for girls but externalizing ones for boys (Song et al., 1994), and that ASEBA had trouble distinguishing between girls' anxiety versus depression (Ebesutani et al., 2010). An early version of CBCL showed poor factor model fit for African American families (Jastrowski Mano et al., 2009) with a mismatch between parent-reported problems and CBCL's list of problem items (Lambert et al., 2002). These issues are not mentioned in later studies and may have been fixed.
Discriminant validity is supported, although CBCL has higher specificity (true negatives) than sensitivity (true positives) (Rishel et al., 2005). Some early studies found correlations between subscales but they were still distinguishable (Dedrick et al., 1997; Nakamura et al., 2009), and recent research shows subscales are good to fair at distinguishing between the condition targeted by the subscale and other conditions (Ebesutani et al., 2010) although poor at more detailed distinctions such as type of anxiety disorder (Knepley et al., 2019), especially for internalizing conditions (Jensen et al., 1993).
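For readers less familiar with the two terms, the following minimal sketch shows how sensitivity and specificity are computed from a screening measure's agreement with a reference diagnosis. The counts are hypothetical and chosen only for illustration; they are not taken from Rishel et al. (2005) or any other cited study.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of children with the condition whom the measure flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of children without the condition whom the measure clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts for illustration only.
tp, fn = 30, 20   # children with the condition: flagged vs. missed
tn, fp = 90, 10   # children without the condition: cleared vs. wrongly flagged

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
```

A measure with this profile rarely flags a child who does not have the condition but misses a sizable share of children who do, which is the pattern described above.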

Sample Diversity
Gender balance was even across most samples. Three samples were primarily White, three were ethnically mixed, and three focused on specific ethnic groups: African Americans (Jee et al., 2011), Chinese and Korean immigrants (Yu et al., 2016), and Latinos (Downs et al., 2012). The other four samples (3 studies) did not include information on race or ethnicity. Half the studies did not mention class/income (6 studies, 7 samples); the others were either mixed or predominately low-income families. One study focused on foster youth (Jee et al., 2011).
Two studies examined Spanish dominant families: Downs et al. (2012) compared English and Spanish-speaking preschoolers, while a quarter of Hill and Hughes' (2007) parent study sample were bilingual or limited English proficiency. 6 Only two studies recruited children receiving mental health treatment: psychiatric inpatients (Kovacs & Sharp, 2014) and youth in residential care (Mason et al., 2012). The rest used community samples, although two focused on children who were flagged as "at risk" based on SES (Downs et al., 2012) or low literacy (Hill & Hughes, 2007).
Care should be taken to use the culturally-normed cutoff scores available on the developer's webpage, particularly with immigrant families (Dickey & Blumberg, 2004; Downs et al., 2012), as three studies found cultural differences in the parent-rated Conduct Problems and Peer Problems subscales. In a cross-national study, American parents had several items correlate more strongly with the Hyperactivity and Emotional Problems subscales compared to British parents, suggesting cultural differences in interpretations of child behavior (Dickey & Blumberg, 2004) or measure responses. Yu et al. (2016) found low reliability for Chinese and Korean immigrant parents, and low convergent and discriminant validity on the Hyperactive/Inattentive scale. Finally, Downs et al. (2012) found the Emotional Problems subscale to be suitable for Spanish-speaking U.S. preschool boys, but inadequate for girls.
For foster youth, sensitivity was high compared to CHIPS (93%) when youth and foster parent reports were combined, but lower for each one alone (54% for youth, 71% for foster parents) (Jee et al., 2011).

Psychometric Properties
Internal consistency was high for Total Score and moderate to high for most subscales except for Peers and Conduct and the already mentioned cultural issues (Bourdon et al., 2005; Downs et al., 2012; Yu et al., 2016). SDQ showed moderate to high correlation with CBCL (Kovacs & Sharp, 2014) over time (Mason et al., 2012) and strong correlation with reports of service use (Bourdon et al., 2005) and other variables known to be predictive of mental health problems, e.g. low SES (Bourdon et al., 2005).

Child and Adolescent Functional Assessment Scale-CAFAS (n = 4 Articles; 3 Samples)
Sample Diversity
Murphy et al. (1999) recruited a predominately Hispanic community sample of English and Spanish speaking low-income pre-school children enrolled in a Head Start program. 7 Hodges and Wong (1996) recruited predominately White children of army personnel who had been referred for mental health services; they also compared ratings for fictional written patient vignettes. Bates et al. (2006) collected clinician and student ratings of individual question items. Francis et al. (2012) compared CAFAS and GAF (the adult version of CGAS) for a multi-ethnic sample of adolescents referred for mental health evaluation, but did not present any conclusions regarding the accuracy of either measure.

Psychometric Properties
Inter-rater reliability was moderate to high for written vignettes for the school-age CAFAS (Hodges & Wong, 1996) and for parental reports for the preschool PECFAS (Murphy et al., 1999). There is strong correlation with reported problematic behaviors, poor academic performance, and teacher rating of psychosocial problems (Hodges & Wong, 1996; Murphy et al., 1999). The only construct validity tests were based on graduate student ratings of individual items (Bates et al., 2006). In a comparison with CAFAS, GAF identified roughly equal proportions of functional impairment for youth with externalizing versus internalizing diagnoses, while CAFAS identified twice as many externalizing cases compared to internalizing ones (Francis et al., 2012).

Sample Diversity
One sample was all male, two were at least two thirds male, and two were gender balanced. Two studies (3 samples) included information on ethnicity: Francis et al. (2012) used a multi-ethnic sample to compare the older GAF version with CAFAS (see above) and Bird et al. (1987b) recruited clinic-referred and community samples of Spanish-speaking Puerto Rican children. Two studies used predominately working-class samples; the others did not list such information. Green et al. (1994) extracted ratings from clinical records of psychiatric inpatients; they found different patterns from other studies (see below).

Psychometric Properties
CGAS showed consistently high inter-rater reliability (Bird et al., 1987b; Shaffer et al., 1983) even among raters with different types of experience, such as psychiatrists versus nurses (Green et al., 1994). It also showed high discriminant validity between clinic and community populations (Bird et al., 1987b), and between inpatients and outpatients (Shaffer et al., 1983). While Bird et al. (1987b) found high concurrent validity with CBCL, Green et al. (1994) found no significant correlation between CGAS rating and symptomatology measures (the CBCL Behavior Problems subscale) but did find strong correlations with measures of functioning (e.g. the CBCL Activities and School subscales, WISC-R IQ scale, and measures of social relatedness), and these were stronger for higher functioning children. Green and colleagues explain the discrepancy in results by arguing that their study used clinical records, in contrast to more controlled studies (such as Bird's), which either used homogeneous structured data (e.g. written vignettes or videos) or tested with raters who were not involved in the child's treatment. In both cases, study raters would be focusing on different features than would clinicians and staff in normal practice. Additionally, they add, CBCL is normed on nonclinical samples which, by definition, would have a broader range of scores than the psychiatric inpatient samples that Green's group looked at. Francis et al. (2012) also found discrepancies between GAF and CAFAS (discussed above) but did not investigate the cause.

Sample Diversity
The short form was tested with predominately Hispanic children; the other two studies were primarily (67-78%) White. Participants in one study were Medicaid recipients (Anderson et al., 2003); income status for the others was unknown. All studies focused on children receiving mental health care; there were no community comparison samples. It is worth noting that Alamdari and Kelber (2016) used Y-OQ to test concurrent validity; as we will discuss below, Y-OQ has not been adequately tested with non-White children.

Psychometric Properties
According to Rosanbalm et al. (2016), CANS was not designed using a psychometric approach and was originally intended to be used only for individuals, not aggregated.
Three different variations of CANS were tested: a mental health scale (Anderson et al., 2003), a short form (Alamdari & Kelber, 2016), and a trauma screener (Kisiel et al., 2018). Anderson et al. (2003) found high interrater reliability between caseworkers and researchers on the mental health scale but did not address validity. The short form showed good concurrent validity with Y-OQ's somatic and behavior dysfunction subscales (Alamdari & Kelber, 2016), while the trauma screener showed good convergent and discriminant validity when compared to TSCC-A and CBCL (Kisiel et al., 2018).

Sample Diversity
There was little information on sample characteristics: three of the five community samples were 80% White, two were also predominately middle class. There was no information on ethnicity or class for the others. One community sample was over two thirds female, and four clinic samples from the same study were predominately male; the rest were all equally balanced.

Psychometric Properties
Y-OQ showed high internal consistency across community and clinical samples for Total Score, and moderate to high for subscale scores, as well as moderate to high correlation with CBCL, the Conners Parent Rating Scale (Burlingame et al., 2001), and YSR (Ridge et al., 2009). Burlingame et al. (2001) tested discriminant validity via a sensitivity analysis using a combined sample recruited from clinics, schools, and the general community. Their results suggest Y-OQ can differentiate non-clinical, outpatient, and inpatient children. Test-retest reliability was high when tested on non-clinical samples in all three studies.

Sample Diversity
Both studies provided sparse information: Dowell and Ogles (2008) (2 samples) recruited predominately Caucasian families, while Ogles et al. (2001) (6 samples) provided no information on race or ethnicity. There was no information on SES or class. Three samples were predominately male, three were balanced, two did not include gender information.

Psychometric Properties
Internal consistency and test-retest reliability were high (Dowell & Ogles, 2008; Ogles et al., 2001), with moderate to high correlations with CBCL (parent), Vanderbilt Functioning Index (parent and caseworker), YSR (youth), and with CAFAS, CGAS and the Progress Evaluation Scales (all caseworker reports) (Ogles et al., 2001). CAFAS and CGAS both have fairly little testing with non-White children (CAFAS with Latino preschoolers and CGAS only with Puerto Ricans); both are discussed elsewhere in this paper. Moderate correlation with the BASC was found for a community sample, but no statistically significant correlation for the corresponding service client sample (Dowell & Ogles, 2008).

Sample Diversity
Community participants were given anonymous packets to fill out and return by mail; no details about ethnicity or class were collected. There were no studies testing TOP with children receiving mental health care.

Psychometric Properties
TOP's subscales showed moderate to high correlations with the equivalent CBCL and SDQ subscales; no information on reliability was reported.

Clinical Global Impressions Scale-CGI (n = 0 Articles)
There were no studies testing CGI on U.S. child populations, even though it is in use as a child mental health measure (it appeared in 8 studies in our exploratory scan) and has been used as a benchmark to test other measures developed for children (e.g. the Obsessive-Compulsive Inventory-Child Version, see McGuire et al., 2019). This gap in testing may be partly because CGI was originally developed for schizophrenia (Guy, 1976) and only later adopted as a general measure of functioning. Regardless of the reason, we mention it in this review because its reliability and validity need to be examined with child populations if it is to be used as a general mental health measure.

Extracting Data from Clinical Records to Monitor Quality of Care
In the exploratory literature scan, at least 12 of 57 studies pulled data from existing clinical records. While these records confirm feasibility of the measures' use in usual care practice for community-based outpatient settings, they also reveal that irregular follow-up intervals are common, for example administering only at intake and discharge. This can complicate comparing patients or aggregating results, and can also make it difficult to analyze effectiveness of care for children who have dropped out mid-treatment. Archival records therefore may not be sufficient for state or county level quality monitoring.

Measures that were Used as Benchmarks for Psychometrics Testing
Across the 73 psychometrics articles, five of the candidate measures (ASEBA, CGAS, CAFAS, SDQ and Y-OQ) were used as benchmarks in psychometrics tests of other measures, another indicator of their popularity. The ASEBA package appeared in 15 articles and was used as a benchmark for all the other measures except CANS. CGAS appeared in 8 articles to test CBCL, Ohio, and PSC. The other three each appeared once: CAFAS for Ohio, SDQ for TOP, and Y-OQ for CANS. 8 All except Y-OQ have some testing with minority (African American or Latino) families, as well as with low income or working-class families.

Breadth and Limitations of Psychometric Testing by Sociodemographic Characteristics
Examining the entire corpus of psychometric studies reveals systematic gaps in the populations with whom the measures are being tested, namely: a tendency toward male-dominated samples in clinic populations; lack of representation of Asian American, Pacific Islander, and Native American children; and lack of examination of differences across social class or SES. For information on the status of each measure, see Table 4.

Testing by Gender
Almost all study samples 9 (83 of 89; 93.3%) included participant gender. Of these, over half (48; 57.8%) had a roughly even gender balance, 10 while over a third (31; 37.3%) were all or mostly male, including five all-male samples. In contrast, only four samples (4.8%) were mostly female. Nearly two thirds (32 of 52; 61.5%) of the community (i.e. "control") samples were gender balanced; however, among samples recruited from clinics or other contexts where children had been referred for emotional or behavioral problems, over half (20 of 37; 54%) were male-dominant.
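The gender-balance categories used in this section can be expressed as a simple rule. The following Python sketch is illustrative only: it takes the review's definition of "roughly balanced" (no more than 55% of a single gender, with missing data counted in the total) and assumes, hypothetically, that any other sample is labeled by whichever gender is larger. The function name and the example counts are invented for illustration and are not part of the original analysis.

```python
from collections import Counter


def classify_gender_balance(n_male, n_female, n_missing=0, threshold=0.55):
    """Label a study sample by its gender composition.

    threshold=0.55 follows the review's definition of "roughly balanced"
    (no more than 55% of a single gender). Counting missing data in the
    denominator, and the two "mostly ..." labels, are assumptions.
    """
    total = n_male + n_female + n_missing
    if total <= 0:
        raise ValueError("sample must contain at least one participant")
    male_share = n_male / total
    female_share = n_female / total
    if max(male_share, female_share) <= threshold:
        return "roughly balanced"
    return "mostly male" if male_share > female_share else "mostly female"


# Hypothetical samples: (male, female, missing). These are NOT data from the review.
samples = [(50, 50, 0), (70, 30, 0), (40, 10, 0), (20, 80, 0), (52, 44, 4)]
tally = Counter(classify_gender_balance(m, f, x) for m, f, x in samples)
for label, count in tally.items():
    print(f"{label}: {count} of {len(samples)} samples")
```

Applying a single explicit rule of this kind makes the categories reproducible across studies, though a different threshold or a different treatment of missing data would shift some samples between categories.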

Testing by Race and Ethnicity
While only two thirds of samples tracked participant race or ethnicity (61 of 89; 68.5%), most published studies (58 of 72; 11 80%) provided information for at least one sample. Of the 61 samples for which information was provided, nearly half (27; 44.3%) were predominately White, a third (19; 31.5%) were predominately non-White and a quarter (15; 24.5%) were mixed.

8 Two additional studies compared measures without using one as a benchmark for the other: Francis et al. (2012) compared CAFAS and GAF, while Sheldrick et al. (2015) compared SDQ and ASEBA.
9 Several studies included multiple different participant samples with different demographics. See Table 3 for further details.
10 "Roughly balanced" was defined as no more than 55% of a single gender (including any missing data).
11 One study did not conduct any tests with families (Bates et al., 2006).

The most frequently listed non-White categories were African American or Black, Hispanic or Latino (sometimes as separate classifications), and mixed-race. In contrast, Asian Americans, Pacific Islanders, and Native Americans were not even mentioned in many of the studies. Only three studies had samples with more than 10% of A/PI-heritage participants:

Testing by Class and Socioeconomic Status
Half of the study samples (45 of 89), and nearly half of the published studies (31 of 72, 43%), had no information on participants' socioeconomic status, class, or household income. Of the 44 samples that did, almost half focused on working class or low-income demographics (21, 47.7%; or 23.6% of total corpus) while over a quarter used mixed-class samples (12, 27.3%; or 13.4% of total corpus). This combination of facts raises some concerns of possible selection or sampling bias, i.e. that the researchers who include information on social class tend to be those who are intentionally recruiting low-income participants or creating mixed class samples. Studies that do not report such information may be more class-homogeneous or less representative, particularly in older studies as health research has suffered from a lack of data collection on economic status or social class (see e.g. Krieger & Fee, 1994). Of course, it is impossible to say for certain since we only have access to what is in the published articles, but caution should be taken when generalizing from these studies.

Testing by Linguistic and Cultural Diversity
The majority (83%) of the 72 studies used the original English version of the measure. Eleven studies (for 5 measures) included Spanish-speaking families, and one study focused on Chinese and Korean-speaking immigrant parents. Table 4 presents the breakdown by measure, and the individual studies are cited in Table 3. For this review, we did not examine studies that compared how well a translation matched the original English-language measure because these would not provide information about the accuracy and quality of the measure itself. However, we encourage clinicians working with limited English proficiency families to seek out such studies (e.g. Stolk et al.'s, 2017 literature review

Other Participant Characteristics
Most of the measures were tested on both community samples and samples of children diagnosed with, or receiving care for, a mental or behavioral health condition. The exceptions were CANS, which was only tested on mental health care clients; TOP, which was only tested on an anonymous community sample; and CGI, which, as mentioned, was not tested on children at all. Some studies examined more specific populations such as special needs adopted children (Tyson et al., 2011) or children of military personnel (Hodges & Wong, 1996); details on these can be found in Table 3.

Discussion
This discussion will address two issues that emerged in our study: the specific absence of Native American and Asian American families in psychometrics testing, and the more general lack of diversity testing for several measures in popular use. We will examine how cultural and linguistic differences (including those emerging from racial or ethnic, class, or geographic differences) can lead to errors on standardized measures, discuss the pros and cons of possible solutions (including norming of cutoff scores), and provide two examples of measures that were intentionally designed for Native American youth.

Degree of Testing of Psychometric Properties Across Diverse Populations
As discussed above, published evidence supporting reliability and validity was found for all measures except CGI; however, the number of studies and the diversity of participants varied widely. The number of publications alone does not indicate a measure's quality or the thoroughness of its testing. Once the initial validity and reliability tests have been published, there is little incentive to publish replicated results unless a significantly different new version is released or unless the author can provide something new that was not in the original publication, such as applying it in a different setting or with a different population. 12 While the two measures with the most publications (ASEBA and PSC, with over 20 studies each) did have testing across a more diverse population, overall breadth of testing is more important than number of articles, and here many of the measures fell short.
Three measures (PSC, SDQ and ASEBA) were tested across diverse genders, across SES or class, and with both Latino and African American as well as White children; however, none were adequately tested with Native American or US-born Asian American children. Of these three measures, only ASEBA and PSC had breadth of testing across both community populations and diagnosed (or in-care) children. Table 4 presents a comparison of gaps in testing for each measure. The popularity of CANS in particular (used in over half of California counties, see Pourat et al., 2016a) suggests that in some cases measures might be used in clinical practice without knowing whether they are suitable for the client population.

Underrepresented Populations: Asian Americans, Native Americans, and Immigrants
Information on Asian Americans, Pacific Islanders, and Native Americans was heavily lacking for all measures. Only three studies had over 15% Asian-identifying participants; two did not provide any data about ethnic differences, and the third (Yu et al., 2016) focused on Chinese and Korean immigrants, not US-born Asian Americans. Native Americans were not even examined.
These absences are problematic as both populations are highly vulnerable. American Indian/Alaska Native adolescents have the highest rates of depression of all ethnic groups (American Psychiatric Association, 2017). Both Native American and Asian American adolescents have rates of suicide ideation and suicide attempts that exceed those of White adolescents (US Dept. of Health & Human Services: Office of Minority Health, 2018a, b). There is also evidence of differences in symptom reports for parents of Asian and Pacific Islander ethnicity compared to non-minority parents (Okamura et al., 2016) even when child self-reports are similar, indicating a need for measurement tools tailored to these groups.
Immigrant families were also under-represented. There were a few studies involving Mexican immigrants or parents who spoke Spanish, but only one (Yu et al., 2016) with immigrants from other language groups. Currently more than one in ten U.S. inhabitants (12.9%) and over a quarter of California inhabitants (27.2%) are foreign born, and that number is predicted to rise in the future (Trevelyan et al., 2016; U.S. Census Bureau, 2016); this population therefore needs further attention.

Lack of Evidence Does Not Equal Low Quality
A brief caveat: the lack of robust testing does not indicate a measure's lack of suitability for diverse client populations, simply that further testing is required. We encourage interested researchers to fill in the gaps discussed here, and we encourage developers to broadly pilot their measures across gender, race and ethnicity, and social class. In the next section, we briefly highlight ways a standardized measure may fall short when administered with populations for whom it was not designed.

How Culture, Class, or Language Impact the Accuracy of Standardized Measures
It is important for care providers to be aware of how parents' background, upbringing, and even language, 13 can impact their responses on standardized measures and interviews, and how unintentional misreporting or miscommunication about a child's status can cause outsized effects on children and families. Cultural and linguistic differences are not limited to immigrant parents; different cultural values and dialects can be found across classes, racial or ethnic groups, and even geographic communities within the U.S.
Misdiagnoses ("false positives") due to unclear reporting put financial burden and emotional distress on a family in addition to lost time (Au-Yeung et al., 2018; Baker & Bell, 1999). Failure to diagnose a problem ("false negatives"), for example because questions were not understood or symptoms were not described in a way that the clinician recognized (Bailey et al., 2009; Shen et al., 2018), can also lead to inappropriate treatment, or even denial of services if a standardized diagnostic tool does not accurately show a child's level of need.
To give a severe example, measurement tools that assume Euro-American childrearing practices have led to First Nations parents being judged unfit (and children removed from their homes), because they are designed to assess a two-parent nuclear family rather than an extended family of multiple primary caregivers (Choate & McKenzie, 2015).
The importance of accurate reporting becomes heightened as primary care providers are being given increased responsibility in diagnosing, referring, and sometimes even treating mental illness issues themselves (Glazier et al., 2015), especially with low income families (Hodgkinson et al., 2017). While pediatricians are adapting to meet this need (Foy et al., 2019), they do not have the same level of training as a specialist and must rely more heavily on standardized measurement tools.

Specific Issues When Administering Measures
Details of racial, ethnic and cultural bias in test design have been extensively discussed elsewhere (see e.g. Reynolds et al., 2021 for a historical overview) and there is also a body of literature on how to develop and adapt measures to different patient populations.
Here we would like to briefly point out some specific misreport issues, so that providers can be aware of these and take them into account when administering standardized measures:
Parents may have different benchmarks of severity when asked to rank a behavior as "problematic" or "burdensome." Their ratings may be affected by cultural attitudes toward certain behavioral problems (Heiervang et al., 2008) or by the level and type of caregiving support available. In such cases, a teacher report may also be of assistance.
Parent/youth may be unfamiliar with the questionnaire structure. Parents and youth of different cultural backgrounds or education levels may not understand or be comfortable with certain types of standardized questions, such as rating an emotion on a numerical scale (Lee et al., 2002) or grid-formatted questions (Ware et al., 1995). Some red flags to check for are skipped answers, especially if they are all the same type of question (Ware et al., 1995), or whether a parent often marks the "neutral" midpoint (e.g. 3 on a scale of 1-5) when asked for level of agreement/disagreement or how positively/negatively they feel about something (Lee et al., 2002).
Differences in word nuance or meaning can occur when a translation does not convey the nuance of the original language, as well as when parent and clinician (or measure developer) speak different geographic or class dialects (Epstein et al., 2015; Leplège et al., 1998). Similarly, the same diagnostic term may have different presentations in different cultures (Haroz et al., 2017; Jani & Deforge, 2014), and folk illness categories may not translate neatly into biomedical illness categories (Flores et al., 2002).
The diagnostic tool may not ask about (or code for) the parent's concerns. Connected to the previous point, social environment and community values affect children's symptom presentation and what behaviors parents consider problematic. For example, Lambert et al. (2002) (discussed in the ASEBA section above) listed over 20 concerns that appeared in clinical case notes with African American parents but could not be coded in CBCL's schema. They hypothesized that many of these reflect the African American community's higher valuing of community support, mutual respect, and education.
Parent report may differ from child's self-report. ASEBA, SDQ, Ohio and Y-OQ all have a child version (typically for ages 11+) to supplement the parent report. While it is unsurprising that parent/teacher reports may differ from self-reports, the level of discrepancy can vary across ethnicity and across class (Ha et al., 1998; Okamura et al., 2016).
Clinician comprehension and unconscious bias will affect interpretation of parent concerns. Over half of the measures in our review 14 were designed to be completed by a clinician or social worker rather than the parent. In such cases, the accuracy of the tool is dependent on the provider being able to understand parental concerns and translate them into the standardized categories and ratings on the form. Many of the parent-report issues we discussed above also come into play in interviews with providers: perception of behavior severity, different word nuances, even parents' level of familiarity with clinical interviews and how to describe mental health symptoms (Probst et al., 2007).
Clinicians' own biases also affect the accuracy of the report. Even well-intentioned providers may hold false beliefs or unconscious stereotypes about racial or ethnic groups (Bailey et al., 2019; Hoffman et al., 2016). Such beliefs may, on the one hand, cause providers to rank an item as less/more severe, or a symptom as less/more abnormal, for similarly presenting patients of different races. On the other hand, unawareness of real differences in cultural, social, and religious upbringing may hinder providers' ability to recognize key symptoms when described in non-mainstream ways. For example, folk explanations involving blood temperature (Elderkin-Thompson et al., 2001) or possession by evil spirits (Malgady et al., 1987) may be perceived as irrelevant or, even worse, as symptoms of psychosis.

Addressing Clinician Bias and Developing Cultural Awareness
Providing culturally appropriate care goes beyond merely avoiding racist or classist stereotypes. Because biases and gaps in cultural knowledge are often unconscious, care providers must actively work at being "anti-racist" and at educating themselves about social differences (Cénat, 2020). Here are some helpful starting points for clinicians and social workers interested in this topic: Cénat (2020) gives a timely guide on how to address the needs of Black patients within the larger social context of racial violence and police brutality. McGregor et al.'s (1998) holistic approach to Hawai'ian health assessment incorporates family, community, spirituality, and relationship to the local ecology. Thyer (2015) critiques the DSM while McQuaide (1999) discusses its pros and cons in social work. Finally, Rohe (1985) reveals the impact of urban planning: how the physical layout of a city or community can benefit or impair its residents' mental health.

Norms, Cutoffs, and Item Weighting
A common fix is to adjust clinical cutoffs, or even individual item weights, in accordance with the baseline of a community. For example, if cultural ideas of appropriate child behavior lead to consistently higher/lower ratings of impulsiveness or aggression, these items could be re-weighted in the final score.
However, there is considerable danger in this solution. Using race- or class-based scoring sheets without understanding the underlying reasons for the reported differences runs the risk of reifying harmful stereotypes about populations (e.g. as innately more aggressive or with lower self-control) rather than understanding and addressing real social inequalities (Gasquoine, 2009).
That said, SDQ and ASEBA both have publicly available data on mean scores for community populations in various countries. ASEBA's website uses three "norm groups," (with some countries falling into different groups for different questionnaires e.g. CBCL versus YSR). It also has a Module with Multicultural Options (MMO) supplement that allows clinicians to compare a child's score with the grading scales for multiple countries (e.g. looking at both the immigrant parent's home country and the family's country of residence). SDQ's webpage lists means by age and gender, as well as a searchable database of published research studies.
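To make the idea of comparing one child's score against several reference groups concrete, the following Python sketch standardizes a raw score against each group's community mean and standard deviation. It is an illustration of the arithmetic only, not the scoring procedure of ASEBA, its MMO supplement, or SDQ, each of which has its own published method. The `NormGroup` class, the function names, and all means, standard deviations, and scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormGroup:
    """Community reference values for one population (hypothetical)."""
    name: str
    mean: float
    sd: float


def z_score(raw_score, norm):
    """Distance of a raw score from a group's mean, in standard deviations."""
    if norm.sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw_score - norm.mean) / norm.sd


def compare_across_norms(raw_score, norms):
    """Return the standardized score under each reference group."""
    return {norm.name: round(z_score(raw_score, norm), 2) for norm in norms}


# Invented values for illustration; NOT taken from any published norm table.
norms = [
    NormGroup("country of residence (hypothetical)", mean=20.0, sd=8.0),
    NormGroup("parent's home country (hypothetical)", mean=28.0, sd=8.0),
]
for group, z in compare_across_norms(36, norms).items():
    print(f"{group}: z = {z}")
```

In this invented case the same raw score sits two standard deviations above one group's mean and one above the other's, which shows why the choice of reference group can change whether a score crosses a clinical cutoff. As cautioned above, such a comparison describes where a score falls relative to a group and does not explain why group baselines differ.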

Using Internationally Developed Questionnaires for Immigrant Families
Providers working with immigrant families may be asking whether they should look for measures validated in the family's country of origin. Unfortunately, this can be equally problematic, as immigrant children are not raised in the same social environment as their cousins abroad. For example, Shen et al. (2017) developed and piloted a set of scales to measure anxiety, depression, and school problems in Chinese middle school students, but because their tool is designed for children immersed in China's intense examination-oriented education system, it is not useable for Chinese-immigrant adolescents attending U.S. schools.

Creating Mental Health Measures for Minority Youth: Two Examples for Native American Youth
Finally, we offer two examples of mental health measures that were designed for, and tested with, Native American and First Nations youth. While these do not fit our original criteria (one is outside our age range, the other does not provide an overall mental health score), they may be of use as supplements to clinicians. Additionally, the details of their development may serve as models for researchers.
The Life Trajectory Interview for Youth was designed to "address gaps in our understanding of the links between large-scale structural conditions and social processes and individual outcomes such as mental health." It was developed and pilot tested on a 60% Anglo/40% Cherokee sample of young adults (aged 19-24) living in western North Carolina. The measure covers four domains: life course milestones, life course barriers, social affordances, and material goods (Brown et al., 2006).
The Cultural Connectedness Scale (CCS) was designed for and piloted with Canadian First Nations, Métis, and Inuit youth (Snowshoe et al., 2015). Although not itself a mental health scale, its developers showed a link between cultural connectedness and mental wellness for these populations (Snowshoe et al., 2017).

Use of Appropriate Benchmarks When Testing or Developing Measures
As noted earlier, several of the measures we reviewed (particularly CBCL) were used to test validity or correlations in psychometric tests of other measures. CBCL was the most widely used and it is a well-established and broadly tested measure. While we have discussed the importance of using diverse and representative study samples, it is also important for measure developers and researchers to beware of hidden biases or disparities within the benchmarks being used with these populations. For example, several of the studies reviewed here tested for correlations between a measure's report of behavioral problems and poor academic performance (e.g. Hill & Hughes, 2007; Hodges & Wong, 1996) or school disciplinary referrals (e.g. Hill & Hughes, 2007; Hodges & Wong, 1996; Hogez & McKay, 1986; Murphy et al., 1999; Nelson et al., 2002; Reed & Edelbrock, 1983). Children of low SES families and minority children are at disproportionate risk of both of these, particularly African American boys (Heath, 1982; Skiba et al., 2002).

Scope and Limitations of this Review
Because the original goal of this project was to find candidate measures suitable for use in California DHCS outpatient child mental health care, we narrowed the scope of our review in the following ways: (1) The initial five-year exploratory search focused on English-language articles and only public or community clinics; this may have underrepresented some measures' popularity internationally or in other systems of care. (2) We only examined psychometrics tests on U.S. populations and (3) excluded those focusing on treatments or conditions which would not be covered in DHCS outpatient settings (e.g. substance abuse or skills training for autism). (4) Because we were interested in the reliability and validity of the measure itself, we did not examine studies that only compared the fidelity of a translation to the English version, but we did include non-English translations that were tested against independent metrics.
As noted, sample representativeness was examined for gender, race and ethnicity, and class/SES, as these were the three variables reported in most studies. While other participant characteristics such as adoption or type of family (e.g. foster families or same-sex parents) were not systematically compared, they are listed in Table 3 and in the sections focusing on individual measures.

Conclusion
By laying out the available and missing information about these ten clinical outcome measures, our goal has been to assist practitioners, researchers, and legislators in selecting appropriate standardized measures that have been tested and normed on samples that resemble their own client populations. In the process, we discovered that some popular measures lacked breadth of testing on diverse patient and community populations. Therefore, a second emerging goal of this paper has been to give clinicians insight into how cultural and linguistic differences (including those between racial, ethnic, and class groups) can impact measurement reports.
Testing with Asian American, Pacific Islander, and Native American families should be a high priority, as well as comparisons across classes. Further testing with immigrant families of various backgrounds is also needed. Because measures such as CANS, Y-OQ, and Ohio are already in wide clinical use, patient records may yield a good source of data, although we have discussed some of the caveats of using them.
Finally, researchers who are in the process of creating or adapting their own measures are encouraged to include these under-examined populations starting from the earliest stages of development and pilot testing.