Journal of Abnormal Child Psychology, Volume 38, Issue 8, pp 1179–1191

When to Use Broader Internalising and Externalising Subscales Instead of the Hypothesised Five Subscales on the Strengths and Difficulties Questionnaire (SDQ): Data from British Parents, Teachers and Children

Authors

  • A. Goodman
    • Department of Epidemiology and Population Health, London School of Hygiene & Tropical Medicine
  • Donna L. Lamping
    • Health Services Research Unit, London School of Hygiene & Tropical Medicine
  • George B. Ploubidis
    • Department of Epidemiology and Population Health, London School of Hygiene & Tropical Medicine

DOI: 10.1007/s10802-010-9434-x

Cite this article as:
Goodman, A., Lamping, D.L. & Ploubidis, G.B. J Abnorm Child Psychol (2010) 38: 1179. doi:10.1007/s10802-010-9434-x

Abstract

The Strengths and Difficulties Questionnaire (SDQ) is a widely used child mental health questionnaire with five hypothesised subscales. There is theoretical and preliminary empirical support for combining the SDQ’s hypothesised emotional and peer subscales into an ‘internalizing’ subscale and the hypothesised behavioral and hyperactivity subscales into an ‘externalizing’ subscale (alongside the fifth prosocial subscale). We examine this using parent, teacher and youth SDQ data from a representative sample of 5–16 year olds in Britain (N = 18,222). Factor analyses generally supported second-order internalizing and externalizing factors, and the internalizing and externalizing subscales showed good convergent and discriminant validity across informants and with respect to clinical disorder. By contrast, discriminant validity was poorer between the emotional and peer subscales and between the behavioral, hyperactivity and prosocial subscales. This applied particularly to children with low scores on those subscales. We conclude that there are advantages to using the broader internalizing and externalizing SDQ subscales for analyses in low-risk samples, while retaining all five subscales when screening for disorder.

Keywords

Strengths and Difficulties Questionnaire; Factor structure; Construct validity; Britain; Internalizing problems; Externalizing problems

Introduction

The Strengths and Difficulties Questionnaire (SDQ) is one of the most widely used brief questionnaires for assessing child mental health problems. In the decade since its development, it has been used in low-, middle- and high-income settings around the world (reviewed in Achenbach et al. 2008; Woerner et al. 2004b). The SDQ can be completed by parents and teachers of children aged 4–16 and by youth aged 11–16.

The SDQ consists of 25 items covering five subscales relating to emotional problems, peer problems, behavioral problems, hyperactivity and prosocial behavior (Goodman 1997). The SDQ total difficulties score, which is the sum of the emotional, peer, behavioral and hyperactivity subscales, has been found to be a psychometrically sound measure of overall child mental health problems in studies from around the world (Achenbach et al. 2008; Goodman and Goodman 2009; Goodman 1997, 1999; Goodman et al. 2000b; Goodman and Scott 1999; Klasen et al. 2000; Mullick and Goodman 2001). This includes evidence that the total difficulties score is correlated with existing questionnaire and interview measures, differentiates clinic and community samples, and is associated with increasing rates of clinician-rated diagnoses of child mental disorder across its full range.

Nevertheless, the internal structure of the SDQ is one area where there is ongoing controversy. The SDQ items and subscales were developed with reference to the main nosological categories recognised by contemporary classification systems of child mental disorders such as the Diagnostic and Statistical Manual of Mental Disorders, 4th edition (DSM-IV; American Psychiatric Association 1994). The five subscales were then refined through exploratory factor analyses (EFAs: Goodman 1997), and have since been supported by EFAs in multiple samples from across Europe (Becker et al. 2006; Goodman 2001; Smedje et al. 1999; Woerner et al. 2004a). Yet EFAs are an exploratory technique, primarily useful in suggesting possible factor structures when these are not known. When a hypothesised factor structure exists, it is more appropriate to use a model-based framework such as confirmatory factor analysis (CFA: Brown 2006).

The CFAs which have been carried out provide at best mixed support for the SDQ’s five-factor structure. CFAs in Norway (youth SDQ) and Australia (parent, teacher and youth SDQ) found that models based on the hypothesised five factors did not show acceptable model fit on some or all indices considered (Mellor and Stokes 2007; Ronning et al. 2004). Other CFAs in Belgium (parent and teacher SDQ) and Russia (youth SDQ) do report adequate global fit, but also note that loadings on several items were unacceptably low (<0.4) (Ruchkin et al. 2007; Van Leeuwen et al. 2006).

This problematic evidence from CFAs suggests the possible value of considering alternative factor structures. One alternative, which can be justified on theoretical grounds, would combine the emotional and peer items into an ‘internalizing’ subscale and the behavioral and hyperactivity items into an ‘externalizing’ subscale. This approach receives some support from exploratory analyses: approximately internalizing/externalizing/prosocial factor structures have been reported in three-factor EFAs from the US (parent SDQ), Belgium (parent and teacher SDQ) and Finland (youth SDQ) (Dickey and Blumberg 2004; Koskelainen et al. 2001; Van Leeuwen et al. 2006). A first-order model based on this three-factor solution showed adequate fit to the data in a CFA in the US sample, although the authors do not present CFAs of the five-factor solution for comparison (Dickey and Blumberg 2004). By contrast, in the Belgian sample the three-factor solution did not achieve acceptable fit in a CFA and showed poorer fit than the five-factor model (Van Leeuwen et al. 2006).

These analyses therefore suggest that internalizing and externalizing factors may form part of the factor structure of the SDQ, but are not conclusive and do not investigate this issue in detail. There has also been no evaluation of other aspects of the construct validity of these theoretically plausible internalizing and externalizing subscales. Indeed, even for the five hypothesised SDQ subscales, almost all investigations of construct validity start and end with factor analyses such as those cited above. Far less use has been made of alternative approaches such as assessing convergent and discriminant validity—that is, the extent to which different subscales tap into distinct aspects of child mental health. Nevertheless such analyses have the potential to be highly informative in clarifying whether, or under what circumstances, these SDQ subscales are valid for use as screening devices for clinical disorder or as explanatory or outcome variables in epidemiological studies.

In this paper, we therefore compare different models whereby the hypothesised internalizing and externalizing subscales could form part of the factor structure of the SDQ. We then evaluate the convergent/discriminant validity of the internalizing/externalizing SDQ subscales, and compare their performance with the hypothesised five subscales. The two British surveys we use (the British Child and Adolescent Mental Health Surveys of 1999 and 2004) have not previously been used for these purposes, although other psychometric analyses (e.g. Cronbach’s alpha, principal component analyses) have been published for the earlier survey (Goodman 2001).

Methods

Description of Sample

The British Child and Adolescent Mental Health Surveys (B-CAMHS) were two nationally representative surveys conducted in England, Scotland and Wales in 1999 and 2004. Children aged 5–15 years were sampled in B-CAMHS99 and 5–16 years in B-CAMHS04, using the British Child Benefit Register as a sampling frame; full details have been published elsewhere (Green et al. 2005; Meltzer et al. 2000). Between the two B-CAMHS surveys, 26,544 children and adolescents were selected and their principal caregivers (‘parents’) were approached for face-to-face interview. Of these, 18,415 (69%) participated, giving a sample which was 50.7% male with a mean age of 10.2 years. Parent SDQ data were available for 18,222 (99.0%) participants. With parental permission, teachers were also approached to participate (by postal questionnaire), as were the 11–16 year-olds themselves (by face-to-face interview). This resulted in SDQ data from 14,263 teachers (77.4% of participants) and 7,678 youth (91.9% of participants aged 11–16).

Both B-CAMHS surveys included a three-year follow-up. B-CAMHS99 followed up all children with a disorder at baseline and a third of children with no disorder at baseline (Meltzer et al. 2003). B-CAMHS04 followed up all children, regardless of disorder status at baseline (Parry-Langdon et al. 2008). In total, 11,222 children were selected for follow-up and 7,912 (70.5%) participated, giving a sample that was 51.7% male with a mean age of 13.2 years.

Description of Measures

All participating parents, teachers and children were administered the Strengths and Difficulties Questionnaire (SDQ). As described above, this is a 25-item questionnaire with five hypothesised subscales: emotional problems, peer problems, behavioral problems, hyperactivity and prosocial behavior (Goodman 1997, 2001). Each subscale comprises five questions with 3-point response scales (‘Not true’ = 0, ‘Somewhat true’ = 1, ‘Certainly true’ = 2), giving a subscale score range of 0–10. Ten of the 25 items are positively worded ‘strengths’; these are reverse-scored if they contribute to the emotional, peer, behavioral or hyperactivity subscales. In this paper, we also assess the construct validity of alternative ten-item ‘internalizing’ (emotional and peer items) and ‘externalizing’ (behavioral and hyperactivity items) subscales, each with a range of 0–20. Throughout this paper, we excluded the small number of SDQs which were missing one or more subscale scores (<0.4% for parent, teacher and youth SDQs).
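The scoring rules above can be sketched in code. The item codes below and the choice of which strengths items are reversed are illustrative placeholders (the real SDQ item wordings are not reproduced here); the arithmetic itself follows the text: five 0–2 items per subscale, reverse-scoring of strengths items on the difficulty subscales, and ten-item internalizing and externalizing sums.

```python
# Sketch of the SDQ scoring rules described above. Item codes are hypothetical.

SUBSCALES = {
    "emotional":     ["emo1", "emo2", "emo3", "emo4", "emo5"],
    "peer":          ["peer1", "peer2", "peer3", "peer4", "peer5"],
    "behavioral":    ["beh1", "beh2", "beh3", "beh4", "beh5"],
    "hyperactivity": ["hyp1", "hyp2", "hyp3", "hyp4", "hyp5"],
    "prosocial":     ["pro1", "pro2", "pro3", "pro4", "pro5"],
}

# Positively worded 'strengths' items on the difficulty subscales, which are
# reverse-scored (hypothetical item codes; prosocial items are never reversed).
REVERSED = {"peer2", "peer3", "beh2", "hyp4", "hyp5"}

def score_sdq(responses):
    """responses: item code -> 0 ('Not true'), 1 ('Somewhat true'), 2 ('Certainly true')."""
    scores = {}
    for subscale, items in SUBSCALES.items():
        scores[subscale] = sum(
            2 - responses[item] if item in REVERSED else responses[item]
            for item in items
        )                                                      # each subscale: 0-10
    scores["internalizing"] = scores["emotional"] + scores["peer"]            # 0-20
    scores["externalizing"] = scores["behavioral"] + scores["hyperactivity"]  # 0-20
    scores["total_difficulties"] = scores["internalizing"] + scores["externalizing"]  # 0-40
    return scores
```

The total difficulties score is then simply the internalizing plus externalizing sums, i.e. the four difficulty subscales without prosocial behavior.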

After completing the SDQ, all B-CAMHS participants completed the Development and Well-being Assessment (DAWBA). This is a detailed psychiatric interview administered by lay interviewers to parents and youth, with a briefer questionnaire for teachers (Goodman et al. 2000a). Each section of the DAWBA uses skip-rules, one component of which is the relevant SDQ subscale; for example, the hyperactivity SDQ subscale for the hyperactivity disorder section. Each section begins with structured questions that cover the operationalised diagnostic criteria for DSM-IV (American Psychiatric Association 1994). Structured questions are supplemented by open-ended questions which record verbatim a respondent’s own description of problem areas. Clinicians review the closed and open responses from all informants, identifying discrepancies within or between informants, and using the content, length and tone of the transcripts to interpret conflicting information (Meltzer et al. 2000). On this basis, raters decide whether a particular child meets all the relevant DSM-IV criteria for an operationalised mental disorder. Raters can also assign ‘Not Otherwise Specified’ disorder, for example ‘behavioral disorder, not otherwise specified’ when children have substantial impairment from symptoms which do not quite meet operationalised criteria. In this paper, we group the mental disorders into emotional disorders (including anxiety and depressive disorders); behavioral disorders (including oppositional defiant and conduct disorder); attention-deficit/hyperactivity disorder (ADHD); and autistic spectrum disorders (ASD: including autism and Asperger syndrome). In British samples (including B-CAMHS), the DAWBA has been shown to have good inter-rater reliability (e.g. kappa 0.86 for inter-rater agreement for ‘any mental disorder’ in an epidemiological sample (Ford et al. 2003)). 
It also has good validity as judged against case-note diagnoses, performs well in differentiating clinic and community samples, and shows strong associations with risk factors, service use and three-year prognosis (Ford et al. 2003; Goodman et al. 2000a; Meltzer et al. 2003).

Statistical Analyses

Factor Structure of the SDQ

We used confirmatory factor analysis (CFA) to evaluate and compare the relative fit of a number of alternative factor structures for the parent, teacher and youth baseline SDQs. As shown in Fig. 1, these were a first order model with the five hypothesised SDQ factors (Model A); a second order model with additional ‘internalizing’ and ‘externalizing’ factors (Model B); and a three-factor first order model in which internalizing and externalizing factors replaced the emotional, peer, behavioral and hyperactivity factors (Model C).
Fig. 1

Models used in Confirmatory Factor Analyses of the parent, teacher and youth SDQ

We performed the CFA in Mplus 5, using a multivariate probit analysis for ordinal data (Muthen 1983, 1984) and estimating model fit using the weighted least squares mean- and variance-adjusted (WLSMV) estimator. We follow common practice in reporting multiple indices of fit, namely the Comparative Fit Index (CFI), the Tucker-Lewis Index (TLI) and the Root Mean Square Error of Approximation (RMSEA) (Brown 2006; Hu and Bentler 1999). To consider a model as showing ‘acceptable’ fit, we required CFI > 0.90, TLI > 0.90 and RMSEA < 0.08; to consider a model as showing ‘good’ fit, we required CFI > 0.95, TLI > 0.95 and RMSEA < 0.06 (Brown 2006). Where models showed acceptable fit on some indices but not on others, we allowed correlations between the unique variances of some individual items within the same factor, selecting these item pairs using Mplus’ modification indices. Such minor model modifications can improve model fit by increasing the proportion of variance explained, but do not change the substantive conclusions regarding the adequacy of a hypothesised factor structure in describing a set of data (Bollen 1989).
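The fit cutoffs above can be encoded in a small helper. This merely restates the thresholds quoted from Brown (2006); it is not part of the authors' Mplus workflow.

```python
# Fit-index cutoffs from the text: 'good' requires CFI > 0.95, TLI > 0.95 and
# RMSEA < 0.06; 'acceptable' requires CFI > 0.90, TLI > 0.90 and RMSEA < 0.08.

def classify_fit(cfi, tli, rmsea):
    """Classify global model fit from CFI, TLI and RMSEA."""
    if cfi > 0.95 and tli > 0.95 and rmsea < 0.06:
        return "good"
    if cfi > 0.90 and tli > 0.90 and rmsea < 0.08:
        return "acceptable"
    return "not acceptable"
```

For instance, the unmodified parent Model A (CFI = 0.857, TLI = 0.934, RMSEA = 0.059) fails the CFI cutoff and so is not acceptable, while the minimally modified version (CFI = 0.901, TLI = 0.954, RMSEA = 0.049) just clears the ‘acceptable’ thresholds.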

Construct Validity of the SDQ Subscales Across Informants

Multitrait-multimethod (MTMM) analyses are a method for assessing the construct validity of a set of measures (Campbell and Fiske 1959; Nunnally and Bernstein 1994). MTMM analyses are based on a correlation matrix of multiple traits (e.g. the proposed SDQ subscales) measured by multiple methods (e.g. parent, teacher, youth), and assess construct validity through comparisons across informants. For example, the correlation between the parent and teacher behavioral subscales (a convergent correlation coefficient) would be expected to be higher than that between the parent behavioral and teacher hyperactivity subscales (a discriminant correlation coefficient). If this aspect of construct validity cannot be demonstrated, the behavioral and hyperactivity subscales may not be tapping into the same, distinct constructs across informants.

We performed the MTMM analyses using subscales created by summing the relevant items, rather than the latent variables created through factor analysis. We did this because we believe that most users of the SDQ will prefer these simple, transparent scores, and that it is therefore their convergent and discriminant validity that is most useful and relevant to present. We assessed correlations between the (ordered) SDQ subscales using Spearman’s correlations, calculated in Stata 10.2 and basing each correlation coefficient on all individuals with the relevant SDQ data. We also present the Cronbach alpha for each subscale as a measure of internal consistency.
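For concreteness, the two statistics used here can be sketched with generic textbook formulas; the paper's actual estimates were calculated in Stata, so these implementations are illustrative only.

```python
# Textbook sketches of Cronbach's alpha and Spearman's rank correlation.
from statistics import mean, variance

def cronbach_alpha(items):
    """items: one list per questionnaire item, each holding every respondent's score."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]   # per-respondent scale totals
    item_var = sum(variance(col) for col in items)     # sum of item variances
    return k / (k - 1) * (1 - item_var / variance(totals))

def _ranks(xs):
    """1-based ranks, averaging over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2 + 1               # mean of the tied positions
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of rank-transformed data."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

Computing Spearman's rho for every (trait, informant) pair of subscale scores yields the MTMM matrices reported in Tables 3 and 4.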

Construct Validity of the SDQ Subscales Relative to the DAWBA

MTMM analyses assess construct validity by comparing different informants. Comparing the SDQ and the DAWBA provides a further method of evaluating construct validity. The a priori prediction is that DAWBA diagnoses of emotional disorders should correlate most highly with the emotional SDQ subscale of the parent, teacher and youth SDQs; behavioral disorders with the behavioral subscale; ADHD with the hyperactivity subscale; and ASD with the peer and prosocial subscales. We performed a series of logistic regression analyses in Stata 10.2 on four outcomes: DAWBA diagnosis for any emotional disorder, any behavioral disorder, ADHD, or ASD. For the explanatory variables, we first used the five hypothesised SDQ subscales from the same informant. We then repeated these analyses using the three internalizing, externalizing and prosocial subscales. We reverse-scored the prosocial subscale for these analyses in order to facilitate comparisons of effect sizes across subscales.
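The logistic-regression step can be illustrated with a generic Newton-Raphson fit on invented data; the data, function names and iteration count are assumptions (the paper used Stata 10.2), but the reported odds ratios correspond to exp(coefficient) per one-point subscale increase, as shown here.

```python
# Illustrative logistic regression recovering an odds ratio per one-point
# increase in a predictor. Generic Newton-Raphson (IRLS) sketch on invented
# data, not the B-CAMHS analyses themselves.
import numpy as np

def logistic_fit(X, y, n_iter=25):
    """Fit logit P(y = 1) = X @ beta by Newton-Raphson; X includes an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        w = p * (1.0 - p)                     # IRLS weights
        grad = X.T @ (y - p)                  # score vector
        hess = X.T @ (X * w[:, None])         # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Invented data: a binary 'diagnosis' that becomes more likely as a 0-10
# 'subscale' score rises (not real SDQ/DAWBA data).
score = np.arange(11, dtype=float)
diag = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(score), score])
beta = logistic_fit(X, diag)
odds_ratio = float(np.exp(beta[1]))  # OR per one-point increase in the score
```

With several subscales entered together, as in the paper's models, each exponentiated coefficient is the odds ratio for that subscale adjusted for the others.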

Predicting baseline DAWBA diagnoses using baseline SDQ subscale scores is somewhat circular because the SDQ subscales form part of the skip rules for some DAWBA sections. High SDQ scores at baseline could therefore increase the probability of a DAWBA diagnosis at baseline simply by increasing the amount of mental health information collected. We therefore used DAWBA diagnoses at three-year follow-up, as these were administered and rated blind to SDQ score or DAWBA diagnosis at baseline. In doing so, we used weights to adjust for the fact that B-CAMHS99 did not seek to follow up all children but rather over-sampled children who had a disorder at baseline. We decided not to use the youth SDQ to predict ASD because only 10/71 children with a follow-up diagnosis of ASD completed youth SDQs at baseline, and these individuals may lack insight as informants.
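The weighting logic above amounts to inverse-probability weighting: children with a baseline disorder were all followed up (selection probability 1) while only a third of those without were, so each followed-up child can be weighted by the reciprocal of their selection probability. The function name and worked numbers below are hypothetical illustrations.

```python
# Sketch of inverse-probability weighting for the B-CAMHS99-style follow-up
# design described above. All names and numbers here are invented examples.

def weighted_prevalence(records):
    """records: (has_outcome, selection_prob) pairs for followed-up children.
    Returns the inverse-probability-weighted prevalence of the outcome."""
    pairs = [(has, 1.0 / p) for has, p in records]    # weight = 1 / P(selected)
    total = sum(w for _, w in pairs)
    with_outcome = sum(w for has, w in pairs if has)
    return with_outcome / total

# Hypothetical follow-up sample: 10 children with a baseline disorder (all
# followed up, probability 1.0) of whom 6 have the outcome, and 30 without a
# baseline disorder (a third followed up, probability 1/3) of whom 3 do.
sample = ([(True, 1.0)] * 6 + [(False, 1.0)] * 4
          + [(True, 1 / 3)] * 3 + [(False, 1 / 3)] * 27)
prevalence = weighted_prevalence(sample)
```

Here the weighted prevalence is 15/100 = 0.15, whereas the unweighted follow-up sample would give 9/40 = 0.225, overstating the prevalence because disordered children were over-sampled.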

Results

Internal Factor Structure of the SDQ

Table 1 presents the first-order model of the five hypothesised SDQ factors (Model A) for the parent, teacher and youth SDQs. Of the 75 standardised loadings (25 items times 3 informants), 37 were high (≥0.70), 36 were moderate (0.40–0.69) and only two (‘good friend’ and ‘best with adults’ on the youth SDQ) were unacceptably low (0.30–0.39). For all informants, Model A initially failed to demonstrate acceptable fit for at least one of the reported indices of global fit (CFI < 0.90 for parents; RMSEA > 0.08 for teachers; CFI and TLI < 0.90 for youth). As reported in Table 2, just-acceptable fit was usually achieved after allowing the unique variance to correlate between some items within the same factor, although the CFI in youth remained low (0.858). Taken together, these results indicate that the hypothesised first-order factor structure shows an ‘acceptable’ but not a ‘good’ fit to the parent, teacher and youth SDQ data.
Table 1

Model Fit and Fully Standardised Item Loadings from First Order Five-factor Confirmatory Factor Analyses of the Parent, Teacher and Youth SDQs (Model A)

N: Parent 18,222; Teacher 14,263; Youth 7,678
Model fit: Parent CFI = 0.857, TLI = 0.934, RMSEA = 0.059; Teacher CFI = 0.905, TLI = 0.963, RMSEA = 0.085; Youth CFI = 0.837, TLI = 0.885, RMSEA = 0.063

Standardised loadings (Parent / Teacher / Youth):

Emotional problems
  Somatic            0.46 / 0.64 / 0.48
  Worries            0.68 / 0.78 / 0.66
  Unhappy            0.86 / 0.92 / 0.77
  Clingy             0.60 / 0.77 / 0.56
  Fears              0.70 / 0.84 / 0.67
Peer problems
  Solitary           0.50 / 0.54 / 0.47
  Good friend(a)     −0.67 / −0.80 / −0.34
  Popular(a)         −0.82 / −0.97 / −0.58
  Bullied            0.67 / 0.58 / 0.73
  Best with adults   0.49 / 0.40 / 0.30
Behavioral problems
  Tempers            0.67 / 0.77 / 0.66
  Obedient(a)        −0.71 / −0.82 / −0.59
  Fights             0.73 / 0.87 / 0.59
  Lies               0.72 / 0.86 / 0.70
  Steals             0.68 / 0.71 / 0.59
Hyperactivity problems
  Restless           0.73 / 0.90 / 0.56
  Fidgety            0.78 / 0.91 / 0.65
  Distractible       0.80 / 0.90 / 0.74
  Reflective(a)      −0.69 / −0.88 / −0.59
  Persistent(a)      −0.75 / −0.88 / −0.65
Prosocial behavior
  Considerate(a)     0.82 / 0.92 / 0.76
  Shares(a)          0.71 / 0.80 / 0.56
  Caring(a)          0.66 / 0.85 / 0.66
  Kind to kids(a)    0.68 / 0.80 / 0.66
  Helps out(a)       0.52 / 0.69 / 0.59

Correlation of subscales (Parent / Teacher / Youth):
  E with P:    0.71 / 0.66 / 0.69
  E with B:    0.51 / 0.34 / 0.53
  E with H:    0.40 / 0.33 / 0.48
  E with Pr:   −0.26 / −0.24 / −0.02
  P with B:    0.58 / 0.67 / 0.47
  P with H:    0.49 / 0.54 / 0.38
  P with Pr:   −0.47 / −0.67 / −0.45
  B with H:    0.71 / 0.81 / 0.85
  B with Pr:   −0.70 / −0.82 / −0.54
  H with Pr:   −0.50 / −0.70 / −0.49

Results from Model A, as defined in Fig. 1. (a) indicates positively worded ‘strengths’ items. E = emotional latent score, P = peer latent score, B = behavioral latent score, H = hyperactivity latent score, Pr = prosocial latent score

Table 2

Model Fit in Confirmatory Factor Analyses of the Parent, Teacher and Youth SDQs

                                          CFI     TLI     RMSEA
Parent (N = 18,222)
  Model A                                 0.857   0.934   0.059
  Model A, plus minor modifications(a)    0.901   0.954   0.049
  Model B, plus minor modifications(a)    0.900   0.953   0.049
  Model C, plus minor modifications(a)    0.871   0.938   0.057
Teacher (N = 14,263)
  Model A                                 0.905   0.963   0.085
  Model A, plus minor modifications(a)    0.919   0.970   0.077
  Model B, plus minor modifications(a)    0.921   0.969   0.078
  Model C, plus minor modifications(a)    0.877   0.948   0.101
Youth (N = 7,678)
  Model A                                 0.837   0.885   0.063
  Model A, plus minor modifications(a)    0.858   0.900   0.059
  Model B, plus minor modifications(a)    0.860   0.901   0.058
  Model C, plus minor modifications(a)    0.838   0.885   0.063

Models A, B and C defined in Fig. 1. (a) Parent minor modifications: allowing correlation between the unique variance of (Clingy & Fears), (Solitary & Best with adults), (Restless & Fidgety), (Distractible & Persistent), (Reflective & Persistent). Teacher minor modifications: allowing correlation between the unique variance of (Worries & Fears), (Clingy & Fears), (Solitary & Best with adults), (Restless & Fidgety). Youth minor modifications: allowing correlation between the unique variance of (Restless & Fidgety)

Table 1 also shows high correlations in all informants between the emotional and peer latent factors (0.66–0.71), and between the behavioral and hyperactivity latent factors (0.71–0.85). This provides empirical support for our theory-driven intention to fit second-order internalizing and externalizing factors to capture these correlations, as shown in Model B. As is typical when comparing first-order and second-order models, there was relatively little difference between the fit of Model A and Model B; that is, the second-order model showed a fit to the data which was ‘acceptable’ but generally not ‘good’. This supports the potential legitimacy of treating internalizing and externalizing problems as broader factors subsuming the hypothesised subscales (although it also highlights that fitting this more complex model may not be necessary if one simply wishes to perform a CFA to assess model fit). By contrast, replacing the emotional, peer, behavioral and hyperactivity factors with first-order internalizing and externalizing factors (Model C) led to poorer model fit, indicating that this is not a legitimate simplification.

Construct Validity of the SDQ Subscales Across Informants

Table 3 presents an MTMM analysis of the five SDQ subscales, created by summing the relevant five items from the parent, teacher and youth SDQs. The Cronbach alpha coefficients were almost all in the range 0.65–0.85, indicating good internal consistency; the two exceptions were the peer problems subscales reported by parents (α = 0.58) and youth (α = 0.44). The cross-method correlations of the same traits are presented in bold; all were significantly different from zero (p < 0.001) but were only low to moderate in magnitude (0.20–0.47). These convergent correlations were therefore similar in magnitude to the correlations between different subscales from the same informant.
Table 3

MTMM Analyses for the Five Hypothesised SDQ Subscales


Emo=emotional SDQ subscale, peer=peer problems, behav=behavioral, hyp=hyperactivity, pro=prosocial. N = 18,222 parents; N = 14,263 teachers and N = 7,678 youth. N = 14,139 for the parent-teacher comparison, N = 7,561 for the parent-youth comparison and N = 5,755 for the teacher-youth comparison. Values in cells are Spearman’s correlation coefficients, except values in the diagonals which are Cronbach’s alphas. Cross-method correlations of same traits are presented in bold. Cells circled with solid lines indicate problematic discriminant validity for the behavioral subscale relative to the hyperactivity subscale. Cells circled with dashed lines indicate problematic discriminant validity for the prosocial subscale relative to the behavioral and hyperactivity subscales

In most cases the convergent correlations were significantly larger (p < 0.01) than the other correlation coefficients in the same row or column (the discriminant correlations). There were, however, two important exceptions. First, in all three informant pairs, the behavioral subscale did not show good discriminant validity relative to the hyperactivity subscale (relevant cells circled with a solid line). For example, the correlation of parent behavioral and teacher behavioral scores was 0.31, no higher than the correlation between parent behavioral and teacher hyperactivity scores (0.31) and slightly lower than the correlation between parent hyperactivity and teacher behavioral scores (0.33). Second, the teacher prosocial subscale did not show discriminant validity relative to the behavioral and hyperactivity subscales reported by either parents or youth (relevant cells circled with a dashed line).

The behavioral, hyperactivity and prosocial subscales therefore showed poor discriminant validity. Likewise, the convergent correlations for the emotional and peer subscales were often not much larger than the discriminant correlations (although, owing to the large sample size, all the differences were nonetheless significant at p < 0.01). These findings therefore do not support the claim that the five subscales tap into the same, distinct aspects of child mental health problems across all informants. By contrast, convergent and discriminant validity were much more satisfactory for the internalizing and externalizing subscales (see Table 4). However, the prosocial subscale, particularly the teacher prosocial subscale, continued to show poor discriminant validity relative to the externalizing subscale.
Table 4

MTMM Analyses for the Internalizing, Externalizing and Prosocial SDQ Subscales


Int=internalizing, ext=externalizing, pro=prosocial SDQ subscales. N = 18,222 parents; N = 14,263 teachers and N = 7,678 youth. N = 14,139 for the parent-teacher comparison, N = 7,561 for the parent-youth comparison and N = 5,755 for the teacher-youth comparison. Values in cells are Spearman’s correlation coefficients, except values in the diagonals which are Cronbach’s alphas. Cross-method correlations of same traits are presented in bold. Cells circled with dashed lines indicate problematic discriminant validity for the prosocial subscale relative to the externalizing subscale

Construct Validity of the SDQ Subscales Relative to the DAWBA

Both the baseline and the three-year follow-up prevalences of emotional disorder, behavioral disorder, ADHD and ASD generally showed monotonic increases across the full range of the corresponding parent, teacher and youth SDQ subscales at baseline (results available from www.sdqinfo.com/point_by_point.pdf). Among the five hypothesised SDQ subscales, Table 5 shows which subscales had the largest effect upon the odds of receiving a DAWBA diagnosis at three-year follow-up (note that the prosocial subscale is reverse-scored). For the parent and teacher SDQ, the expected subscale(s) always had the largest point estimates of effect size. These point estimates were also usually substantially and significantly larger than the next-largest estimates, except for the teacher emotional subscale (predicting emotional disorder) and sometimes in the comparatively under-powered analyses predicting ASD. For the youth SDQ, evidence of discriminant validity was less convincing: the emotional subscale was no more strongly associated with emotional disorder than the peer subscale, and the hyperactivity subscale no more strongly associated with ADHD than the behavioral subscale.
Table 5

Independent Association of the Five SDQ Subscales at Baseline with DAWBA Diagnosis at Follow-up (OR and 95%CI)


*p < 0.05, **p < 0.01, ***p < 0.001. Odds ratios presented for probability of DAWBA diagnosis per one-point increase in the SDQ subscale in question. Below the odds ratios, the five subscales are presented in order of magnitude; subscales sharing an underline were not significantly different at p < 0.05. Note that the prosocial score is reverse-scored to facilitate comparisons of effect sizes. ASD was not used as an outcome for the youth SDQ

The five-factor structure therefore generally showed convergent and discriminant validity relative to DAWBA diagnoses for parent and teacher SDQ but not always for the youth SDQ. Moreover, even for the parent and teacher SDQs, there was some suggestion that the behavioral and hyperactivity subscales only showed discriminant validity at higher scores. This is illustrated for the parent SDQ in Fig. 2, which shows that below 7 SDQ points the behavioral and hyperactivity subscales were equally predictive of ADHD at follow-up. There is the suggestion of a similar effect below 3 SDQ points when predicting behavioral disorder at follow-up.
Fig. 2

Independent association of the five parent SDQ subscales at baseline with DAWBA diagnoses at follow-up. Analyses come from models identical to those described in Table 5, except that the SDQ subscales were entered as categorical terms by SDQ point rather than as linear scales. Subscale scores were also grouped once the number of children per point fell to 20 or fewer, to avoid estimates based on very small numbers. As in Table 5, the prosocial score is reverse-scored to facilitate comparisons of effect sizes

By contrast, the three-factor structure showed clear convergent and discriminant validity for all three informants (Table 6) and this was true even at the lowest SDQ scores. Graphs illustrating this can be found at www.sdqinfo.com/point_by_point.pdf, as can equivalent graphs to Fig. 2 for the teacher and youth SDQs.
Table 6

Independent Association of the Three SDQ Subscales at Baseline with DAWBA Diagnosis at Follow-up (OR and 95%CI)

  

|   | Emotional DAWBA diagnosis | Behavioral DAWBA diagnosis | ADHD DAWBA diagnosis | ASD DAWBA diagnosis |
|---|---|---|---|---|
| Parents (N = 7,901) | | | | |
| Internalizing (In) | 1.24 (1.20, 1.27)*** | 1.04 (1.00, 1.07)* | 1.08 (1.02, 1.14)** | 1.42 (1.34, 1.50)*** |
| Externalizing (Ex) | 1.06 (1.02, 1.09)** | 1.38 (1.33, 1.42)*** | 1.54 (1.45, 1.63)*** | 1.00 (0.94, 1.06) |
| Not Prosocial (nP) | 0.94 (0.87, 1.00) | 1.07 (1.01, 1.13)* | 0.88 (0.80, 0.97)** | 1.74 (1.55, 1.95)*** |
| Largest subscale predictors | In Ex nP | Ex nP In | Ex In nP | nP In Ex |
| Teachers (N = 6,247) | | | | |
| Internalizing (In) | 1.13 (1.10, 1.17)*** | 1.03 (1.00, 1.07)* | 1.04 (0.99, 1.10) | 1.24 (1.17, 1.32)*** |
| Externalizing (Ex) | 1.05 (1.01, 1.09)* | 1.23 (1.20, 1.28)*** | 1.32 (1.25, 1.39)*** | 1.02 (0.95, 1.10) |
| Not Prosocial (nP) | 1.00 (0.94, 1.07) | 1.05 (0.98, 1.12) | 1.01 (0.91, 1.12) | 1.45 (1.21, 1.73)*** |
| Largest subscale predictors | In Ex nP | Ex nP In | Ex In nP | nP In Ex |
| Youth (N = 3,408) | | | | |
| Internalizing (In) | 1.24 (1.18, 1.30)*** | 1.02 (0.97, 1.08) | 1.03 (0.90, 1.18) | – |
| Externalizing (Ex) | 1.05 (1.01, 1.10)* | 1.31 (1.24, 1.38)*** | 1.36 (1.22, 1.53)*** | – |
| Not Prosocial (nP) | 0.91 (0.83, 1.00) | 1.02 (0.92, 1.13) | 1.04 (0.87, 1.25) | – |
| Largest subscale predictors | In Ex nP | Ex nP In | Ex nP In | – |

 

*p < 0.05, **p < 0.01, ***p < 0.001. Odds ratios presented for probability of DAWBA diagnosis per one-point increase in the SDQ subscale in question. Below the odds ratios, the three subscales are presented in order of magnitude; subscales sharing an underline were not significantly different at p < 0.05. Note that the prosocial score is reverse-scored to facilitate comparisons of effect sizes. ASD was not used as an outcome for the youth SDQ
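Because the odds ratios above are per one-point increase on a linear (log-odds) scale, per-point ORs multiply across points: a difference of k points corresponds to OR to the power k. The sketch below illustrates this arithmetic using the parent internalizing OR of 1.24 from Table 6; the function name is ours, not part of the paper's analysis.

```python
def odds_ratio_over_range(or_per_point: float, points: int) -> float:
    """Cumulative odds ratio implied by a per-point OR from a linear
    log-odds model: per-point ORs multiply, so a k-point difference
    corresponds to OR**k."""
    return or_per_point ** points

# Parent internalizing vs. emotional disorder: OR 1.24 per SDQ point
# (Table 6), so a 10-point difference in internalizing score implies a
# roughly 8.6-fold difference in the odds of diagnosis.
print(round(odds_ratio_over_range(1.24, 10), 1))  # 8.6
```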

Discussion

We used data from 18,222 British children to demonstrate the construct validity of an ‘internalizing’ subscale (emotional plus peer items) and an ‘externalizing’ subscale (behavioral plus hyperactivity items) in the Strengths and Difficulties Questionnaire (SDQ). Second-order internalizing and externalizing factors were generally supported by confirmatory factor analyses, although model fit was somewhat problematic for the youth SDQ. The internalizing/externalizing subscales also showed the clearest and most consistent evidence of convergent and discriminant validity across informants and with respect to clinical disorder. By contrast, cross-informant discriminant validity was poorer between the emotional and peer subscales and particularly poor between the behavioral, hyperactivity and prosocial subscales. This suggests that in low-risk, epidemiological samples these five subscales may not all tap into distinct aspects of child mental health. Using the broader internalizing and externalizing subscales instead may therefore be more appropriate when selecting explanatory and outcome variables for epidemiological studies. Yet all five subscales on the parent and teacher SDQs did show convergent and discriminant validity when predicting clinical disorder, particularly for children with high scores on these subscales. As such, retaining all five subscales appears likely to add value when screening for disorder or studying high-risk children.
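The subscale combination described above can be made concrete. A minimal sketch, assuming the standard SDQ scoring (each of the five hypothesised subscales is the sum of five items scored 0–2, giving a 0–10 range per subscale); the function name and validation are ours for illustration:

```python
def broader_subscales(emotional, peer, behavioral, hyperactivity):
    """Combine four of the five SDQ subscale scores (each 0-10) into the
    broader 10-item internalizing and externalizing subscales (each 0-20).
    The fifth (prosocial) subscale is retained separately."""
    for score in (emotional, peer, behavioral, hyperactivity):
        if not 0 <= score <= 10:
            raise ValueError("SDQ subscale scores range from 0 to 10")
    return {
        "internalizing": emotional + peer,            # emotional + peer items
        "externalizing": behavioral + hyperactivity,  # behavioral + hyperactivity items
    }

print(broader_subscales(4, 2, 3, 5))
# {'internalizing': 6, 'externalizing': 8}
```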

Our confirmatory factor analyses (CFAs) represent the first systematic evaluation of whether the parent, teacher and youth SDQs contain internalizing and externalizing factors, and of how these relate to the hypothesised five subscales. Our analyses did not support replacing the emotional, peer, behavioral and hyperactivity subscales with internalizing and externalizing factors. Instead this simplification produced worse model fit in all informants, thereby replicating the one previous study (of the parent and teacher SDQ) which made this comparison (Van Leeuwen et al. 2006). By contrast, models which added second-order internalizing and externalizing factors did achieve acceptable values for all fit indices in the parent and teacher SDQ and for two out of three indices in the child SDQ. This provided some empirical support for our theoretically-driven proposal to evaluate the convergent and discriminant validity of the ten-item internalizing and externalizing SDQ subscales. Nevertheless, it should be noted that in all CFA analyses some indices of fit were ‘just acceptable’ rather than ‘good’. Moreover, on the youth SDQ the CFI index never achieved acceptable values and two item loadings were unacceptably low. These findings therefore add to the CFA evidence that the SDQ does not have a very clean internal factor structure (Mellor and Stokes 2007) but that the hypothesised five subscales may nonetheless provide a passable description (Ronning et al. 2004; Ruchkin et al. 2007; Van Leeuwen et al. 2006).

Our paper also extends the CFA literature by using additional approaches to evaluate construct validity. To our knowledge, this is the first time that full multitrait-multimethod (MTMM) analyses have been presented for the parent, teacher and youth SDQs. The convergent validity coefficients of 0.20–0.47 are lower than would be ideal, although this is typical of questionnaire measures of child psychopathology. For example, these values compare favourably to the inter-informant agreements reported in a meta-analysis of other child mental health questionnaires: 0.27 between parents and teachers, 0.25 between parents and children, and 0.20 between teachers and children (Achenbach et al. 1987). More worrying is the poor discriminant validity between the behavioral and hyperactivity subscales. This indicates that when applied to general population samples, the ‘behavioral’ and ‘hyperactivity’ labels may be misleading, as these subscales cannot be assumed to be tapping into distinct aspects of externalizing problems. The MTMM analyses raised similar concerns for the emotional vs. peer problems subscales, which likewise showed only weak evidence of cross-informant discriminant validity. The teacher prosocial subscale also did not show discriminant validity relative to the behavioral and hyperactivity subscales, suggesting that teachers may have been subsuming all these symptoms into a single ‘disruptive’/‘helpful’ continuum.

These findings suggest that it would not be valid (for example) to use mean scores from the behavioral and hyperactivity SDQ subscales in order to compare the correlates of behavioral vs. hyperactivity problems. If the same covariates were found to predict both subscales, then this might simply reflect the two subscales measuring the same thing rather than a real similarity in the correlates of behavioral and hyperactivity problems. Although firm recommendations are not possible without further replication, our provisional conclusion is therefore that the broader internalizing and externalizing subscales may be more appropriate explanatory or outcome variables in epidemiological studies. The internalizing and externalizing subscales also have the advantage that their greater number of items would be expected to reduce measurement error. This consideration may be particularly important when some populations of interest are small in size (e.g. minority ethnic groups).
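The expectation that the 10-item subscales carry less measurement error than the 5-item subscales can be illustrated with the Spearman-Brown prophecy formula, a standard psychometric result (not an analysis performed in the paper; the reliability value below is an assumed illustrative figure, not an estimate from B-CAMHS):

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Spearman-Brown prophecy formula: predicted reliability when a
    scale is lengthened by `length_factor`, assuming the added items
    are of comparable quality to the existing ones."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# A hypothetical 5-item subscale with reliability 0.60, doubled to 10
# items by merging it with a parallel 5-item subscale:
print(round(spearman_brown(0.60, 2), 2))  # 0.75
```

The direction of the effect is general: for any reliability strictly between 0 and 1, lengthening the scale raises the predicted reliability, which is why the pooled internalizing and externalizing subscales should be less noisy than their 5-item components.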

Yet despite their poor cross-informant discriminant validity in MTMM analyses, all five SDQ subscales showed good discriminant validity when predicting clinical disorders. This seemed to be particularly true at higher SDQ subscale scores. One possible explanation for this discrepancy is that the MTMM analyses reflect patterns of subscale association in the full B-CAMHS sample, which is mostly comprised of children without mental health problems. In this low-risk, general population sample there may not always be a clear-cut distinction between (for example) behavioral and hyperactivity symptoms or between externalizing symptoms and prosocial behavior. Working with many children, teachers may find it particularly hard to make such distinctions, which could explain why discriminant validity between the externalizing and prosocial symptoms was particularly poor on the teacher SDQ. By contrast, discriminating symptom clusters may be easier when focusing on children with more severe mental health problems. An analogy from clinical practice would be the greater ease of distinguishing depressive and anxiety disorders in mental health specialist clinics than in the general population (Goldberg and Huxley 1992).

We therefore conclude that there may be no single best set of subscales to use in the SDQ; rather, the optimal choice may depend in part upon one’s study population and study aims. Specifically, although the five hypothesised SDQ subscales should be treated with caution in low-risk samples, they do seem to add value when studying children with mental disorder and/or with higher SDQ scores. Strikingly, this applied not only to the emotional, behavioural and hyperactivity subscales when predicting the common child mental disorders, but also applied to the prosocial and peer problems subscales when predicting autistic spectrum disorders. Thus all five subscales appeared to have the potential to play a distinct, useful role when predicting child mental disorders, and this included subscales such as parent-reported peer problems which showed poor construct validity and internal reliability in the MTMM analyses. These findings are consistent with the fact that algorithms based on the five separate subscales have shown good performance in predicting type of disorder in clinics (Goodman et al. 2000b) or in the skip-rules of the DAWBA (Goodman et al. 2000a). They also highlight the vital importance of using multiple approaches to examine construct validity, and thereby building up a more complete and more nuanced picture of a measure’s performance. The unusually rich mental health data of our sample allowed us to go beyond most other studies in this regard, and we consider this a central strength of this paper.

Yet despite this key strength, our analyses and conclusions also have important limitations. The most important is the provisional nature of our conclusions regarding the optimal choice of SDQ subscales; firm recommendations must await replication in other studies. Other studies may also wish to use additional analytic approaches, such as conducting MTMM analyses within a CFA framework in order to estimate the convergent and discriminant correlations between the hypothesised latent traits (Brown 2006). Although arguably less transparent than using simple summed scores (hence our decision to use the ‘traditional’ approach in this paper), this would have the advantage of reducing measurement error. Finally, future studies could usefully be extended by including evidence from a larger number of domains of child psychopathology. These may be important in revealing aspects of convergent or discriminant validity for the SDQ subscales which are not apparent here. For example, factor analyses in an Australian sample of 4–9 year olds provide some evidence that parent-reported callous and unemotional traits (from a psychopathy measure) load with the prosocial SDQ items but not the behavioral or hyperactivity items (Dadds et al. 2005). This was not apparent in B-CAMHS04, however, where the magnitude of the correlation between the prosocial subscale and callous and unemotional traits was intermediate between the behavioral and hyperactivity subscales (Moran et al. 2009). This discrepancy between the Australian sample and B-CAMHS04 further highlights the need for replication of our findings across other large datasets with multiple informants and high-quality diagnoses.

Conclusion

To summarise, the SDQ has several attractive features including a brief format, comparable versions for parents, teachers and young people, and versions in over 60 languages (see www.sdqinfo.com). These analyses add to the evidence, however, that the hypothesised five subscales may not always tap distinct constructs. Our analyses further indicate that the optimal choice of subscales may depend on one’s study population and study aims. Our findings indicate that studies examining the broad constructs of internalizing and externalizing problems would be justified in using the SDQ to do so. Moreover, particularly in low-risk samples, this may be the more conservative approach in order to ensure an accurate description of what is being assessed and in order to generate findings which are comparable across informants. By contrast, using the five separate subscales may only be justified when seeking to study high-risk children, including those with mental disorder and/or with higher scores on the SDQ subscales.

Conflict of Interest

AG is a director of Youthinmind, which provides no-cost and low-cost software and web sites related to the SDQ and the DAWBA.

Copyright information

© Springer Science+Business Media, LLC 2010