Diagnostic efficiency of the SDQ for parents to identify ADHD in the UK: a ROC analysis

Early, accurate identification of ADHD would improve outcomes while avoiding unnecessary medication exposure for non-ADHD youths, but is challenging, especially in primary care. The aim of this paper is to test the Strengths and Difficulties Questionnaire (SDQ) using a nationally representative sample to develop scoring weights for clinical use. The British Child and Adolescent Mental Health Survey (N = 18,232 youths 5–15 years old) included semi-structured interview DSM-IV diagnoses and parent-rated SDQ scores. Areas under the curve for SDQ subscales were good (0.81) to excellent (0.96) across sex and age groups. Hyperactivity/inattention scale scores of 10+ increased odds of ADHD by 21.3×. For discriminating ADHD from other diagnoses, accuracy was poor (<0.70) to good (0.88); Hyperactivity/inattention scale scores of 10+ increased odds of ADHD by 4.47×. The SDQ is free, easy to score, and provides clinically meaningful changes in odds of ADHD that can guide clinical decision-making in an evidence-based medicine framework.

with research diagnoses [15] (a perfect agreement would equate to a kappa of 1, and chance agreement would equate to 0). ADHD rates also change with age and with gender [16], making age and sex important potential moderators of test accuracy. What is required is a careful evaluation of the SDQ's performance in a nationally representative sample, testing its ability to identify ADHD against diagnoses from semi-structured diagnostic interviews that integrate information about school functioning, consistent with current nosological guidelines (DSM; ICD).
The overarching goal of this study is to evaluate how the SDQ for Parents could help in the clinical identification of ADHD conceptualized as a discrete category, differentiating it from other sources of externalizing behavior. To do this, we evaluated how the SDQ [12] performs for ADHD screening in a nationally representative sample of children and adolescents from the UK [1]. Although this is not the first study analyzing SDQ population data [17], it is the first direct comparison of the ability of multiple SDQ scales (Total difficulties, TD vs. Hyperactivity/inattention, H/I vs. Conduct problems, CP) for discriminating ADHD cases. We expected the parent SDQ TD and the H/I subscale to outperform other SDQ subscales for detecting any ADHD. Because rates of symptoms differ by gender and age, we also examined whether these variables changed the accuracy of the SDQ with regard to ADHD status. It would be parsimonious if the scales showed consistent accuracy [18,19] even though the mean scores might differ. If there were differences in accuracy, a nationally representative sample provides a good basis for establishing distinct sex- or age-based norms. Finally, we followed the recommendations of evidence-based medicine and estimated multilevel diagnostic likelihood ratios (DLRs; [20]) for SDQ score ranges to ease clinical application of the national norms to individual cases. DLRs are defined as the probability of a positive SDQ test result given ADHD divided by the probability of a positive SDQ test result given non-ADHD.
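The DLR definition extends naturally from a single positive/negative result to a score range. The following is a minimal sketch of that calculation, assuming raw scores are available for each group; the function name `band_dlr` and the example scores are hypothetical and are not taken from the survey data.

```python
def band_dlr(adhd_scores, non_adhd_scores, low, high):
    """DLR for the score band [low, high]:
    P(score in band | ADHD) / P(score in band | non-ADHD)."""
    p_adhd = sum(low <= s <= high for s in adhd_scores) / len(adhd_scores)
    p_non = sum(low <= s <= high for s in non_adhd_scores) / len(non_adhd_scores)
    # Raises ZeroDivisionError if no non-ADHD youth falls in the band.
    return p_adhd / p_non


# Hypothetical H/I scores (range 0-10), for illustration only.
adhd = [10, 9, 10, 8, 7, 10, 6, 9, 10, 8]
non_adhd = [1, 2, 3, 0, 4, 5, 2, 10, 3, 1, 6, 2, 8, 3, 4, 1, 0, 2, 5, 3]

print(round(band_dlr(adhd, non_adhd, 10, 10), 2))  # band "score of 10"
print(round(band_dlr(adhd, non_adhd, 0, 4), 2))    # band "scores 0-4"
```

A DLR above 1 for a band means that scores in that band are more common among youths with ADHD than among those without; a DLR below 1 means the reverse.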

Method

Participants and procedures
The current study used data from The British Child and Adolescent Mental Health Survey 1999 [21], which was designed to estimate prevalence rates based on International Classification of Diseases-10 and DSM-IV criteria. A total of 18,232 children and adolescents (5-15 years old) were recruited from England, Wales, and Scotland (see recruitment strategy details in [1]).
Trained child and adolescent psychiatrists reviewed both the verbatim accounts and the answers to the Development and Well-Being Assessment (DAWBA; see Measurement section for further detail) [22] before assigning diagnoses. All diagnoses used in this study were based on unmodified DSM-IV criteria and were current rather than lifetime diagnoses. Parents, teachers, and eligible 11-15-year-old children were invited to complete the SDQ [12], a 25-item questionnaire divided into five scales of five items each (see details in Measurement section).

Development and Well-Being Assessment (DAWBA; [22])
The DAWBA is a widely used semi-structured interview that involves child and parent interviews alongside a teacher questionnaire. The child/parent interviews and teacher questionnaires assess current and recent past psychiatric symptoms and their impact on functioning in children. The DAWBA is based on diagnostic criteria (ICD-10 and DSM-IV) and focuses on anxiety disorders, depressive disorders, ADHD and conduct disorders. A clinical diagnostic rating is informed by triangulation of these three sources.
The validity of the clinical diagnoses derived from the DAWBA has been demonstrated by concordance with case note screening in a clinical sample of children aged 11-15 years [23], and with a full clinical assessment for ADHD specifically [24].

Strengths and Difficulties Questionnaire (SDQ; [12]): parent version
The 25-item SDQ generates scores for five subscales confirmed through factor analysis [25]: emotional problems, conduct problems, hyperactivity-inattention, peer problems, and prosocial behaviors. A total difficulties score also sums all items. The hyperactivity-inattention scale is composed of two items about inattention, two items about hyperactivity, and one item about impulsiveness, the three key symptom domains for a DSM-IV diagnosis of attention-deficit/hyperactivity disorder (ADHD) [26]. The parent SDQ demonstrated good concordance with teacher and child versions, and good test-retest reliability and internal consistency [25]. Validity was demonstrated by predictive validity and high specificity in terms of psychiatric diagnoses. Sensitivity was not as high. The present study focused on the global total difficulties score (TD), and the hyperactivity/inattention (H/I) and conduct problems (CP) subscales of the parent SDQ version as predictors of "any" DAWBA ADHD diagnosis.

Analytic plan
Nonparametric estimates of the area under the curve (AUC) from receiver operating characteristic (ROC) analyses quantified the diagnostic efficiency of the SDQ H/I, CP and TD scores. A rough guideline for evaluating AUC values is: <0.70 poor; 0.70-0.79 fair; 0.80-0.89 good; and 0.90-1.00 excellent [27], although values higher than 0.90 in mental health contexts are often the result of design flaws such as comparing clinical cases to healthy controls [28].
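For readers less familiar with the nonparametric AUC, it equals the probability that a randomly chosen case scores higher than a randomly chosen non-case, with ties counted as one half (the Mann-Whitney formulation). The sketch below illustrates this under that assumption and attaches the rough descriptive bands quoted above; the function names and scores are hypothetical, and this is not the pROC implementation used in the study.

```python
def auc_mann_whitney(case_scores, control_scores):
    """Nonparametric AUC: P(case score > control score), ties count 1/2."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def auc_label(auc):
    """Rough descriptive bands quoted in the text [27]."""
    if auc >= 0.90:
        return "excellent"
    if auc >= 0.80:
        return "good"
    if auc >= 0.70:
        return "fair"
    return "poor"


# Hypothetical scores, for illustration only.
cases = [8, 9, 10, 6]
controls = [2, 3, 6, 1, 4]
auc = auc_mann_whitney(cases, controls)
print(auc, auc_label(auc))
```

The pairwise loop is quadratic in sample size, which is acceptable for a small illustration; rank-based formulas give the same value more efficiently on large samples.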
AUCs were calculated for the target condition of any ADHD using SDQ scores, to evaluate whether the TD or the subscale scores (H/I and CP) were better able to discriminate youth with any ADHD from other youth in the sample. Venkatraman's permutation test compared ROC curves [29]. Moderator analyses tested whether the diagnostic efficiency of the SDQ scores changed significantly when comparing males and females, and youth age groups.
Finally, we calculated diagnostic likelihood ratios (DLRs) for optimal cut-points yielding the best balance between sensitivity and specificity from the ROC curves [30]. DLRs based on optimal cut-points provide clinically useful information for predicting the likelihood of a diagnosis. DLRs of less than 1.0 indicate that the observed score is associated with lower odds; DLRs of 1.0 mean that the score does not change the odds; DLRs between 2 and 5 are a small increase of the odds and potentially clinically meaningful; DLRs between 5.0 and 10.0 are a moderate increase, and DLRs greater than 10 are often clinically decisive [31].
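The cut-point selection and the interpretation bands can be sketched as follows. This is an illustration only: it assumes that "best balance between sensitivity and specificity" is operationalized as the maximum of Youden's index (sensitivity + specificity - 1), which the text does not state explicitly, and the function names and scores are hypothetical. The text does not name a band for DLRs between 1 and 2, so the sketch flags that range rather than inventing a label.

```python
def best_cutoff_youden(case_scores, control_scores):
    """Return (cutoff, sensitivity, specificity) maximizing Youden's index,
    treating score >= cutoff as a positive test."""
    best = None
    for cutoff in sorted(set(case_scores) | set(control_scores)):
        sens = sum(s >= cutoff for s in case_scores) / len(case_scores)
        spec = sum(s < cutoff for s in control_scores) / len(control_scores)
        j = sens + spec - 1.0
        if best is None or j > best[0]:
            best = (j, cutoff, sens, spec)
    return best[1], best[2], best[3]


def dlr_band(dlr):
    """Interpretation bands quoted in the text [31]."""
    if dlr < 1.0:
        return "lower odds"
    if dlr == 1.0:
        return "no change in odds"
    if dlr < 2.0:
        return "range not named in the text"
    if dlr < 5.0:
        return "small increase"
    if dlr <= 10.0:
        return "moderate increase"
    return "often clinically decisive"


# Hypothetical scores, for illustration only.
cases = [8, 9, 10, 6, 7]
controls = [2, 3, 6, 1, 4, 5, 7, 0]
cutoff, sens, spec = best_cutoff_youden(cases, controls)
positive_dlr = sens / (1.0 - spec)  # DLR for a score at or above the cutoff
print(cutoff, sens, spec, positive_dlr, dlr_band(positive_dlr))
```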
All analyses were done using SPSS version 22.0 and the pROC package in R [32].

Table 2 presents the participant demographics split by ADHD diagnosis. We report demographics for the ADHD combined group (n = 264) as well as subgroups characterized by inattention (n = 110) and by hyperactivity (n = 35). Mean age and family size did not differ significantly across the groups. Half of the non-ADHD group was male, whereas the proportion of males was significantly higher in each of the ADHD groups, at over two-thirds of each. Relative to the non-ADHD group, all three ADHD groups (combined, inattention and hyperactivity) had a significantly higher percentage of white children, single parent family background, parental unemployment and mothers with no educational qualifications. The ADHD groups had also experienced three or more life events in the past year. For clinical variables, the ADHD groups reported poorer child and parent health and family functioning, as well as higher rates of neurodevelopmental problems.

Diagnostic efficiency statistics
The AUCs for the hyperactivity/inattention, conduct problems and total difficulties scales from the SDQ ranged from good (0.81) to excellent (0.96) in male and female subsamples and at different age ranges (Table 3a). In pairwise comparisons of AUCs between scales, H/I and TD outperformed the CP scale, except among males age 14-16 years. In the group of youngest males, H/I outperformed TD, whereas among females of the same age group TD outperformed H/I. Apart from these exceptions, there were no major differences between H/I and TD performance, in spite of the greater number of items in the TD score.
No significant AUC differences were found between gender and age groups (p > 0.05) for the H/I subscale, supporting the use of a single set of cutoff scores for the entire sample.
A score of 5+ (from a possible range of 0-10) on the H/I subscale had a DLR of 2.3, and a score of 10 yielded a DLR of 21.3, reflecting a large increase in the post-test probability of any ADHD in this national community sample (Table 4).

An outpatient proxy clinical scenario: sensitivity analyses
As the sample composition of this national study could resemble the situation described by Youngstrom et al. [28], where high AUCs are the result of comparing a majority of healthy participants with a minority of clinical cases, sensitivity analyses focused only on participants who received a positive mental health diagnosis other than any ADHD in The British Child and Adolescent Mental Health Survey 1999 (e.g., n = 685 with any emotional diagnosis, n = 608 with any anxiety diagnosis, n = 136 with less common psychiatric diagnosis, etc.). These evaluated the ability of the SDQ to discriminate "which diagnosis" instead of a general "sick versus well" comparison. The AUCs for the H/I, CP and TD scales ranged from poor (<0.70) to good (0.88) (Table 3b). This time, in pairwise comparisons of AUCs between scales, H/I consistently outperformed the CP and TD scales. CP and TD performed similarly, showing fair or poor performance in discriminating any ADHD in this subsample. As with the full sample, no significant moderating effect of age or gender was observed.
For the comparisons limited to clinical cases, a score of 8+ (score range 0-10) on the H/I subscale produced a DLR of 1.8, and a score of 10 produced a DLR of 4.47 (Table 4). Using an online calculator (http://araw.mede.uic.edu/cgi-bin/testcalc.pl) to combine the pretest probability (23 %, the estimated prevalence of any ADHD in this subsample) with the DLR of 4.47 yields a precise post-test probability estimate of 57 %, and using the probability nomogram recommended in EBM provides a close, quick approximation (Fig. 1).
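The calculation performed by the calculator or nomogram is the odds form of Bayes' Theorem: convert the pretest probability to odds, multiply by the DLR, and convert back to a probability. A minimal sketch, using the 23 % pretest probability and the two DLRs reported for the clinical subsample (the function name `post_test_probability` is our own label, not part of any cited tool):

```python
def post_test_probability(pretest_probability, dlr):
    """Post-test probability from a pretest probability and a DLR."""
    pretest_odds = pretest_probability / (1.0 - pretest_probability)
    post_test_odds = pretest_odds * dlr
    return post_test_odds / (1.0 + post_test_odds)


pretest = 0.23  # estimated prevalence of any ADHD in the clinical subsample
for label, dlr in [("H/I score 8+", 1.8), ("H/I score 10", 4.47)]:
    print(label, round(post_test_probability(pretest, dlr), 2))
```

With a DLR of 4.47 this reproduces the 57 % estimate reported above; with a DLR of 1.8 the same pretest probability rises only to about 35 %.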
AUCs observed in this subsample appear similar to benchmarks from other samples that used outpatient referrals [19,33,34] (Table 1).

Discussion
Results showed that the SDQ for Parents is a statistically valid tool for discriminating cases with ADHD from those without ADHD in a nationally representative group of youths, as well as from children experiencing other mental health diagnoses in the UK. Accuracy levels were consistent with SDQ performance discriminating psychopathology reported by Stone's review [35], and with details provided in Table 1. The present results also add to previous findings [17] because they are based on a normative sample, addressing sampling limitations in prior work. Results from our sensitivity analysis, which focused only on participants with a positive mental health diagnosis, confirmed the SDQ as a valid tool to detect cases with ADHD among youths meeting criteria for other disorders. Again, prior studies established a plausible range of estimates, and the present work advances clinical utility using a representative normative sample to establish weights, providing a good estimate of performance in pediatric and general practice settings.
We extended prior work by adding pairwise comparisons between SDQ scales in this large community sample, testing the performance of the H/I, CP, and TD scales for identifying any ADHD. As hypothesized, the H/I and TD scales were significantly better than the CP subscale. It is notable that the H/I scale performed similarly to the TD despite its brevity. Furthermore, in line with previous reports, no significant differences in SDQ accuracy between males and females were observed [18,19], nor did accuracy differ between age groups [19].

Evidence-based assessment is an important component of the diagnosis and treatment of mental health problems. It can help clinicians to improve the accuracy of their diagnostic decisions and limit the influence of biases and heuristics on clinical judgment [36]. Incorporating actuarial methods as part of the assessment process enables clinicians to integrate multiple sources of data, improving the specificity of predictions made about diagnosis and prognosis [37-39]. The SDQ discriminates cases with any ADHD from those with other disorders (as observed in sensitivity analyses), showing its utility as a component of the assessment process. The current study adds to the data indicating that, in addition to identifying youth with ADHD in a representative national community sample, the SDQ can also help to identify youth with ADHD in clinical samples. This is important, as the ability to distinguish healthy youth from youth with ADHD is not as helpful as being able to distinguish youth with ADHD from youth with ODD or other externalizing symptoms. It is also one of the first studies to provide nationally representative norms and weights, combined with a semi-structured diagnostic interview to provide the criterion diagnosis. In addition to being the largest study to date, the present work also used state-of-the-art analytic methods to evaluate potential moderators of accuracy. It is also the first to present DLRs, which are crucial for clinicians to integrate the SDQ into the evidence-based assessment framework, integrating clinical findings in a way that directly guides decision-making for individual cases.
As an example, using McGee's mnemonic [40], a likelihood ratio of 4 increases the probability of any ADHD by about 25 %. With a pretest probability of 23 % (estimated prevalence of any ADHD in this subsample of youths) and a score of 10 on the H/I subscale, the post-test probability of having any ADHD would be approximately 23 + 25 = 48 %, fairly close to the more precise estimate of 57 % obtained using a probability nomogram or calculator (http://araw.mede.uic.edu/cgi-bin/testcalc.pl) to apply Bayes' Theorem.
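The mental shortcut can be sketched in code. We assume here that the mnemonic corresponds to the commonly cited bedside approximation in which the absolute change in probability is roughly 0.19 times the natural logarithm of the likelihood ratio, usually described as reasonable only for pretest probabilities between about 10 % and 90 %; this formula and the function name `approximate_change` are our assumptions and are not stated in the text above.

```python
import math


def approximate_change(dlr):
    """Assumed bedside approximation: change in probability ~ 0.19 * ln(LR)."""
    return 0.19 * math.log(dlr)


# Benchmark likelihood ratios often memorized with the mnemonic.
for lr in (2, 5, 10):
    print(lr, round(100 * approximate_change(lr)), "percentage points")

pretest = 0.23
shortcut = pretest + approximate_change(4)  # rounding the DLR of 4.47 down to 4
print(round(shortcut, 2))  # compare with the exact 57 % from Bayes' Theorem
```

Under this assumption a likelihood ratio of 4 adds roughly 26 percentage points, in line with the "about 25 %" used in the worked example, and the shortcut lands a few points below the exact Bayesian estimate.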

Limitations
Though the SDQ has clear clinical utility, the size of the subsamples of youth with ADHD subtype diagnoses (particularly hyperactivity, n = 35) limited our ability to test SDQ performance between ADHD subtypes. Although SDQ Teacher data were gathered, teacher ratings were used as a piece of evidence in establishing the formal diagnosis using the DAWBA; thus there would be criterion contamination that would exaggerate the apparent accuracy of teacher report because it contributed to both the predictor and the criterion [41]. Future studies of the SDQ in community samples should evaluate both measures in a design that avoids criterion contamination to explore whether one outperforms the other and whether there is incremental validity in combining the two [42,43]. In this study, ADHD is conceptualized as a discrete category, an assumption that would be inconsistent with a dimensional conceptualization of ADHD. But, even when many aspects of a construct behave continuously, there are practical reasons to specify thresholds for dichotomous present/absent or treat/do not treat decisions. This is well established with both hypertension and obesity: the distribution of these is not bimodal, but thresholds are used for labeling and for treatment decisions [44].

Fig. 1 (probability nomogram) Step 1 indicates the pretest probability or estimated prevalence of a particular condition (23 % of any ADHD in this example). Step 2, on the middle axis, carries the diagnostic likelihood ratio associated with a particular cut score (based on Table 4, a DLR of 4.47 is associated with a score of 10). Finally, Step 3 reflects the estimated post-test probability of having any ADHD diagnosis. If a different youth obtains a different score on the SDQ H/I subscale, for example a score of 8, the only correction needed to the previous steps is the identification of the appropriate DLR in Table 4. Next, trace a new line starting at the same point (identical estimated prevalence), crossing the appropriate DLR at Step 2, and read the new estimated post-test probability on the last axis (see thin arrow).
Finally, ADHD can be conceptualized either as a source of group differences or as a constructivist variable. We note that even a constructivist definition of ADHD has issues of reliability and measurement error. For example, patients could misread a checklist, or misconstrue the nature of an item. Clinicians frequently interpret the same responses differently: multiple studies have shown that even when presented with videotaped interviews [45] or vignettes with fixed content [46-48], clinicians apply the constructivist definitions inconsistently. Patients confront this regularly when they get a second opinion: one physician says "yes," and the other says "no", so does the person have the illness or not? Kraemer [30] described this as resulting in imperfect reliability and validity for the diagnostic criterion, and the medical testing literature has developed methods for dealing with missing or imperfect gold standards [49,50], recognizing that error can influence even categorical conditions with strong biological models.
The SDQ is a free, easy-to-use measure that has demonstrated utility as an ADHD screening measure in the community, and among youth experiencing mental health problems in the community, in the UK. Current results suggest that the highest scores on the H/I subscale of the SDQ increase the odds that an individual meets criteria for any ADHD by a factor of more than 20 compared to healthy peers, and by a factor of about 4.5 compared to other youths with commonly diagnosed mental health issues (as reflected by the DLRs in Table 4). From a clinician's perspective, this information can be very helpful in determining whether further assessment and/or treatment is warranted, as well as in informing selection between treatments.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.