INTRODUCTION

Eating disorders affect upwards of 30 million people and carry significant morbidity and mortality.1 Effective screening for eating disorders is critical because these disorders are commonly underdiagnosed and undertreated.1,2,3 The 5-item SCOFF (Sick, Control, One, Fat and Food; see Fig. 1) questionnaire, developed in 1999 by Morgan and colleagues, is the most widely used screening measure for eating disorders. With the inclusion of binge eating disorder and other specified eating disorders (i.e., atypical anorexia, low frequency or limited duration bulimia nervosa and binge eating disorder, purging disorder, night eating syndrome) in DSM-5,4 it has become increasingly important to expand awareness of various types of eating pathology. Importantly, these new categories of eating disorders had not yet been defined when the SCOFF was developed.

Figure 1 The SCOFF questionnaire.

The changing landscape of diagnostic eating disorder categories since the publication of DSM-5 highlights the importance of ensuring screening tools are appropriate for detecting the full range of eating disorders in the general population. To date, the SCOFF has been the recommended screening tool across numerous validation studies; however, these recommendations have not been systematically assessed. The purpose of this systematic review and meta-analysis is to evaluate whether the SCOFF can appropriately screen patients in the general population for the full range of eating pathology currently represented in DSM-5. To accomplish this, the literature was reviewed for studies that report the diagnostic test characteristics of the SCOFF.

METHOD

The Preferred Reporting Items for a Systematic Review and Meta-analysis of Diagnostic Test Accuracy Studies (PRISMA-DTA) guideline was followed in preparing this systematic review.5 This study is registered online with PROSPERO (CRD42018089906) and all search strategies and methods were determined before the onset of the study.

Search Strategy and Study Selection

We conducted a systematic literature search using PubMed from database inception through March 13, 2018. The search terms were “SCOFF” and “Feeding and Eating Disorders/Diagnosis” AND (“Psychometrics” OR “Sensitivity and Specificity”). Other search terms, such as “SCOFF questionnaire,” “Eating disorders screening,” and “Feeding and Eating Disorder/Diagnosis” AND “screening,” were attempted but returned overlapping or extraneous results. In addition to these searches, the word “SCOFF” was searched for in all text of the Cochrane database, as this produced the most comprehensive results in that database. Two reviewers (AMK and AGM) independently screened all abstracts generated from the subject search. Inclusion criteria specified that studies were published in English or were available in translation to English. To be included, an article had to allow validation information for the SCOFF to be derived and had to report some specific demographic information (i.e., age range or standard deviation, gender, eating disorder diagnosis). The two independent reviewers (AMK and AGM) were in perfect agreement (κ = 1) on the exclusion of articles.

Data Extraction and Quality Assessment

The reviewers (AMK and AGM) used a standardized data collection form to extract data on date of publication, country in which the study was conducted, recruitment method, reference measure utilized, sample size, age, gender, and race/ethnicity of the sample, participants’ average BMI and weight category, and percentage of the sample with eating disorder diagnoses.

The reviewers (AMK and AGM) independently assessed the quality of all included studies using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool.6 Differences in assessment were resolved by consensus, and inter-rater agreement was very high (κ = 1). The original QUADAS-2 tool includes 4 domains: patient selection, index test(s), reference standard, and flow and timing. Given that there was little meaningful variation in the index test (i.e., the SCOFF) questions or administration, the index test domain was dropped from ratings. Specific rating criteria for each domain are presented in Appendix 1.
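For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of Cohen's kappa for two raters. The function name and the example ratings are hypothetical illustrations and are not taken from the reviewers' actual rating data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Hypothetical risk-of-bias ratings for four studies (not real data).
ratings_a = ["low", "high", "unclear", "low"]
ratings_b = ["low", "high", "unclear", "low"]
kappa_identical = cohens_kappa(ratings_a, ratings_b)

# Hypothetical ratings that agree only as often as chance would predict.
kappa_chance = cohens_kappa(["low", "low", "high", "high"],
                            ["low", "high", "high", "low"])
```

A kappa of 1 corresponds to identical ratings on every item, whereas a value near 0 indicates agreement no better than chance.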

Data Synthesis and Analysis

Statistical measures of test performance (true positive, false positive, true negative, and false negative) were extracted from individual studies. Extracted test performance data can be found in Appendix 2. When sensitivity (true positive rate) and specificity (true negative rate) were reported separately for different eating disorder diagnoses or by gender, frequencies of the statistical measures for test performance were summed using data available in the manuscript. In two instances,7, 8 these data were not readily available or additional information from the authors was needed, and the corresponding authors of these manuscripts were contacted. Contacted authors provided data to calculate frequencies for the total sample (M. Tseng, personal communication, May 2018; S. Maguen, personal communication, November 2018). For all studies, statistical heterogeneity was estimated using the I² statistic. Statistical heterogeneity provides an estimate of the amount of variance that is attributable to variability between studies. To account for variability across studies, subgroup analyses were conducted. Subgroups were prespecified based on study methodology (case-control vs non-case-control design; type of reference standard used, interview vs questionnaire), study quality based on QUADAS-2 ratings, and patient characteristics (gender, age, sample type, and location). Statistical analyses were performed using STATA version 14.2 (StataCorp, College Station, TX). The meta-analytical integration of diagnostic accuracy studies (MIDAS) command was used to obtain figures and descriptive summaries and to conduct subgroup analyses.9
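The summing of subgroup frequencies described above can be illustrated with a short sketch. The `Counts` class and the subgroup numbers are hypothetical and serve only to show how 2×2 cell counts, rather than the subgroup sensitivities and specificities themselves, are added before the rates are recomputed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Cells of a 2x2 diagnostic accuracy table for one sample."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives

    def __add__(self, other):
        # Summing cell counts pools two subgroups into one sample.
        return Counts(self.tp + other.tp, self.fp + other.fp,
                      self.tn + other.tn, self.fn + other.fn)

    @property
    def sensitivity(self):
        # True positive rate: TP / (TP + FN).
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self):
        # True negative rate: TN / (TN + FP).
        return self.tn / (self.tn + self.fp)


# Hypothetical subgroup counts reported separately by gender (not real data).
women = Counts(tp=18, fp=30, tn=140, fn=2)
men = Counts(tp=6, fp=10, tn=80, fn=4)

total = women + men
total_sensitivity = total.sensitivity
total_specificity = total.specificity
```

Pooling the counts weights each subgroup by its size, which a simple average of the two subgroup sensitivities would not do.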

RESULTS

Literature Search and Study Selection

A total of 984 abstracts were identified through the included databases and three were identified through bibliographies (Fig. 2). After 47 duplicates were removed, all titles or abstracts were reviewed for relevance. Following initial review, 882 records were excluded leaving 58 full-text articles for full review. Of these 58 articles, 33 were excluded for the following reasons: no validation information available, no reference standard included, article and data not available in English, and article representing re-publication of prior data or commentary on data.

Figure 2 Study flow diagram of literature search.

Study Characteristics

Table 1 depicts the characteristics of included studies.7, 8, 10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32 The 25 studies reviewed included a total of 11,531 individuals. Thirteen unique countries were represented with 16 studies conducted in Europe, four in North America, three in Asia, and two in South America. Samples were recruited from three primary locations: medical settings (primary care clinics, specialty clinics), schools (grade school, high school, and universities), and the general community. Ages across studies ranged from 10 to 95, with the majority (n = 18) of studies including primarily adult samples and seven being conducted in entirely adolescent or young adult populations. Twelve studies included an entirely female sample and four additional studies included samples which were at least 70% female. The percentage of females included in the remaining eight studies ranged from 46.2 to 68%. Thirteen studies utilized interview format (SCID-I, CIDI interview, DSM-IV interview, EDE) for the reference standard. The remaining twelve studies utilized various self-report measures including the EDI-3, the EDE-Q, the EAT-26, the Q-EDD, and an ICD-10 symptom rating scale. Eighteen studies reported the percentage of the sample diagnosed with an eating disorder based on the criterion used. The range of any eating disorder diagnosis, which included anorexia nervosa, bulimia nervosa, binge eating disorder, and eating disorder not otherwise specified, was 1.2 to 64.3%. The range of the sample that had each eating disorder diagnosis was as follows: anorexia nervosa = 0 to 32.1%, bulimia nervosa = 0 to 23.1%, eating disorder not otherwise specified = 0.4 to 46.3%, binge eating disorder = 0.4 to 11.6%. Sixteen studies explicitly reported the percentage of individuals with anorexia and bulimia. Eleven studies reported the percentage of the sample meeting criteria for eating disorder not otherwise specified and six reported on binge eating disorder. Aside from binge eating disorder, no studies explicitly examined validity for any of the other newly included eating disorders in the DSM-5.

Table 1 Characteristics of Included Studies

Other demographic characteristics were not frequently reported across all studies and thus are not included in Table 1. Only four studies reported any information about race or ethnicity, and samples tended to be primarily Caucasian (57.2 to 87.9%). Studies also did not frequently include information about BMI. The six studies reporting average BMI found that it ranged from 21.98 to 28.1. Of these, four studies included samples with an average BMI in the normal range (i.e., between 18.5 and 24.9).

Quality Assessment

Summary index scores for the QUADAS-2 are depicted in Table 2. Risk of bias and applicability concerns were rated as low risk/concern (depicted as “+” signs in the table), high risk/concern (represented as “-” signs in the table), and unknown risk/concern (depicted as “?” signs in the table). Only two studies were rated as low across all risk of bias and applicability concern domains. Risk of bias was high within patient selection across five studies. Four of these studies used a case-control design while one recruited an at-risk sample as opposed to utilizing a random or consecutive sample. Risk of bias was high in four studies in the flow and timing domain, either because the SCOFF and reference standard were administered at different times (i.e., not sequentially) or because it was ambiguous whether the questionnaires were completed at the same time (as would be the case for mailed surveys). In general, risk of bias was low across studies in the reference standard domain. Applicability concerns were most prevalent in the patient selection domain, with 14 studies having high applicability concerns. This was often because the sample utilized was restricted demographically (e.g., only included females). Applicability concerns were high across five studies for the reference standard. In these studies, certain subgroups of patients were not given the reference standard or were excluded for unknown reasons.

Table 2 Study Quality Assessment

Diagnostic Accuracy

Diagnostic accuracy rates for each study are depicted in the forest plot in Figure 3 and the summary receiver operating characteristic (SROC) curve in Figure 4. Pooled sensitivity was 0.86 (95% CI, 0.78–0.91) and specificity was 0.83 (95% CI, 0.77–0.88). The area under the curve (AUC) was 0.91 (95% CI, 0.88–0.93). Heterogeneity was statistically significant for sensitivity (I² = 97.63; 95% CI, 97.17–98.09) and specificity (I² = 98.22; 95% CI, 97.91–98.54). Differences in study methodology or clinical characteristics of the sample may result in elevated heterogeneity. Heterogeneity may also be elevated when studies use different thresholds to define cases. This was not the case for the SCOFF, as the threshold effect was not significant (r = −0.21; p = 0.32).
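To make the heterogeneity values above more concrete, the following is a minimal sketch of the generic I² calculation, assuming the common formulation I² = max(0, (Q − df)/Q) × 100 based on Cochran's Q with inverse-variance weights. The MIDAS command may derive its estimates differently, and the inputs below are hypothetical values, not data from the included studies.

```python
def i_squared(estimates, variances):
    """Percentage of total variability attributed to between-study differences.

    estimates: per-study effect estimates (e.g., logit sensitivities).
    variances: the corresponding within-study variances.
    """
    if len(estimates) != len(variances) or len(estimates) < 2:
        raise ValueError("need at least two studies with matching variances")
    # Inverse-variance weights: more precise studies count for more.
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    if q <= 0.0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical study-level estimates with equal variances (not real data).
heterogeneous = i_squared([1.0, 2.0, 3.0], [0.1, 0.1, 0.1])
homogeneous = i_squared([2.0, 2.0, 2.0], [0.1, 0.1, 0.1])
```

Under this formulation, values close to 100 indicate that nearly all of the observed variability reflects differences between studies rather than sampling error.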

Figure 3 Forest plot of included studies.

Figure 4 Summary receiver operating characteristic (SROC) curve.

To address the significant heterogeneity found across studies, subgroup analyses were conducted to examine the impact of methodological (i.e., case-control vs non-case-control; interview vs questionnaire reference standard) and clinical characteristics (age, gender, location, and diagnosis) on diagnostic accuracy. Table 3 presents pooled sensitivity, specificity, and heterogeneity values for each subgroup. The diagnostic accuracy of the SCOFF was higher in case-control studies (p < 0.01), when an interview was used as a reference standard as opposed to a questionnaire (p = 0.05), and when the percentage of women in the sample was larger than the percentage of men (p < 0.01). Additionally, diagnostic accuracy was higher when risk of bias was high for patient selection (p < 0.01). Sensitivity and specificity were lower in studies which included individuals diagnosed with BED; however, this difference was not significant (p = 0.22). Of note, subgroup analysis did not explain the high overall heterogeneity of the included studies, as all subgroups had an I² value greater than 60%.

Table 3 Subgroup Analyses

The likelihood ratio scattergram (Fig. 5) shows the distribution of positive and negative likelihood ratios. The pooled positive likelihood ratio of 5.0 (95% CI, 3.6–6.8) suggests that the SCOFF is moderately helpful in detecting eating disorders. The negative likelihood ratio of 0.17 (95% CI, 0.11–0.27) suggests that the SCOFF is moderately helpful in ruling out the presence of an eating disorder.33, 34
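The relationship between the pooled accuracy estimates and the likelihood ratios above can be sketched as follows. This uses the standard definitions LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity; the pooled ratios reported in this review come from the meta-analytic model, so simple division only approximates them. The 5% pre-test probability is a hypothetical value chosen for illustration.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (positive, negative) likelihood ratios for a test."""
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Update a pre-test probability with a likelihood ratio via odds."""
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Pooled sensitivity and specificity as reported in this review.
lr_pos, lr_neg = likelihood_ratios(0.86, 0.83)

# Hypothetical pre-test probability of 5% (an assumption, not a study value).
after_positive = post_test_probability(0.05, lr_pos)
after_negative = post_test_probability(0.05, lr_neg)
```

Under that assumed 5% pre-test probability, a positive screen raises the probability of an eating disorder to roughly one in five, and a negative screen lowers it to under 1%.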

Figure 5 Likelihood ratio scattergram.

DISCUSSION

We conducted a meta-analysis of 25 validation studies on the SCOFF to determine whether this screen is a valid tool for identifying eating disorders in diverse settings and populations. Our in-depth examination of the SCOFF calls into question the effectiveness of this tool for eating disorder screening in primary care and community settings, with diverse populations, and with the full range of DSM-5 eating disorder diagnoses. This examination provides a critical context given that we also found in our study, as was found in a previous meta-analysis,35 that the SCOFF is an effective tool for identifying the presence of particular eating disorders (i.e., AN and BN) in the population for which it was initially developed (i.e., young women with eating disorder symptoms).

The purpose of screening is to capture the range of pathological eating and identify cases that might not be identified by other means. The SCOFF was originally developed and subsequently validated several times using case-control study designs. Case-control studies dramatically limit samples to a specific target population (i.e., cases and matched controls) and do not capture the diversity and range of disorders in the general population. Additionally, validity data from case-control studies may artificially inflate the efficacy of screening measures and lead to erroneous conclusions about the utility of the measure in the general population.6 As expected, our analyses revealed that the highest levels of sensitivity were found in case-control studies including young women diagnosed with AN and BN. These findings are important, as the higher sensitivity and specificity in case-control samples indicate that the SCOFF is a highly robust screening measure when patients are at risk for AN and BN. Conversely, studies with lower sensitivity primarily recruited from community samples and included the highest reported rates of BED. Sensitivity was also lower in locations where rates of obesity tend to be higher (e.g., North America).

Comparing demographic variables across studies shows that while the SCOFF has been validated numerous times since its development, it is often validated in samples highly similar to the population in which it was initially validated (i.e., young women with AN and BN). In fact, of the 25 studies reviewed, more than half utilized a predominantly or entirely female sample. Of the studies that did include males, only three were conducted using adults. In addition, many studies did not report on important demographic and clinical characteristics including certain eating disorder diagnoses (e.g., BED), race, and BMI. Among those that did report on these characteristics, there was evidence that the samples often did not reflect the racial and weight diversity seen across DSM-5 eating disorders outside of AN and BN. In addition, only six studies explicitly examined the efficacy of the SCOFF for identifying BED and none examined efficacy for any of the other specified eating disorders in DSM-5. Reflecting the lack of demographic variability in the samples across the 25 studies, applicability concerns were high in many studies on the QUADAS-2 risk of bias tool in the patient selection domain. Given these high applicability concerns, it is difficult to draw conclusions about the appropriateness of using the SCOFF to screen for eating disorders, with the exception of young women at risk for AN and BN.

Compared with a prior systematic review on this topic conducted by Botella and colleagues,35 the present systematic review provides a more in-depth and comprehensive analysis and, most importantly, includes an assessment of the quality of included studies. Additionally, ten new validation studies had been published and were included in our analysis for a total of 25 validation studies. As per PRISMA-DTA guidelines, this review also includes subgroup analyses. There were, however, several limitations to the current review. First, the literature search was limited to PubMed and Cochrane Library databases. Other databases were referenced in conducting initial searches; however, they were not included in the final search. In general, systematic reviews should include a range of databases in the final search strategy so that all relevant articles are captured. The search was also limited to articles written in or translated into English. These search limitations could have resulted in missing articles which might otherwise have been included. That said, the reviewers did not encounter any articles that included validation of the SCOFF questionnaire and were inaccessible in English. This review was also limited to examining the validity of the SCOFF. A more comprehensive review of all eating disorder screening measures might have provided additional information regarding eating disorder screening; however, the SCOFF is the screening measure with the most extensive validity data and is frequently used in clinical practice. Another limitation was that we were unable to conduct subgroup analyses for other potentially relevant clinical characteristics (e.g., BMI, race, and ethnicity) as these variables were infrequently reported in the validation studies.

Conclusions

The current review was conducted to address concerns about the use of the SCOFF as a primary care screener for DSM-5 eating disorders, including BED and other specified eating disorders. Findings revealed that the psychometrics of the SCOFF are virtually unknown for the full range of DSM-5 eating disorder diagnoses and for diverse populations. The present review suggests that the SCOFF is a highly sensitive screening measure for young women at risk for AN and BN, but analyses and quality assessment of studies raised concerns about the generalizability and reliability of these results for other eating disorder diagnoses. Currently, there is insufficient evidence to recommend the use of the SCOFF for large-scale screening in primary care and diverse community settings. This review identifies the need for the development of a new screening tool, or multiple tools, for validation for the full range of DSM-5 eating disorder diagnoses in heterogeneous samples.