The Eating Disorder Inventory (EDI) is a self-report questionnaire widely used both in research and in clinical settings to assess the symptoms and psychological features of eating disorders. The original version of the EDI was developed in 1983 by Garner, Olmsted, and Polivy comprising three subscales measuring eating disorder symptoms, i.e., drive for thinness (DT), bulimia (B) and body dissatisfaction (BD), and five more general psychological features related to eating disorders, i.e., ineffectiveness (IN), perfectionism (PE), interpersonal distrust (ID), interoceptive awareness (IA) and maturity fears (MF). In 1991, the EDI was enlarged from 64 to 91 items to measure additional general features related to asceticism (AS), impulse regulation (IR) and social insecurity (SI) (Garner 1991).

The EDI-2 discriminates reliably between patients and non-clinical controls, and to some degree between patient groups (Garner 1991; Lee et al. 1998; Machado et al. 2001; Nevonen et al. 2006; van Strien and Ouwens 2003). However, cross-cultural differences have been detected (e.g., Waldherr et al. 2008; Steinhausen et al. 1992; Kordy et al. 2001; Lee et al. 1998; Tachikawa et al. 2004; Podar and Allik 2009; Clausen et al. 2009), which often manifest themselves as a lack of psychometric measurement invariance across cultures. Both the original and the second version (EDI-2) have been used worldwide to screen for eating disorders in the general population, to measure treatment effect and outcome, as well as in routine clinical evaluations.

The EDI-3 represents an expansion and improvement of the earlier versions of the EDI. It consists of the same 91 questions as the EDI-2, including the same three subscales of eating disorder symptoms. The reliability of these index scores collected from eating disorder patients appears excellent (Cronbach’s α = .90–.97; test–retest r = .98) (Garner 2004; Wildes et al. 2010). Based on criticism regarding the factor structure of the EDI-2 (Limbert 2004; Muro-Sans et al. 2006; Welch et al. 1988) new factor analyses of the sum scores yielded new subscales more congruent with recent theory and research on eating disorders (Garner 2004). Thus the EDI-3 consists of the more general, though eating disorder relevant psychological trait subscales low self-esteem (LSE), personal alienation (PA), interpersonal insecurity (II), interpersonal alienation (IA), interoceptive deficits (ID), emotional dysregulation (ED), perfectionism (P), asceticism (AS) and maturity fear (MF). Also, three response style indicators have been added (Garner 2004). Moreover, the new version uses the six-choice format of the EDI-2, but scores were recalibrated from a 0–3 to a 0–4 format to expand the range of summative scores to improve the psychometric properties with non-clinical populations. This generally increases the variance of item scores, and possibly changes the covariance between items.

The EDI-3 revision yields adequate convergent and discriminant validity (Cumella 2006). Since 2004 many studies have used the new EDI-3 version, yet the present study is to our knowledge the first one to independently test the factor structure, the internal consistency as well as discriminative and cross-cultural validity. Furthermore, it is claimed that the factor structure of the EDI-3 captures important clinical aspects of the psychopathology of eating disorders; yet at present, no independent test of this factor structure has been conducted. A confirmatory factor analysis approach was taken in the present study to evaluate the factor structure. In the EDI-3 manual, Garner (2004) proposes to summarize the 12 primary EDI factors in two second-order factors representing: 1) a general risk factor accounting for the three primary factors drive for thinness, body dissatisfaction, and bulimia, and 2) a general psychological disturbance factor accounting for the remaining nine primary factors. The purpose here was to examine the fit of this model. In addition, it was compared with four alternative models in order to evaluate the suitability of Garner’s model. First, it was compared with a base model specifying all factors to be independent (a null-correlated model), which it obviously should outperform. Then it was compared with a simpler second-order model specifying only one general factor, and thereafter with a more complex second-order model specifying three general factors. In the latter model, the nine primary psychological disturbance factors were split further into two disturbance factors: one for the factors emotional dysregulation, perfectionism, asceticism and interoceptive deficits (representing rigidity and inflexibility), and another for the factors maturity fear, low self-esteem, interpersonal insecurity, personal alienation and interpersonal alienation (representing insecurity and estrangement).
Finally, it was compared with a correlation model allowing all 12 factors to covary. Correlation models generally fit better than second-order factor models, but at the expense of being much more complex. If the fit of the second-order model is not substantially worse than that of a correlation model, the former is to be preferred. In addition, a random model was specified to check the trustworthiness of the preceding factor models. All factor models should of course outperform the random model.

In many epidemiological studies, cut-off scores on, for instance, the drive for thinness subscale have been used for screening purposes, but the psychometric rationale and general empirical support for such cut-off scores may be questioned. Another important objective of the present study was to examine the sensitivity and the specificity of the EDI-3 subscales.

Using a large patient sample and a representative sample from the general population, the present study aims 1) to establish national norms for the EDI-3 and to compare them with the US and international norms provided in the EDI-3 manual, 2) to test the internal consistency of the subscales, 3) to test the primary and second-order factor structure of the EDI-3, and 4) to examine the diagnostic accuracy for each subscale by estimating the sensitivity and the specificity of cut off scores.

Method

Subjects and Procedure

Female patients were recruited from the eating disorder centre at the Aarhus University Hospital in Denmark. Recruited patients were mainly given outpatient treatment. The inclusion criteria were age ≥ 18 (M = 24.8, SD = 5.7, range 18–54 years), and a full or partial DSM-IV diagnosis of anorexia nervosa (AN) or bulimia nervosa (BN) determined by the Eating Disorder Examination (EDE) (Fairburn and Cooper 1993). Partial AN/BN were defined as moderate to severe eating disorders with incomplete fulfilment of diagnostic criteria, i.e., not having amenorrhea, or having eating disorder symptoms with a lower frequency or shorter duration. This is similar to the DSM category “Eating Disorders Not Otherwise Specified” (EDNOS). Patients with a binge eating disorder, a BMI ≥ 30 or who missed two or more questions in at least one of the EDI subscales were excluded. All patients were examined for comorbid psychiatric and medical disorders, the results of which are reported elsewhere (Clausen 2008; Godt 2008). Comorbidity was not an exclusion criterion if the eating disorder was severe enough to stand out as the main diagnosis. Of the active sample (N = 561), 84 had AN (56 with the restricting subtype and 28 with the bulimic subtype), 202 had BN, and 275 had partial AN/BN.

Non-clinical controls (N = 2000) comprised women aged 18–30 years, selected from the Danish Civil Registration System to be representative of the Danish female population. They were invited, by letter, to complete the Danish version of the EDI and additional questions on paper or through the internet. A total of 935 women responded, and 57 were excluded because they missed more than one question in one of the subscales, leaving a final sample of 878 respondents (44% of the 2000 invited). Among the included controls, mean age was 25.8 (SD = 3.6), mean BMI was 23.3 (SD = 4.5), 16 (1.8%) had BMI < 17.5 and 51 (5.8%) had BMI > 30. No interviews were performed to determine formal eating disorder diagnoses and no controls were excluded because of possible eating disorders.

In agreement with Psychological Assessment Resources, Inc., the official Danish version of the EDI-3 was translated and back-translated to ensure comparability. The study was approved by the ethics committee of the Region of Central Jutland and the Danish Data Protection Agency.

Statistical Analyses

Descriptive, Inferential and Effect Size Statistics

Statistical analyses were carried out using SPSS version 15.0. Group comparisons were performed using t-tests. Effect sizes were calculated according to Cohen (1988), with a Cohen’s d of .80, .50 and .20 indicating a strong, a medium and a small effect, respectively. One-way analyses of variance (ANOVA), with Tukey’s post hoc tests, were used to compare diagnostic groups. Cronbach’s alpha was used to estimate the internal consistency of the subscales, using values ≥ .70 as the criterion for acceptable consistency.
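The two descriptive statistics named above can be illustrated with a minimal sketch. It assumes the pooled-standard-deviation form of Cohen’s d and the standard variance-based formula for Cronbach’s alpha; the function names and the example data are hypothetical and do not reproduce the SPSS procedures used in the study.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) +
                  (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

def effect_label(d):
    """Label an effect size with the thresholds cited from Cohen (1988)."""
    d = abs(d)
    if d >= 0.80:
        return "strong"
    if d >= 0.50:
        return "medium"
    if d >= 0.20:
        return "small"
    return "negligible"

def cronbach_alpha(item_scores):
    """item_scores: one list of k item scores per respondent."""
    k = len(item_scores[0])
    # Variance of each item across respondents.
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of the subscale sum score.
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical subscale sum scores for two small groups.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(round(d, 2), effect_label(d))

# Hypothetical item-level data: three respondents, two items.
alpha = cronbach_alpha([[1, 1], [2, 2], [3, 3]])
print(alpha >= 0.70)  # criterion for acceptable consistency
```

Alpha is computed from item-level scores, whereas d is computed from subscale sum scores, which is why the two functions take differently shaped inputs.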

Confirmatory Factor Analyses

The confirmatory factor analyses were run in LISREL v8.80. The fit of the different factor models was evaluated according to criteria from Hu and Bentler (1999) and Marsh et al. (2004), using RMSEA values < .06 and CFI values > .95 to judge a model as an acceptable approximation to real data. Chi-square tests were not used to judge absolute fit of the models, but rather to compare which factor models best reproduced the observed correlation matrix. As most models were non-nested, they were compared using Akaike’s Information Criterion (AIC), which is based on the chi-square index but adds a penalty for more complex models. Lower values indicate a better fitting model.
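To make the fit criteria concrete, the following sketch computes the three indices from chi-square output. It assumes the commonly used textbook formulas (RMSEA from the chi-square, its degrees of freedom and the sample size; CFI relative to an independence model; AIC as chi-square plus twice the number of free parameters). LISREL may differ in detail, for example in using N rather than N − 1, and all input values below are hypothetical rather than taken from Table 4.

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation (textbook form, using N - 1)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_null, df_null):
    """Comparative fit index relative to the independence (null) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_null = max(chi2_null - df_null, d_model)
    return 1.0 if d_null == 0 else 1.0 - d_model / d_null

def aic(chi2, n_free_params):
    """Chi-square based AIC: misfit plus a penalty of 2 per free parameter."""
    return chi2 + 2 * n_free_params

def acceptable_fit(rmsea_value, cfi_value):
    """Cut-offs cited in the text: RMSEA < .06 and CFI > .95."""
    return rmsea_value < 0.06 and cfi_value > 0.95

# Hypothetical comparison of two non-nested models fitted to the same sample.
model_a = aic(chi2=200.0, n_free_params=30)
model_b = aic(chi2=190.0, n_free_params=40)
print("prefer A" if model_a < model_b else "prefer B")
print(acceptable_fit(rmsea(200.0, 100, 401), cfi(200.0, 100, 2100.0, 100)))
```

In the hypothetical comparison, model B has the lower chi-square but the higher AIC, which illustrates how the penalty term can favour the more parsimonious model.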

As the EDI scores were negatively and heavily skewed, especially in the non-clinical control sample, item scores were normalized in PRELIS to reduce non-normality. This resulted in a less severe skew in the control sample (Mardia’s multivariate kurtosis dropped from 110.9 to 97.7), but it did not affect the non-normality in the patient sample (from 70.4 to 70.3). Adding a Satorra-Bentler correction matrix to correct the standard errors was not possible due to the large number of items. Hence, comparisons of factor models were done within rather than between groups.

Analyses of Sensitivity and Specificity

The possibility of using EDI-3 cut off scores to determine a clinical diagnosis of eating disorder was evaluated by conducting an analysis of sensitivity and specificity. Sensitivity indicates the proportion of true positives correctly identified (a sick patient diagnosed as sick), while specificity indicates the proportion of true negatives correctly identified (a healthy person not receiving a diagnosis). These proportions change as the EDI cut off score is moved up or down, and may be expressed as a ROC curve describing the diagnostic discriminatory ability across the whole range of EDI cut off scores. At one point of the ROC curve, the sum of the sensitivity and the specificity is at a maximum. Youden’s index (sensitivity + specificity − 1) was calculated to identify this point (Youden 1950), which is the point of the curve lying farthest away from the diagonal (chance) line. As the maximum Youden’s index turned out to be relatively stable over a small range of cut off scores, two cut off scores are presented: a) primarily, the cut off score tied with the maximum Youden’s index, and b) the cut off score for a Youden’s index being .02 points lower than the maximum value (thus having almost comparable discrimination properties, but with different sensitivity and specificity estimates). A change window of .02 was used as Youden’s index dropped markedly if the cut off score was moved one point further. The area under the curve (AUC) values indicate how well a particular EDI subscale detects an eating disorder, i.e., its discriminatory ability. AUC is reported for each subscale within each diagnostic group. A non-discriminating test has an AUC of .5, while a perfectly discriminating test has an AUC of 1.0. A common convention is that AUC > .70 is fair, > .80 is good, while > .90 is excellent. A non-parametric method of constructing standard errors was used.
As the EDE interview (Fairburn and Cooper 1993) was used for the patient sample only, the estimates of diagnostic accuracy are inflated if left uncorrected. This was solved by assigning a particular diagnosis of eating disorder at random to individuals in the control sample according to the prevalence rate for that particular disorder. In this study we used the generally accepted prevalence rates from two-stage community studies of 0.3% for AN, 1% for BN (Hoek and van Hoeken 2003), and 2.4% for partial AN and BN (Machado et al. 2007).
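The cut off selection described in this section can be sketched as follows. The sketch assumes that a score at or above the cut off counts as a positive test, computes Youden’s index as sensitivity + specificity − 1, and estimates the AUC non-parametrically as the proportion of patient-control pairs in which the patient scores higher (ties counted as one half). The function names and scores are hypothetical, and the prevalence-based correction of the control sample is not included.

```python
def roc_points(patient_scores, control_scores):
    """Return (cutoff, sensitivity, specificity) for every observed score.
    A score >= cutoff is treated as a positive test (an assumption)."""
    points = []
    for c in sorted(set(patient_scores) | set(control_scores)):
        sens = sum(s >= c for s in patient_scores) / len(patient_scores)
        spec = sum(s < c for s in control_scores) / len(control_scores)
        points.append((c, sens, spec))
    return points

def youden(point):
    """Youden's index J for one (cutoff, sensitivity, specificity) triple."""
    _, sens, spec = point
    return sens + spec - 1

def best_cutoff(points):
    """Cut off with the maximum Youden's index."""
    return max(points, key=youden)

def near_optimal_cutoffs(points, window=0.02):
    """All cut offs whose J lies within `window` of the maximum J."""
    j_max = youden(best_cutoff(points))
    return [p[0] for p in points if youden(p) >= j_max - window]

def auc(patient_scores, control_scores):
    """Non-parametric AUC: share of patient-control pairs ranked correctly."""
    wins = sum((p > c) + 0.5 * (p == c)
               for p in patient_scores for c in control_scores)
    return wins / (len(patient_scores) * len(control_scores))

# Hypothetical subscale sum scores.
patients = [8, 10, 12, 15]
controls = [1, 2, 3, 4, 9]
points = roc_points(patients, controls)
cutoff, sens, spec = best_cutoff(points)
print(cutoff, sens, spec, auc(patients, controls))
```

Reporting every cut off inside the window, rather than only the maximum, mirrors the choice to present an alternative cut off with nearly the same discrimination but a different balance of sensitivity and specificity.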

Due to the high ratio of patients to controls in the present study, yielding strongly upwardly biased base rates, positive and negative predictive values are not reported. Instead, likelihood ratios (LR) are reported, indicating the ratio of the probability of a positive test result in diseased individuals (true positives) to the probability of a positive test result in non-diseased individuals (false positives). An unbiased post-test probability of having an eating disorder, given a specific cut off score, can be calculated by multiplying the LR by the prevalence odds and converting the resulting post-test odds back to a probability (Akobeng 2007), or it can be read from a Fagan’s nomogram.
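The conversion from a likelihood ratio to a post-test probability can be written out as a short worked example. The sensitivity and specificity values below are hypothetical; the 1% prevalence is the BN rate cited above from Hoek and van Hoeken (2003).

```python
def positive_lr(sensitivity, specificity):
    """Likelihood ratio of a positive test: true-positive rate / false-positive rate."""
    return sensitivity / (1 - specificity)

def post_test_probability(lr, prevalence):
    """Bayes' rule in odds form: post-test odds = LR x pre-test (prevalence) odds."""
    pre_odds = prevalence / (1 - prevalence)
    post_odds = lr * pre_odds
    return post_odds / (1 + post_odds)

# Hypothetical cut off with sensitivity .90 and specificity .90 (LR+ of about 9),
# applied in a population with a BN prevalence of 1%.
lr = positive_lr(0.90, 0.90)
print(round(post_test_probability(lr, 0.01), 3))  # roughly 0.083
```

Even with a seemingly strong likelihood ratio, the post-test probability remains below 10% at this prevalence, which shows why predictive values estimated from a patient-heavy sample would be misleading.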

Results

National Norms, and Comparisons with the US and International Norms

All EDI-3 subscales discriminated significantly (p < .001) and strongly (Cohen’s d ranging from .71 to 2.00) between patients and non-clinical controls (see Table 1).

Table 1 Subscale sum scores for patients and normal controls

Table 2 shows that the mean scores of the EDI-3 subscales differed between the three diagnostic groups. Differences between all three groups were found only on the B subscale. However, each diagnostic group differed from the others as follows. AN patients scored higher on the MF subscale than patients with BN (d = .33) and partial AN/BN (d = .34). BN patients had higher scores than patients with AN and partial AN/BN on the DT (AN: d = .68, AN/BN: d = .75), B (AN: d = 1.74, AN/BN: d = 1.39), and BD (AN: d = .96, AN/BN: d = .74) subscales. BN patients displayed higher scores than patients with partial AN/BN on the LSE subscale (d = .33). Patients with partial AN/BN scored lower than AN and BN patients on the PA (AN: d = .22, BN: d = .23), ID (AN: d = .33, BN: d = .38) and AS (AN: d = .33, BN: d = .33) subscales. The three diagnostic groups were comparable on the subscales II, IA, ED, and P.

Table 2 One-way analysis of variance for differences between diagnostic groups, significance of F and Tukey’s post-hoc test

Overall, Danish control norms were significantly lower than international norms (see Fig. 1) and especially US norms, but the effect sizes were small on the subscales LSE (d = .21), II (d = .27), IA (d = .47), and MF (d = .38), whereas the remaining differences were even smaller (d < .20). Differences between Danish and US norms were significant on all subscales (p < .001); effect sizes were large (d = .99) for the P subscale, medium for the II (d = .55) and IA (d = .54) subscales, and small (.20 < d < .50) for the DT (d = .36), B (d = .43), BD (d = .41), PA (d = .46), ID (d = .29), ED (d = .38), AS (d = .49), and MF (d = .48) subscales.

Fig. 1 Norms of Danish controls vs. international controls. Note. Gray area displays the international norms (M ± 1 SD) (Garner 2004) and the error bars the Danish norms (M ± 1 SD)

Figure 2 illustrates that, compared to international norms, Danish patients display significantly lower scores (p < .01) on all but two of the general subscales (i.e., the ID and ED subscales). However, effect sizes were small for the LSE (d = .16) and AS (d = .18) subscales, small-to-moderate for the PA (d = .37), II (d = .27), P (d = .20), and MF (d = .26) subscales, and medium for the IA (d = .50) subscale. In the manual (Garner 2004), means for the subscales DT, B, and BD are only reported for sub-groups of patients and not for the total patient population; they are therefore not included in Fig. 2.

Fig. 2 Norms of Danish patients vs. international patients. Note. Gray area displays the international norms (M ± 1 SD) (Garner 2004) and the error bars the Danish norms (M ± 1 SD)

Compared to US norms, five of the nine general subscales were lower in the Danish sample, with a medium effect size on the P subscale (d = .53), a small effect size (d = .20) on the II subscale, and a less than small effect size on the subscales PA (d = .19), IA (d = .16), and AS (d = .15). The ID and ED subscales yielded higher scores in the Danish sample (i.e., ID d = .23, and ED d = .16). No significant differences were found on the LSE and MF subscales. Moreover, on the eating disorder specific subscales (i.e., DT, B, and BD) the mean scores for the restrictive (AN-R) and bulimic (AN-B) subtypes did not differ substantially from international and US norms. Danish AN-R patients scored higher on the B subscale compared to international and US norms, but the effect size was small (d = .23 and .32). For BN patients, Danish norms on the BD subscale were significantly higher than US (d = .30) and international norms (d = .41). Compared to international and US norms, Danish patients with partial AN/BN had lower BD (d = .31 and .35) and DT scores (d = .40 and .51).

Reliability of the EDI-3 Subscale Sum Scores

The internal consistency of the item scores was satisfactory for patients as well as controls (see Table 3), except for the AS subscale for controls. Seven of 12 subscales for patients, and eight of 12 subscales for controls, showed an α value > .80.

Table 3 Reliability estimates (Cronbach’s alpha) of EDI-3 subscale sum scores for patients and normal controls

Confirmatory Factor Analyses

The confirmatory factor models were examined separately for the patient and the non-clinical control sample as there is reason to expect that healthy and mentally ill individuals may attach somewhat different meanings to the same set of questions. Items were specified to load on twelve primary latent factors, according to the manual (Garner 2004). However, different ways of specifying the relationships between the factors were tested. The base model (M1) specified 12 independent or uncorrelated factors, which fitted the data least well, as expected (see Table 4 for model comparisons). The second model (M2) specified a single general factor explaining the covariance among the 12 primary latent factors, which improved model fit according to all fit indexes. A third model (M3) defining two second-order factors, one for the nine psychological factors and one for the three risk factors (bulimia, drive for thinness and body dissatisfaction), fitted the data better in terms of χ2 and the AIC index. A tentative alternative model (M4) specifying one general risk factor and two general psychological disturbance factors (one latent factor for emotional dysregulation, perfectionism, asceticism and interoceptive deficits, and another latent factor for the remaining disturbance factors) slightly improved model fit according to AIC. The best fitting model was, however, a 12 factor model (M5) allowing all factors to correlate freely. To test the trustworthiness of the preceding model specifications for the covariance data, a random model (M6) was specified in which the 90 EDI items loaded on the 12 factors in an unsystematic fashion; it produced a poorer absolute and relative fit, as expected.

Table 4 Comparison of factor models in eating disorder patients and normal controls

In summary, the correlated 12 factor model received the best support. However, a more parsimonious second-order model, which has a much simpler factor structure than the correlation model, is to be preferred if the worsening of fit is not substantial, which it was not. The difference in fit between the second-order models M2-M4 was negligible in terms of the RMSEA and the CFI. Two observations speak for favouring model M3. Firstly, the improvement in fit was larger when moving from model M2 to M3 than from model M3 to M4, especially in the control sample. Secondly, an examination of the factor correlations among the three general factors indicated an extremely high correlation between the two psychological factors in model M4 (.88 and .95 in the samples, respectively), while the correlations between the risk and the combined psychological factors in model M3 were substantially lower (.54 and .60). Taken together, the results indicate that eating problems should be summarized in two main scores, one representing a risk factor and another representing psychological disturbance, as proposed by the author (Garner 2004). The factor loadings from the second-order factor analysis of model M3 are displayed in Table 5. At the same time, the fit of the preferred model (M3), as well as the best model (M5), was not great according to the RMSEA index. Although the RMSEA was lower than .06 as recommended by Hu and Bentler (1999), hence indicating a reasonable approximation of the model to the observed data, there is room for improvement. The relatively large chi-square values compared to the degrees of freedom are not reassuring either. Following an inspection of the modification indices, the mediocre fit appears related to several items showing highly correlated residuals as well as significant factor side-loadings. Hence some of the EDI items do not have adequate psychometric properties.

Table 5 The factor loadings for the second order two factor model (M3 in Table 4) with risk and psychological disturbance as general factors accounting for the 12 primary factors. The two general factors were allowed to correlate

Sensitivity and Specificity

ROC curves for all EDI-3 subscales were plotted for a diagnosis of AN, BN and partial AN/BN (see Figs. 3, 4 and 5). In each figure, the subscales with the highest AUC (area under the curve) value are listed first. The figures show that the interoceptive deficits subscale is the best predictor across all diagnostic groups, followed by low self-esteem and personal alienation. The bulimia subscale comes sixth overall, but is an excellent predictor of a diagnosis of BN with high sensitivity and specificity estimates. Table 6 provides an overview of sensitivity, specificity, likelihood ratios and diagnostic accuracy of the three best and the worst predictors within each diagnostic group. The cut off score for deciding these estimates was based on the highest value of Youden’s index. As Youden’s index changed minimally for several of the subscales when the cut off was either lowered or increased, alternative cut off scores are also reported in the direction with the smallest change in Youden’s index. Generally, increasing the cut off increases the specificity and reduces false positives, but at the cost of increasing the number of false negatives (patients not detected), which represents a more serious error. Most ROC curves across the diagnostic groups are quite parallel over all levels of cut off scores, but with one notable exception. As shown in Fig. 3, the subscale of body dissatisfaction is the worst of all subscales in overall diagnostic accuracy of AN. However, at low cut off scores (<6) it is clearly the most sensitive subscale in detecting true cases of AN, though performing poorly with regard to specificity (<.22).

Fig. 3 ROC Curves for a Diagnosis of Anorexia. Note. AUC = Percent of total area under ROC curve. A low cut-off score starts in the right upper corner, going down the diagonal

Fig. 4 ROC Curves for a Diagnosis of Bulimia. Note. AUC = Percent of total area under ROC curve. A low cut-off score starts in the right upper corner, going down the diagonal

Fig. 5 ROC Curves for a Diagnosis of Partial AN/BN. Note. AUC = Percent of total area under ROC curve. A low cut-off score starts in the right upper corner, going down the diagonal

Table 6 Sensitivity, specificity, likelihood ratios and diagnostic accuracy of the three best and the worst EDI-3 subscales for each diagnostic group

Conclusion

Overall, the new version of the Eating Disorder Inventory (EDI-3) stands out as a psychodiagnostic assessment tool that may be used to capture eating problems. Apart from one subscale with a medium effect size difference, all differences between the patient and the non-clinical control group yielded high effect sizes, which were even slightly higher than when using the EDI-2 (Clausen et al. 2009). Thus the discriminative validity is good, as is the case for internal consistency (Table 3). The latter is even better in this study than in the original development of the EDI-3 (Garner 2004). Thus, one argument for creating a new EDI version (i.e., a more consistent measure) is supported by our findings.

Through the present large-scale study, national norms have been successfully established. For practical purposes, the implications from this study are that outside the US, the international norms (Garner 2004) may be used for screening purposes when national norms are lacking. On the other hand, the lack of national clinical norms may lead to a more valid comparison to US than to international data. This certainly creates practical problems in doing epidemiological research, and may point to variations in how a psychological phenomenon (e.g., eating disorder problems) appears in various cultures and populations, as well as to psychometric challenges in increasing the construct validity.

The confirmatory factor analyses by and large supported the grouping of eating problems in two general factor scores, one assessing a risk component and the other assessing associated psychological disturbances. The model fit in the present study was actually better than what was presented in the EDI-3 manual (Garner 2004). One reason for this may be that Garner based his analyses on twelve subscale sum scores rather than 90 item scores, as was done in the present study. Still, the model fit was at the upper end of what is regarded as a minimally acceptable model approximation. This may be explained by the fact that several items had poor psychometric properties according to the modification indices provided by LISREL, showing large error covariances and significant factor side-loadings. These items are thus ambiguous indicators of eating problems, and should be revised or removed in a future version of the EDI-3. Identifying these items requires, however, an extensive item-level analysis followed by a cross-validation on a holdout sample. This is a task for another paper.

In our study, the estimated sensitivity and specificity make the bulimia subscale an excellent predictor of a BN diagnosis. Compared to the EDI-2, however, the EDI-3 version of this subscale contains only one new item. The overall purpose of the EDI-3 was nevertheless to compose subscales with a conceptual content more congruent with domains identified by modern thinking about the nature of eating disorders (Garner 2004). The ROC analyses support the success of this purpose in the sense that the subscale interoceptive deficits is the best predictor across all diagnostic groups, followed by low self-esteem and personal alienation. Previous studies have also found interoceptive awareness along with the three eating disorder specific subscales to discriminate between eating disorder patients and psychiatric controls (Nevonen et al. 2006; Schoemaker et al. 1997). Also, interoceptive issues are related to other psychological constructs of eating disorders like depression, perfectionism, and self-directedness (Fassino et al. 2004). Hence, interoceptive deficits stand out as a concept with a high discriminative and construct validity related to eating disorders. An important implication from the present findings is that the current use of the drive for thinness subscale as a screening tool in epidemiological studies is clearly not warranted any more. While people scoring high on drive for thinness may do this for good as well as bad reasons even unrelated to the pathology of eating disorders, disturbance in the accuracy of perception or recognition of bodily states is an important pathognomonic sign of the specific eating disorder psychopathology, commonly seen as a failure to recognize signs of hunger (Bruch 1962). As also noted by Bruch (1962), the all-pervading sense of ineffectiveness in patients with AN may be well captured by the EDI-3. This kind of ineffectiveness may reflect a personality development attributed to a failure of confirmation of child-initiated behaviour.

As the ROC analyses indicated that the EDI-3 is highly suitable for screening purposes, a future study aimed at finding the optimal items for screening purposes is clearly indicated as well. The present study suggests that the current version of the EDI has equally good, if not better, psychometric quality compared to what the author (Garner 2004) has reported. One caveat of performing new factor analyses on an item level to identify psychometrically poorly working items is that removal (or revision) of items will affect the current estimates of sensitivity and specificity.

The strength of the present study is the use of a large control sample of women stratified from the general population. This stands in contrast to the common practice of using smaller student samples as controls, where the questionable representativity of the general population may deflate the external validity of the findings. Using a large population sample creates on the other hand problems in terms of case detection. Another caveat of the study is that the recording of medical complications, notably in the anorexia nervosa subsample, was incomplete, which may increase the risk of inflated scores due to the impact of malnutrition. On the other hand, this problem is relevant for less than 10% (N = 561) of the active sample.