Introduction

The prevalence and impact of eating disorders (EDs) continue to increase worldwide [2, 15, 32, 73]. Currently, the Diagnostic and Statistical Manual of Mental Disorders [1] outlines the criteria for several disorders, including but not limited to anorexia nervosa (AN), bulimia nervosa (BN), binge eating disorder (BED), and other specified feeding and eating disorder (OSFED). The functional impact of these disorders has been well established, with existing literature highlighting that EDs are associated with higher medical and psychiatric comorbidity, reduced quality of life and increased risk of mortality [2, 90, 91]. Despite the existence of various treatment models, and some evidence of increased treatment seeking [73], treatment outcomes are varied, and treatments typically achieve only modest gains [92, 93]. As such, examining the key components implicated in both the development and maintenance of EDs remains critical.

Theoretical ED models emphasise the importance of ED relevant cognitions as both development and maintenance factors [21, 27, 94]. Cognitions implicated in the development and maintenance of EDs include thoughts, expectations, assumptions, and conditional and unconditional beliefs. From a transdiagnostic perspective, Fairburn et al.’s [27] cognitive-behavioural model of EDs highlights the role of over-evaluation of eating, weight, shape and their control, and of core low self-esteem, as well as emphasising the role of unconditional, often pervasive, and negative views of the self (e.g., negative, self-referent beliefs [‘I am unlovable’]). Core low self-esteem and negative self-beliefs contribute to an increasingly dysfunctional schema for self-evaluation and thereby contribute to the over-evaluation of eating, weight, shape, and their control. The over-evaluation of these features and their importance is often central in the initial emergence and maintenance of ED behaviours, such as restriction, purging and binge eating. Cognitive-behavioural therapy for EDs (CBT-E), the transdiagnostic treatment derived from this theory, has been shown to be an effective and efficacious treatment option for a range of ED diagnoses, in adults and adolescents [3, 24].

The schema-focussed model of EDs [94] clearly delineates the role of unconditional core beliefs or schema level representations in the development of different eating pathology. It suggests that in AN, schema compensation occurs to prevent experiencing negative affect, resulting in restrictive eating behaviours, whereas avoidance of negative affect generated by the activation of negative core beliefs results in more bulimic-type pathology [94]. Schema content may include conditional and unconditional beliefs about the self, others, or the world across several dimensions and schema domains. For example, Waller et al. describe the relationship between unconditional beliefs regarding the self (i.e., self-referent beliefs, such as ‘I am worthless’), others (e.g., ‘Others are judgemental and harsh’), and possible maladaptive schemas relating to these beliefs (e.g., ‘abandonment’ or ‘unrelenting standards’). Narrative and systematic reviews of schemas and schema therapy outcomes have indeed highlighted associations between pronounced early maladaptive schemas (EMS) and EDs [52, 70], and indicated that schema therapy interventions show some promise for treating complex eating presentations [52].

Additionally, several disorder specific models highlight the importance of core, self-referent beliefs, as well as other types of ED cognitions. The cognitive model of bulimia nervosa suggests that maladaptive, negative self-beliefs (e.g., ‘I am a failure’ [21]) act as a predisposing factor, leading to increased negative automatic thoughts (NATs), which in turn serve to further reinforce these beliefs. The model also outlines the role of permissive, positive and negative beliefs about eating in the maintenance of binge eating and purging behaviours. These beliefs can be understood as types of meta-cognitive beliefs, that is, a set of higher order beliefs reflecting understanding, awareness and interpretation of one’s own thought processes [97]. For example, a positive meta-cognitive belief about eating might be ‘eating helps me to control my emotions’ [13]. An integrated cognitive-behavioural model of binge eating similarly identifies the critical roles of both core low self-esteem (negative core beliefs about the self) and eating beliefs (that is, meta-cognitive beliefs) in the development and maintenance of binge eating [11].

Models by Fairburn et al. [27] and Cooper et al. [21] both emphasise the role of NATs that reflect the over-evaluation of eating, weight, shape and their control in the maintenance of EDs. For example, negative thoughts about food and eating may be statements such as ‘I hate that I like to binge’, whereas negative thoughts about weight and shape may include content such as ‘I’ll gain a huge amount of weight’ or ‘I think my stomach is too big’. In accordance with previously described models [11, 21, 27], these negative and often automatic thoughts serve to maintain ED cycles by increasing negative affect or impacting emotional regulation, which in turn reinforces ED behaviours. These negative thoughts also ultimately reinforce more pervasive underlying cognitions (i.e., core, self-referent beliefs).

Although the aforementioned models display some differences in their cognitive emphasis (in both content and types of cognitions), each highlights the fundamental nature of ED cognitions as factors in ED developmental and maintenance processes. As such, it is vital that ED cognitions continue to be assessed and monitored as therapeutic targets and treatment outcomes. Both theoretical and empirical research has supported the idea of measuring ED ‘thoughts’ separately to behaviours, as well as assessing ED cognitions as a continuous construct that is often identified in sub-clinical or prodromal populations [53], where there is need to be proactive in prevention and early intervention. It is vital therefore to utilise assessment tools measuring ED cognitions that possess strong psychometric properties.

Previously, Burton et al. [9] conducted a systematic review of the psychometric properties of self-report measures relating specifically to binge-eating symptoms. Further, two recent systematic reviews reported on the available measures and facets relating to body image [48, 69]. However, none of these reviews provided a comprehensive summary relating to all EDs and related features more broadly. Another recent review summarised the instruments utilised in the assessment of EDs in adults [77], providing an overview of commonly used and recently developed measures of ED symptomatology. However, the focus of this review was only on frequently used and recently developed assessment measures. It did not specifically report on cognitive measures, that is, those with a focus on ED beliefs, expectations, assumptions and thoughts. As such, several important existing instruments that have been developed to assess ED relevant cognitions were not reviewed. Some examples include the Eating Beliefs Questionnaire-18 (EBQ-18) [10], and the Eating Disorder Core Beliefs Questionnaire (ED-CBQ) [28]. Moreover, a thorough evaluation of the psychometric properties of the identified measures using a standardised, published tool for assessing their quality was not conducted; instead, the focus of the review was informative rather than evaluative [77].

Thus, to date, there has been no comprehensive assessment of the available self-report measures of ED cognitions or a thorough assessment of their psychometric properties. The aim of this review was to systematically identify and evaluate the psychometric properties of existing self-report measures of ED cognitions. The psychometric properties of these cognitive self-report measures were evaluated using the appraisal of adequacy tool developed by Terwee et al. [87]. This standardised tool guides quality appraisal by using nine quality criteria: content validity, internal consistency, criterion validity, construct validity, reproducibility (agreement), reproducibility (reliability), responsiveness, floor or ceiling effects, and interpretability. This tool has previously been utilised in Burton et al.’s [9] review of self-report measures of binge-eating symptoms, as well as several other systematic reviews in other domains [59, 84, 101]. Using these criteria, this study intends to systematically summarise the available findings for cognitive ED self-report measurement tools, providing a comprehensive understanding of their psychometric properties, and guidance for researchers and clinicians in evaluating, comparing and utilising these measures.

Method

Search strategy

A systematic review protocol was registered using the international prospective register of systematic reviews (PROSPERO; CRD42023440840). The search strategy followed guidelines outlined in the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA [66]). Utilising guidelines for optimal database combinations for literature searches [8], the search was conducted using three electronic databases: PsycINFO, MedLine, and Embase. No limits were placed on the search based on publication period. Reference lists of all included studies were scanned to identify any additional, relevant publications. Searches were run again prior to final analysis on 17/08/2023. To identify eligible studies, several combinations of keywords were used that related to EDs (e.g., “eating disorder”, “anorexia nervosa”, “bulimia nervosa”, “binge eating disorder”, etc.), self-report measures (e.g., “questionnaire”, “scale”, “tool”, “assessment”, “measure”, etc.), and psychometric properties (e.g., “psychometric”, “reliability”, “validity”, etc.). A comprehensive search was conducted, including a search of titles, abstracts, and keywords; subject headings were mapped, and in some instances the explode function was utilised for expansion of relevant results. The full search strategy created for all three databases is a publicly available supplementary file included in the systematic review protocol registered on PROSPERO.

Inclusion and exclusion criteria

The inclusion criteria were as follows:

  1. The study had to describe the development, validation and/or the psychometric evaluation of a self-report measure of ED cognitions.

  2. The measure (or subscale) was specifically developed to solely assess ED cognitions (that is, thoughts, expectations, assumptions, or beliefs).

  3. The measure was developed and administered in the English language, to native English speakers, and published in the English language in a peer-reviewed journal.

  4. The study utilised an adult population (17+, clinical or general) for development, evaluation or investigation purposes.

The exclusion criteria were as follows:

  1. Non-psychometric studies (such as literature reviews, systematic reviews, or meta-analyses).

  2. Measure (or subscale) was designed for purposes other than assessing ED cognitions (for example, as a screening or diagnostic tool, or to assess behaviours or emotions).

  3. Measure not in English, not administered in English, or published in a non-English-speaking country.

  4. Utilised a child or adolescent population.

  5. Book chapters, non-peer-reviewed publications, published doctoral theses.

Selection process

Articles were screened and selected by two independent reviewers (AH and SB). Using the identified databases, duplicates were identified and removed, and articles were screened by title and abstract for inclusion/exclusion by AH. During the title and abstract screening process, a series of meetings were held with the four authors to reach consensus about the relevance of measures that were accessible during this stage. A similar process was utilised by a previous psychometric systematic review of body image [48]. Full texts of the remaining studies were obtained after the initial screening, then both reviewers analysed each text independently to establish the final texts to be included. Measures (and subscales) were also assessed for their relevance and eligibility at this stage of the screening process if the measure was not available (at the item level) at the stage of title and abstract screening. Studies (and therefore measures) that required further scrutiny to assess whether they met inclusion criteria were included in full text screening in order to assess their relevance and eligibility at the subscale and item level. A final consensus meeting was conducted where all authors provided agreement as to the relevance of all included measures. The overall agreement between the two reviewers was 96.7%, which equates to an inter-rater agreement (Kappa) of κ = 0.93.
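The relationship between percentage agreement and Cohen's kappa can be illustrated with a minimal sketch. The 2×2 counts below are hypothetical (the underlying screening decisions are not reported here) and the function name `cohens_kappa` is our own; the counts were chosen only so that the arithmetic is easy to follow.

```python
def cohens_kappa(both_include, only_a, only_b, both_exclude):
    """Return (observed agreement, Cohen's kappa) for two raters
    making include/exclude decisions on the same set of records."""
    n = both_include + only_a + only_b + both_exclude
    p_observed = (both_include + both_exclude) / n
    # Chance agreement, from each rater's marginal inclusion rate
    a_include = (both_include + only_a) / n
    b_include = (both_include + only_b) / n
    p_expected = a_include * b_include + (1 - a_include) * (1 - b_include)
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return p_observed, kappa


# Hypothetical counts for illustration only, not the review's data
p_o, kappa = cohens_kappa(both_include=40, only_a=2, only_b=1, both_exclude=47)
print(f"agreement = {p_o:.1%}, kappa = {kappa:.2f}")
```

Kappa is lower than raw agreement because it discounts the agreement two raters would reach by chance given how often each of them includes a record.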

Appraisal of quality

The psychometric properties of included studies were analysed using Terwee et al.’s [87] criteria of adequacy for measurement properties. This quality appraisal tool was designed to assess health status questionnaires and has been used in several previous systematic reviews [9, 59, 84, 101]. This tool assesses nine measurement properties: (1) content validity, (2) internal consistency, (3) criterion validity, (4) construct validity, (5) reproducibility: agreement, (6) reproducibility: reliability, (7) responsiveness, (8) floor and ceiling effects, and (9) interpretability. See Table 1 for the definition and criteria of adequacy for each of the nine properties.

Table 1 Criteria of quality of psychometric properties [87]

Criteria were given the following evaluative ratings: positive (+), indeterminate (?), negative (−), or no information available (0). Indeterminate ratings may be given if there are serious doubts about study methodology, as per the guidelines for utilising these criteria [87]. It is essential to consider the methodological quality of included studies when assigning a rating, as those with low methodological quality will have a greater likelihood of reporting biased results. Further, the criteria for internal consistency allowed studies to rely on the results of past factor analyses, or item response theory (IRT) analyses, when calculating Cronbach’s alpha (i.e., each study did not necessarily need to carry out a new factor analysis). Finally, Terwee et al. [87] indicate that not all measurement properties are necessarily equally important; thus, we followed the recommendation not to provide a summary or overall score.

Further, included studies were summarised by describing if they were a development study, any factor analyses performed, the study population, sample size, mean age and standard deviation, and the sex ratio (% females). Included measures (or subscales) were summarised by describing the construct or goal being measured, the number of items, response categories and any subscales or factors. Attempts were made to obtain missing or unclear information by contacting the authors of studies assessed for eligibility. Missing or unclear information that did not affect inclusion was still recorded (as either ‘?’ or Not Applicable [N/A]). In the absence of information from authors contacted, an assumption was maintained that if participants were described to be of ‘college’ or ‘university’ age, that they were a part of an adult sample (17+).

Results

Results of search strategy

The initial search identified 7581 potential studies. After removal of 2484 duplicates, this resulted in 5097 potential studies, of which 57 were considered to have met the inclusion criteria. An additional two studies were identified by cross-checking reference lists for articles of interest and searching Google Scholar. This resulted in a total of 59 included studies (see Table 2 for summary of included studies). The selection process is summarised in Fig. 1.
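The study-flow figures above can be checked with simple arithmetic. The short sketch below uses only the counts reported in this paragraph; the variable names are our own, and the number of records excluded during screening is a derived value that combines title/abstract and full-text exclusions.

```python
identified = 7581          # records returned by the initial search
duplicates = 2484          # duplicates removed
met_criteria = 57          # studies meeting inclusion criteria after screening
from_other_sources = 2     # reference lists and Google Scholar

screened = identified - duplicates
excluded_during_screening = screened - met_criteria  # derived, not reported
included = met_criteria + from_other_sources

print(f"screened: {screened}")
print(f"excluded during screening: {excluded_during_screening}")
print(f"included: {included}")
```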

Table 2 Description of studies

Fig. 1 PRISMA diagram of study identification, screening and selection (PRISMA [66])

A total of 31 measures (or subscales) were identified that met inclusion criteria from the included studies (see Table 3 for summary of included measures). Nine of the included measures were subscales, and nine were a short-form or revised version of an original measure. These measures were the Beliefs About Appearance Questionnaire (BASS), Bulimic Automatic Thoughts Test (BATT), Body Checking Cognitions Scale (BCCS), Bulimia Cognitive Distortions Scale (BCDS), Bulimic Thoughts Questionnaire (BTQ), Eating Beliefs Questionnaire (EBQ), Eating Beliefs Questionnaire-18 (EBQ-18), Eating Disorder Beliefs Questionnaire (EDBQ), Eating Disorder Core Beliefs Questionnaire (ED-CBQ), Eating Disorder Core Beliefs Questionnaire Revised (ED-CBQ-R), Eating Disorder Inventory Body Dissatisfaction subscale (EDI [BD]), Eating Disorder Inventory II Body Dissatisfaction subscale (EDI-II [BD]), Eating Disorder Inventory III Body Dissatisfaction subscale (EDI-III [BD]), Eating Expectancy Inventory (EEI), Functions of Binge Eating Scale (FBES), Irrational Food Beliefs Scale (IFBS), Interpersonal Outcome Expectancy for Thinness scale (IOET), Mizes Anorectic Cognitions Questionnaire (MACQ), Mizes Anorectic Cognitions Questionnaire Brief (MACQ-B), Mizes Anorectic Cognitions Questionnaire Revised (MACQ-R), Muscle Dysmorphia Inventory Drive for Size subscale (MDDI [DS]), Perceived Benefits of Thinness Scale (PBTS), Sociocultural Attitudes Towards Appearance Questionnaire 4 Internalisation Thin subscale (SATAQ-4 [IT]), Sociocultural Attitudes Towards Appearance Questionnaire 4 Revised Internalisation Thin subscale (SATAQ-4R [IT]), Stirling Eating Disorders Scale Anorexic Dietary Cognitions subscale (SEDS [ADC]), Stirling Eating Disorders Scale Bulimic Dietary Cognitions subscale (SEDS [BDC]), Stirling Eating Disorders Scale Low Self-esteem subscale (SEDS [LSE]), Testable Assumptions Questionnaire for Eating Disorders (TAQ-ED), Testable Assumptions Questionnaire for Eating Disorders Revised (TAQ-ED-R), Thoughts Questionnaire (TQ), and the Weight Influenced Self-Esteem Questionnaire (WISE-Q).

Table 3 Description of measure

The cognitive focus of each measure was described using four broad categories: (1) negative thoughts about food and eating, (2) negative thoughts about weight, shape, or body image, (3) self-referent beliefs, and (4) meta-cognitive beliefs (see Table 4). These categories were formed by examining the content of included measures and identifying the main areas of cognitive content and the types of cognitions assessed. Altogether, 15 measures assessed negative thoughts about food and eating, 20 assessed negative thoughts about weight, shape or body image, nine assessed self-referent beliefs, and nine assessed meta-cognitive beliefs. The BCDS, TAQ-ED, and TAQ-ED-R considered all four categories of cognitive focus.

Table 4 Cognitive focus of included measures

Assessment of psychometric properties

The psychometric properties of each included study were assessed using the criteria outlined by Terwee et al. [87]. This assessment was independently conducted by the same two reviewers who screened and assessed the studies for eligibility (AH and SB). Agreement between the reviewers for the criteria of adequacy was 94% (κ = 0.91). Consensus was reached to resolve discrepancies between the two assessors, and as such a third reviewer was not necessary. The summary ratings for each measure are displayed in Table 5.

Table 5 Quality analysis/ratings of psychometric properties

Content validity

Content validity refers to the extent that the items of a measure are an accurate reflection of the construct of interest [87]. To have received a positive rating, studies (measure or subscale) describing the development of the measure were required to provide a clear description of the aim of the measure, concepts measured, target population and item selection. Further, they were required to have had experts and a relevant target population involved in item selection. Seventeen measures received a positive rating for this criterion, including the BASS, BATT, BCCS, BCDS, BTQ, EDBQ, ED-CBQ, ED-CBQ-R, EEI, IFBS, IOET, MACQ-B, MACQ-R, SEDS (ADC) subscale, SEDS (BDC) subscale, and the SEDS (LSE) subscale. These findings indicated that these measures (or subscales) demonstrated adequate content validity.

Internal consistency

Internal consistency refers to the extent to which items in a measure are correlated, and thus assess the same construct [87]. To have received a positive rating, studies must have reported a Cronbach’s alpha of adequate magnitude for the measure (or subscale) and performed a factor analysis using an adequate sample size. When assessing internal consistency for subscales, a factor analysis was considered if it was performed solely on the subscale, or on the full scale provided that it tested the included subscale as a factor. Fourteen measures received a positive rating for this criterion. These findings indicated that the BATT, EBQ, EBQ-18, EDBQ, ED-CBQ-R, EDI-III (BD) subscale, FBES, MACQ, MACQ-R, MDDI (DS) subscale, PBTS, SATAQ-4 (IT) subscale, SATAQ-4R (IT) subscale, and the TQ have demonstrated adequate internal consistency.
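As an illustration of the statistic referred to above, the following minimal sketch computes Cronbach's alpha from item-level responses using the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The response data are hypothetical, the function names are our own, and the 0.70 to 0.95 range used for "adequate magnitude" is an assumption on our part, since this section does not state the cut-offs numerically.

```python
from statistics import variance


def cronbach_alpha(responses):
    """responses: one list of item scores per respondent (same items for all)."""
    k = len(responses[0])
    items = list(zip(*responses))                  # one tuple of scores per item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(person) for person in responses])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


def alpha_in_range(alpha, low=0.70, high=0.95):
    """Assumed adequacy range; adjust to the criteria actually applied."""
    return low <= alpha <= high


# Hypothetical responses: 5 respondents x 4 items on a 1-5 scale
responses = [
    [3, 4, 3, 4],
    [1, 2, 1, 2],
    [4, 5, 5, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}, within assumed range: {alpha_in_range(alpha)}")
```

Note that a positive rating also required a factor analysis on an adequate sample, which this sketch does not address.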

Criterion validity

According to the Terwee et al. [87] criteria, criterion validity is determined by comparison to a ‘gold-standard’ instrument, ensuring the new measure is theoretically related to a well-established measure. We imposed no time-period limits on the studies and measures included, and there is no suitable or widely agreed upon gold standard for assessing ED cognitions. Therefore, where a study assessed criterion validity as per Terwee et al.’s [87] definition, we allowed its authors to provide their own justification or a convincing argument for their choice of gold standard. While the term ‘criterion validity’ was indeed used in some studies, the authors were, in fact, referring to other forms of validity (e.g., content validity or construct validity). No measures received a positive rating; most received a ‘no information available’ rating. Five measures (EBQ-18, EDBQ, EDI [BD] subscale, EDI-II [BD] subscale, EDI-III [BD] subscale) received an indeterminate rating for lacking a convincing argument for the gold standard measure utilised for comparison, and one (TQ) received a negative rating as the correlation with the proposed gold standard was < 0.70.

Construct validity

Construct validity refers to the degree to which scores on a particular questionnaire relate (or do not relate) to other measures in a manner that is consistent with theoretically derived hypotheses concerning the concepts that are being measured [87]. To have received a positive rating, studies were required to provide clear predictions regarding their hypotheses, with at least 75% of the results in the expected direction (e.g., reporting the correlation between two measures in the expected direction). Almost all measures received a positive rating for this criterion, indicating that the vast majority possessed adequate construct validity. Only the TAQ-ED-R received an indeterminate rating, and the BTQ received a ‘no information available’ rating.

Reproducibility: agreement (test–retest)

Agreement refers to the extent to which scores on a measure remain stable over time. According to Terwee et al. [87], adequate agreement is demonstrated when the absolute measurement error is smaller than the Minimally Important Change (MIC). However, MIC was not defined or utilised in any of the studies evaluated in this review. Therefore, we utilised the criterion for agreement previously used by Burton et al. [9] and Zuccala et al. [101], defining adequate agreement as a test–retest reliability of r > 0.70. To have received a positive rating for this criterion, the means and standard deviations must also have been presented at both time points. Seven measures received a positive rating for this criterion, indicating the EBQ, EBQ-18, EDI-III (BD) subscale, IOET, MACQ, MDDI (DS) subscale, and the PBTS possessed adequate test–retest agreement.
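The modified agreement criterion can be sketched as follows, assuming that test–retest reliability is computed as a Pearson correlation between scores at the two time points (individual studies may have used other coefficients). The scores are hypothetical and the function names are our own.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


def retest_summary(time1, time2, threshold=0.70):
    """Summarise what the modified criterion asks for: r plus means/SDs at both time points."""
    r = pearson_r(time1, time2)
    return {
        "r": r,
        "adequate": r > threshold,
        "time1_mean_sd": (mean(time1), stdev(time1)),
        "time2_mean_sd": (mean(time2), stdev(time2)),
    }


# Hypothetical total scores for six participants at test and retest
time1 = [12, 18, 25, 30, 22, 15]
time2 = [14, 17, 27, 29, 20, 16]
print(retest_summary(time1, time2))
```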

Reproducibility: reliability

Reproducibility (reliability) refers to the extent to which individuals can be distinguished from each other [87]. To have received a positive rating, measures needed to report an intraclass correlation or weighted Kappa > 0.70. The EBQ-18 and the EDI-III (BD) subscale were the only measures found to demonstrate adequate reliability. All other measures received a ‘no information available’ rating.

Responsiveness

Responsiveness refers to the ability of a measure to detect clinically important changes over time or following an intervention [87]. To have received a positive rating, the studies must have outlined a treatment program and the longitudinal expected changes, and 75% of results must have been in the expected direction. Alternatively, measures had to demonstrate a responsiveness ratio of > 1.96 or an area under the curve > 0.70. Only the EBQ, EBQ-18, SATAQ-4 (IT) subscale, and WISE-Q demonstrated adequate responsiveness.

Floor and ceiling effects

According to Terwee et al. [87], floor and ceiling effects have occurred when > 15% of participants achieve the highest or lowest possible score on a measure (or subscale). Only the EDI-III (BD) subscale received a positive rating for this criterion, for reporting information demonstrating a lack of floor and ceiling effects. The ED-CBQ-R demonstrated a floor effect on one of its subscales, consequently receiving a negative rating. All other measures received a ‘no information available’ rating.
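The 15% rule can be expressed as a short check. The sketch below is illustrative only: the scores and the scale range are hypothetical, and the function name `floor_ceiling_effects` is our own.

```python
def floor_ceiling_effects(scores, min_score, max_score, threshold=0.15):
    """Flag a floor (or ceiling) effect when more than `threshold` of
    respondents obtain the lowest (or highest) possible score."""
    n = len(scores)
    floor_share = sum(1 for s in scores if s == min_score) / n
    ceiling_share = sum(1 for s in scores if s == max_score) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


# Hypothetical subscale scores on a 0-20 range for ten respondents
scores = [0, 0, 3, 5, 8, 9, 12, 14, 17, 20]
print(floor_ceiling_effects(scores, min_score=0, max_score=20))
```

In this hypothetical sample, two of ten respondents score at the minimum (20%), which would be flagged as a floor effect, whereas one of ten at the maximum (10%) would not be flagged as a ceiling effect.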

Interpretability

Interpretability refers to the extent to which qualitative meaning can be given to quantitative scores [87]. To have received a positive rating, a measure must have presented means and standard deviations for at least four relevant subgroups within one study (in isolation). As per Terwee et al.’s suggestion, subgroups stratified by demographic variables (e.g., age, gender) may be included as subgroups. Further, as in Zuccala et al. [101], this criterion was modified such that a minimal important change (MIC) was not required to have received a positive rating, as no studies reported a MIC. Findings indicated that 14 measures (or subscales) possessed adequate interpretability, including the BATT, EBQ, EBQ-18, ED-CBQ-R, EDI (BD) subscale, IFBS, MACQ-R, SATAQ-4 (IT) subscale, SATAQ-4R (IT) subscale, SEDS (ADC) subscale, SEDS (BDC) subscale, SEDS (LSE) subscale, TQ, and the WISE-Q. As Modini et al. [59] and Zuccala et al. [101] have previously noted, if an accumulation of subgroups between studies were considered (rather than subgroups within one study in isolation), it is likely that more measures would have met this criterion.

Discussion

Prioritising the assessment of ED cognitions is essential for monitoring key factors contributing to the development and maintenance of EDs. It is essential that self-report measures possess adequate psychometric properties to increase their validity, reliability and utility for clinical research and practice. This systematic review aimed to evaluate the psychometric properties of existing self-report measures of ED cognitions, using the Terwee et al. [87] criteria of adequacy for measurement properties as the quality assessment tool.

This review identified 59 studies that evaluated the psychometric properties of 31 self-report measures (or subscales) that assess ED cognitions. The type of cognitive focus examined most across these measures was negative thoughts about weight, shape or body image, followed by negative thoughts about food and eating. Self-referent and meta-cognitive beliefs were assessed to a lesser extent across the included measures. This suggests that the majority of self-report measures specifically developed to assess ED cognitions have primarily been focused on the ‘thought’ level, whereas fewer have been developed to examine longstanding or ingrained belief sets, including conditional and unconditional assumptions and beliefs.

With respect to the psychometric properties of included measures, no measure (or subscale) received positive ratings across all categories. The criterion that received the greatest number of positive ratings across all measures was construct validity. This suggests firstly that studies tend to examine construct validity more routinely than other psychometric properties, and secondly that existing measures assessing ED cognitions generally demonstrate strong evidence for construct validity. By comparison, the findings show that other criteria are evaluated and considered less often overall. Examples include criterion validity, reproducibility (reliability), responsiveness and floor and ceiling effects, where no information on the assessment of these properties was found for the majority of measures. Interestingly, in terms of content validity, most studies and measures clearly described the concepts measured and the intended target population, and involved experts in item selection. However, all measures with an indeterminate rating were characterised by a lack of clarity about whether and how the target population had been involved in item selection, and those with a negative rating failed to describe any involvement of the target population in item selection. It is clearly important to have items reflecting cognitive constructs that have been acknowledged as relevant by those populations whose cognitions the measures are attempting to describe and assess.

In relation to the four broad categories of cognitive focus identified, some specific observations and subsequent recommendations can be made. Of measures that assess negative thoughts relating to food and eating, the EBQ-18 currently possesses the most evidence for its validity, reliability and utility. Relatedly, of measures that assess negative thoughts related to weight, shape and body image, the EDI-III (BD) subscale and the SATAQ-4 (IT) subscale have the most psychometric support. However, it is important to note that these measures both focus on body dissatisfaction that implies desire for a thinner body, as opposed to, for example, a more muscular body. Thus, even within the categories identified, it remains important to consider the construct of interest when selecting and utilising a cognitive ED measure. With respect to self-referent beliefs, the ED-CBQ-R currently has the most evidence for its validity, reliability and utility. Of measures that assess meta-cognitive beliefs related to EDs, the EBQ-18 has the most empirical support for its psychometric properties.

An overall pattern also emerged indicating that newer, revised versions of an original measure demonstrated a greater number of positive ratings, as evidenced by the EBQ-18, ED-CBQ-R, EDI-III (BD), and the MACQ-R. The results of this systematic review also suggest that the EBQ, EBQ-18, and EDI-III (BD) have the most evidence in support of their psychometric properties. However, a strict conclusion cannot be drawn that these measures are objectively superior, as not all measurement properties are necessarily equally important [87]. As previously noted [101], different measures may be utilised for different purposes, and certain psychometric properties may have varied importance under diverse circumstances. For example, for the purposes of a clinical trial, adequate responsiveness might be of greatest importance, whereas in a prevalence study aiming to identify the presence or absence of various cognitions, choosing a measure with greater breadth may be more beneficial. Moreover, conclusions cannot be drawn as to the superiority of a specific measure, as this review considers different types of ED cognitions. The included self-report measures of ED cognitions examine different constructs (e.g., body dissatisfaction versus meta-cognitive beliefs about eating), and types of cognitions (e.g., automatic thoughts versus unconditional beliefs). Thus, we recommend that those wishing to utilise measures with more evidence for their validity, reliability and utility also consider the construct of interest, the intended usage of the measure, and the type of cognition to be assessed.

This systematic review had several notable strengths, including the standardised and systematic approach applied through its utilisation of a previously validated tool, widely utilised in previous systematic reviews evaluating psychometric properties of self-report measures [9, 59, 84, 101]. This review also demonstrated evidence of good interrater reliability at several stages of the review process. Additionally, the present review included all versions of measures in included studies, regardless of whether the measure was an original, revised or short version of a self-report measure. This facilitated a comprehensive summary and allowed for comparison of the state of evidence for each questionnaire version.

It is also important to note several limitations, the first of which is the stringent nature of the quality appraisal tool utilised in the present study. The criteria used may have meant that some measures received either an intermediate or negative rating in accordance with strict standards for appraisal of adequacy, when less exacting criteria might have resulted in a more positive rating. For example, we adhered to the strict Cronbach’s alpha cut-off in assessing internal consistency, even though a few studies presented alphas that fell only marginally outside the range required for a positive rating. The authors attempted to address this in several instances, whilst still maintaining the intended rigour, by providing some leniency across ratings as described previously, and in alignment with carefully considered modifications that have also been employed in previous systematic reviews [9, 101]. For example, for test–retest reliability, we accepted studies that measured reliability using more frequently utilised statistical methods. Further, the authors were somewhat lenient in terms of content validity where body image measures were concerned: the requirement for ‘involving the target population in item selection’ was taken to mean that the target population did not necessarily need to be a clinical ED population, but simply those presenting with varied body image concerns. Altogether, it was considered necessary and beneficial to maintain this standard in order to recommend usage of psychometrically sound measures and, importantly, to give researchers an incentive to continue to improve the quality of existing and future assessment tools and, by extension, the quality of empirical evidence more generally.

Additionally, it is important to consider the strict nature of our inclusion and exclusion criteria alongside the evidence presented. A limitation of this review is that only articles utilising an adult, English-speaking population were considered for inclusion. As such, some measures or studies in support of certain psychometric properties might have been excluded in accordance with these criteria. Ultimately, studies utilising non-English speaking populations or measures not administered in English were outside the scope of this review. The authors also considered the importance of reducing heterogeneity in the evidence evaluated in the context of this systematic review [51]. This was also relevant to the exclusion of studies utilising child, purely adolescent, or mixed populations. Given differences in ED symptomatology across developmental stages [23, 50], it was important to consider potential differences in ED cognitions between adult and child or adolescent populations, and that some assessment tools may be valid and reliable in one population and not another.

Finally, the stringency of our criteria regarding measures being developed specifically and solely to measure ED cognitions meant that some more commonly utilised ED measures were not included in this systematic review. These included the EDE-Q [26] concerns subscales and versions of the Eating Attitudes Test (EAT-26 and EAT-40 [34, 35]). Examination of their intention during development and careful scrutiny at the item level indicated that they captured other symptomatology, including emotional and behavioural symptoms, as well as functional impairment. These exclusions may impact the clinical utility of this review, due to the frequency of use of these measures for clinical and diagnostic purposes [42]. However, we also considered that the psychometric properties of these two measures have been assessed and systematically reviewed elsewhere [5, 45, 67]. Another commonly utilised cognitive measure not considered for inclusion was the Young Schema Questionnaire (YSQ-S3) [100]. Although useful in capturing core cognitive constructs, it was not developed specifically to capture ED cognitions, which was the focus of this systematic review. Lastly, although this study included the EDI-III (BD) as a measure of ED cognitions, previous systematic reviews have considered the EDI-III (BD) to assess both evaluative and affective components of body image [48]. Due to this discrepancy, we recommend that findings relating to this subscale be interpreted with caution.

The results of this systematic review suggest several areas of improvement for future research. Importantly, our findings do not necessarily indicate that measures lacking adequate evidence should not be utilised. Rather, they highlight where psychometric support is absent or limited, and suggest that future researchers focus their efforts on improving evidence for the validity, reliability and utility of these measures for the populations considered. There is perhaps greater need to consider the responsiveness and clinical utility of ED cognitive measures, as well as less widely considered measurement properties, such as criterion validity and floor and ceiling effects.

When examining the cognitive focus, content and types of cognitions of included measures, it is clear that most tend to consider only AN, BN, BED, and assessment of body image concerns. There is certainly space to consider the development and availability of tools to assess cognitions in comparatively less well researched EDs, or those with rapidly increasing clinical focus. This includes disorders such as Avoidant/Restrictive Food Intake Disorder (ARFID) and orthorexia nervosa, for which only symptomatic and behavioural inventories and tools to support diagnosis are currently available. Finally, the populations most frequently utilised in included studies make it clear that research needs to continue to prioritise inclusion of more male and gender-diverse samples. This is particularly critical due to growing evidence outlining the increased and substantial ED risk in gender-diverse individuals [37, 39, 81].

Altogether, the present study was the first to provide a systematic review of self-report measures of ED cognitions, and offers valuable information about the existing, relevant evidence for their psychometric properties. This information provides a basis for the future selection of valid, reliable and clinically useful tools for measuring a variety of ED cognitions. Although no measure appeared to possess adequate evidence across all nine measurement criteria, several provided a good amount of evidence in support of their reliability, validity and utility. Ultimately, comprehensive information was provided to support future selection of measures of ED cognitions dependent on the specific aims of research and/or treatment. It is hoped that the findings of the present review assist researchers and clinicians alike in identifying, evaluating and comparing relevant measures for use in identifying and monitoring important treatment targets in clinical, sub-clinical or prodromal ED populations.