Introduction

Ulcerative colitis and Crohn’s disease are the main forms of inflammatory bowel disease (IBD). Both involve chronic inflammation of the gastrointestinal tract and present with heterogeneous symptoms, mainly abdominal pain and diarrhea associated with malabsorption, weight loss and fever [1]. IBD follows a course of relapse and remission [2]. Although its etiology is unknown, it is considered a multifactorial disease because of its association with genetic factors [3], immune mediators [4], changes in the intestinal microbiome [5] and exposure to various environmental agents [6].

The onset of IBD generally occurs around the third decade of life, but 25% of cases begin during childhood and adolescence [7]. The peak age of onset for Crohn’s disease is generally between 20 and 30 years, while ulcerative colitis usually begins between 30 and 40 years of age [8].

The incidence and prevalence of IBD vary according to geographic location, environment and ethnicity [9]. The latest reported data on the incidence of ulcerative colitis in North America and Europe ranged from 0 to 19.2 per 100,000 and 0.6 to 24.3 per 100,000, respectively [10], whereas its prevalence was 37.5 to 248.6 per 100,000 in North America and 4.9 to 505 per 100,000 in Europe [11]. For Crohn’s disease, the incidence varied from 0 to 20.2 per 100,000 in North America and from 0.3 to 12.7 per 100,000 in Europe [10]. In Latin America the data vary considerably; however, in recent decades there has been a progressive increase, with a prevalence of 0.99 to 44.3 per 100,000 inhabitants for ulcerative colitis and 0.24 to 16.7 per 100,000 inhabitants for Crohn’s disease [12, 13]. Epidemiological data suggest that the global incidence of IBD is rising markedly, which is a particular concern for developing countries, whose health systems may lack the resources, health staff and infrastructure needed to diagnose and treat the disease.

Considering the increasing prevalence of IBD and its health, social and economic impact (direct and indirect costs for health systems and out-of-pocket expenses) [13], it is important to have high-quality tools that facilitate its systematic treatment. In the last decade, there have been important advances in pharmacological, non-pharmacological and surgical interventions for the management of IBD [14, 15]. These advances have been translated into several Clinical Practice Guidelines (CPGs), whose quality has not yet been assessed.

CPGs are systematically developed statements intended to help physicians and patients make decisions about appropriate medical care in specific circumstances, based on high-quality scientific evidence [16]. Their recommendations are intended to improve the quality of patient care by encouraging interventions of proven benefit and discouraging ineffective or potentially harmful ones [16]. Several tools exist to assess the quality of a CPG and its implementation [17]; of these, the AGREE II tool, developed by the AGREE (Appraisal of Guidelines, Research, and Evaluation) collaboration, is the most validated and widely used [18, 19]. This tool helps to assess the transparency and quality of guideline development, provides a methodological strategy for developing guidelines, and establishes a scheme for reporting them [20]. AGREE II can be applied to CPGs on diagnosis and medical interventions as well as to guidelines on health promotion and public health, among others [20].

Therefore, the main objective of this study is to systematically evaluate CPGs for the diagnosis and treatment of IBD using the AGREE II tool, to provide evidence on their methodological quality and to assess changes in guideline quality over time.

Methods

Data Search

A systematic search was performed up to January 2022 to identify CPGs on the diagnosis and treatment of IBD. CPGs were searched for in databases (MEDLINE via PubMed, EMBASE, CINAHL, LILACS), on the websites of professional societies (Canadian Association of Gastroenterology [CAG], British Society of Gastroenterology, American Gastroenterological Association [AGA], Brazilian Society of Gastroenterology), and in registries and guideline developers’ websites (NICE, SIGN). The full search strategy is detailed in Additional file 1.

Inclusion and exclusion criteria

We included: 1.- CPGs with specific recommendations for the diagnosis and treatment of IBD, both for Crohn’s disease (CD) and ulcerative colitis (UC); 2.- CPGs on IBD that included pediatric, young, adult, and elderly populations; 3.- CPGs that provided the full search strategy that was conducted; 4.- CPGs that described the process used to reach their recommendations; 5.- CPGs published up to January 2022, with no restriction on start date; and 6.- the latest available published version of each CPG. The following documents were excluded: 1.- CPGs exclusively dealing with other clinical scenarios such as diagnostics (e.g. endoscopic, imaging), nutrition, immunological or surgical interventions for IBD; 2.- secondary publications (e.g., systematic reviews and meta-analyses); and 3.- abstracts from CPGs.

Data Collection

Five reviewers working in pairs (DH, CMG, PA, RZ, RV) independently screened the guidelines by title and abstract against the above inclusion and exclusion criteria. When the inclusion criteria were met, the full-text articles were retrieved and screened in pairs for eligibility. The entire screening process was performed using Rayyan (Rayyan Systems Inc) [21]. Two reviewers independently extracted the following data for each CPG: title, year of publication, submitting organization, type of funding, method used to collect evidence, number of sources documented, methods used to assess the quality and validity of the evidence, methods used to formulate the recommendations, country, and language. In case of disagreement, a third reviewer (VA, DSR) was involved.

Quality assessment

The AGREE II instrument [18,19,20,22] was used to evaluate the quality of the included CPGs. This instrument provides criteria for assessing the quality of clinical practice guidelines through 23 items, or questions, grouped into 6 domains: 1.- scope and purpose, 2.- stakeholder involvement, 3.- rigor of development, 4.- clarity of presentation, 5.- applicability, and 6.- editorial independence. The first domain evaluates the general objective of the CPG, the specific health aspects it covers and the target population; the second domain refers to the degree to which the guideline has been developed by the appropriate stakeholders and represents the views of intended users; the third domain refers to the process used to gather and synthesize evidence and the methods used to formulate and update recommendations; the fourth domain focuses on the language, structure and format of the guideline; the fifth domain refers to barriers and facilitators to CPG implementation, strategies for its adoption and resource considerations; and the sixth domain examines whether the formulation of recommendations is biased by conflicts of interest [19].

Each of the 23 items is rated on a 7-point Likert-type scale, with 7 being the maximum score, corresponding to “strongly agree”, and 1 the minimum score, corresponding to “strongly disagree”.

For the global guideline evaluation, we used a 3-point scale: 1 “not recommended”, 2 “recommended with modifications” and 3 “recommended”. Six reviewers (DH, CMG, PA, RZ, JAF, RV), with clinical and methodological expertise, independently peer-scored each of the 23 items of the 6 domains of the AGREE II instrument for each CPG that was included. In case of disagreements with the assessment, a consensus was reached with the support of a third reviewer (AV, DSR).

Statistical analysis

A descriptive analysis of the CPGs was performed using the general characteristics of each CPG from the extracted data. To calculate the score for each domain of the AGREE II tool, all item scores were summed up and the total value was standardized as a percentage of the maximum possible score for that domain, using the following formula:

$$\textrm{Standardized score (SP)}=\frac{\textrm{score obtained}-\textrm{lowest possible score}}{\textrm{highest possible score}-\textrm{lowest possible score}}\times 100$$

With this method, the standardized score for each domain ranged from 0 to 100%. The standardized scores for each domain across all the guidelines are summarized using the mean, median, first quartile (Q1), third quartile (Q3) and interquartile range (IQR), and are displayed in a boxplot. The degree of agreement between reviewers was assessed through the intraclass correlation coefficient (ICC) with a 95% confidence interval (CI). To visualize and compare the mean AGREE II scores obtained by the 26 CPGs assessed in this study, we generated a hexagonal radar chart in which each domain is represented on a radial axis starting at 0 at the center and the maximum score of each domain corresponds to a vertex of the hexagon. Finally, for the analysis of quality change over time, the CPGs were categorized into two periods (2012 to 2017 and 2018 to 2022) and Student’s t-test was used to compare the mean scores of the two periods. Data analysis was performed in the statistical software RStudio v.1.4 [23] using the libraries ggplot2 [24], irr [25], tidyverse [26] and table1 [27].
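As an illustration of the formula above, the following Python sketch computes the standardized score for a single domain. It assumes, following the AGREE II user's manual, that the lowest and highest possible scores are 1 and 7 multiplied by the number of items and the number of appraisers. The function name and the ratings are hypothetical, and the study's own analysis was run in R rather than with this code.

```python
from typing import Sequence


def standardized_domain_score(ratings: Sequence[Sequence[int]]) -> float:
    """Return the AGREE II standardized score (0 to 100) for one domain.

    `ratings` holds one row per appraiser and one column per item in the
    domain; every rating must be an integer from 1 to 7.
    """
    if not ratings or not ratings[0]:
        raise ValueError("need at least one appraiser and one item")
    n_items = len(ratings[0])
    for row in ratings:
        if len(row) != n_items:
            raise ValueError("every appraiser must rate the same items")
        if any(r < 1 or r > 7 for r in row):
            raise ValueError("AGREE II ratings range from 1 to 7")

    n_appraisers = len(ratings)
    obtained = sum(sum(row) for row in ratings)
    # Assumption: bounds follow the AGREE II manual (all 1s and all 7s).
    lowest = 1 * n_items * n_appraisers
    highest = 7 * n_items * n_appraisers
    return (obtained - lowest) / (highest - lowest) * 100


# Hypothetical ratings (not study data): 4 appraisers, 3-item domain.
example_ratings = [
    [5, 6, 6],
    [6, 6, 7],
    [2, 4, 3],
    [3, 3, 2],
]
example_score = standardized_domain_score(example_ratings)
```

With these hypothetical ratings the obtained score is 53, the lowest possible score is 12 and the highest is 84, so the standardized score is (53 - 12) / (84 - 12) × 100, or about 56.9%.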

Table 1 General characteristics of the CPGs

Results

Guideline characteristics

A total of 8723 records were retrieved by the search strategy, and 8165 remained after deduplication. Subsequently, 203 records were screened in full text, of which 26 CPGs met the inclusion criteria and were included for data extraction (Fig. 1). Details on the characteristics of the included CPGs are shown in Table 1 [28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53].

Fig. 1
PRISMA flow diagram showing the flow of records that were obtained and reviewed throughout the different phases of the quality assessment

Of the 26 included CPGs, four were from the United States (15.38%) and four were developed by an international collaboration (15.38%); three each were from the United Kingdom, Canada and Japan (11.53% each); two each were from Brazil and Mexico (7.69% each); and one each was from Germany, Israel, South Korea, the Netherlands and Poland (3.84% each). The included guidelines were published between 2012 and 2021 (see Table 1).

Three of the 26 guidelines focused exclusively on the pediatric population [29, 33, 51], while the others focused mainly on adults. In terms of the scope of the CPGs, 22 dealt with diagnosis and clinical management [28,29,30, 32,33,34,35,36,37,38,39,40,41, 43,44,45, 47,48,49, 51,52,53], two with the use of biologic drugs only [42, 46], one with surgical management in the emergency setting [50] and one with the surgical management of ulcerative colitis [31]. All guidelines were considered evidence-based according to our a priori criteria.

Eighteen guidelines (69.23%) used the Grading of Recommendations Assessment, Development and Evaluation (GRADE) methodology to assess the quality of evidence and grade the strength of recommendations. Seven guidelines (26.92%) used the Oxford Centre for Evidence-Based Medicine criteria, and one guideline (3.84%) used a self-grading system to assess the quality of evidence (Table 1).

Quality assessment

The agreement between the 6 reviewers was moderate, with an ICC of 0.74 (95% CI: 0.36 to 0.89; p = 0.000683). A summary of the ICCs achieved by each pair of reviewers is shown in Table 2.
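For readers less familiar with the ICC, the sketch below shows how one common form of the coefficient can be computed from a guidelines-by-raters table of scores. The form used in the study is not stated here, so the choice of ICC(2,1) (two-way random effects, absolute agreement, single rater) is an assumption, as are the function name and the ratings; the sketch does not compute the confidence interval or p-value, which the authors obtained with the irr package in R.

```python
from typing import Sequence


def icc_two_way_single(scores: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` holds one row per rated subject (here, a guideline) and one
    column per rater. Assumption: this ICC form is chosen for illustration
    and may differ from the one used in the study.
    """
    n = len(scores)      # number of subjects
    k = len(scores[0])   # number of raters
    if n < 2 or k < 2:
        raise ValueError("need at least two subjects and two raters")
    if any(len(row) != k for row in scores):
        raise ValueError("every subject must be scored by every rater")

    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA decomposition of the total sum of squares.
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative ratings only (6 subjects, 4 raters); not the study's data.
example_scores = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
example_icc = icc_two_way_single(example_scores)
```

Because ICC(2,1) measures absolute agreement, systematic differences between raters (the column term) lower the coefficient even when raters rank the guidelines in the same order.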

Table 2 Intraclass correlation coefficients (ICC) by peer reviewers

Figure 2 shows a boxplot summarizing the distribution of the standardized scores for each domain assessed with the AGREE II tool. In addition, Table 3 shows the standardized scores for each domain in each clinical practice guideline.
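The per-domain summaries reported in the following subsections (mean, median, quartiles, IQR and the number of guidelines scoring above 60%) could be computed along the lines of this minimal Python sketch. The function name and input values are hypothetical; the "inclusive" quantile method is chosen on the assumption that it matches the default quantile definition in R, which the authors used.

```python
import statistics
from typing import Dict, Sequence


def summarize_domain(scores: Sequence[float], cutoff: float = 60.0) -> Dict[str, float]:
    """Summarize the standardized scores of one AGREE II domain.

    `scores` holds one standardized score (0 to 100) per guideline.
    """
    # method="inclusive" interpolates between observed values, treating the
    # sample minimum and maximum as the 0th and 100th percentiles.
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "mean": statistics.mean(scores),
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        # Guidelines strictly above the cut-off, as in "scored above 60%".
        "n_above_cutoff": sum(1 for s in scores if s > cutoff),
    }


# Hypothetical standardized scores for five guidelines (not study data).
example_summary = summarize_domain([20.0, 40.0, 60.0, 80.0, 100.0])
```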

Fig. 2
Distribution of standardized scores by domain for the 26 CPGs

Table 3 Standardized scores by domains of AGREE II

Domain 1: Scope and purpose

This domain evaluates the general objective of the CPG, specific health aspects and the target population [19]. The mean score was 84.51% (median: 90.27%, Q1: 78.47%, Q3: 94.44% and IQR = 15.97%; Fig. 2). Twenty-four CPGs (92.30%) scored above 60% in this domain [28, 29, 31,32,33,34,35,36, 38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53]. See Table 3 for details on domain 1.

Domain 2: Stakeholder involvement

This domain refers to the degree to which the guideline has been developed by the appropriate stakeholders and represents the views of intended users [19]. The mean score was 60.90% (median: 66.67%, Q1: 36.11%, Q3: 83.33% and IQR = 47.22%; Fig. 2). Fourteen CPGs (53.84%) scored above 60% in this domain [32, 38, 40, 41, 43,44,45, 47,48,49,50,51,52,53]. See Table 3 for details on domain 2.

Domain 3: Rigor of development

This domain refers to the process used to gather and synthesize evidence, the methods used to formulate and update recommendations [19]. The mean score was 69.95% (median: 69.79%, Q1: 58.07%, Q3: 86.20% and IQR = 28.12%; Fig. 2). Nineteen CPGs (73.07%) scored above 60% in this domain [28, 32,33,34, 36, 38,39,40,41, 43,44,45, 47,48,49,50,51,52,53]. See Table 3 for details on domain 3.

Domain 4: Clarity of presentation

This domain focuses on the language, structure and format of the guideline [19]. The mean score was 85.58% (median: 91.67%, Q1: 75.00%, Q3: 100.00% and IQR = 25%; Fig. 2). Twenty-four CPGs (92.30%) scored above 60% in this domain [28,29,30,31,32,33,34,35,36,37,38,39,40,41, 43,44,45, 47,48,49,50,51,52,53]. See Table 3 for details on domain 4.

Domain 5: Applicability

This domain refers to barriers and facilitators to CPG implementation, strategies for its adoption and resource considerations [19]. The mean score was 26.60% (median: 20.83%, Q1: 12.50%, Q3: 39.06% and IQR = 26.56%; Fig. 2). Only one CPG (3.84%) scored above 60% in this domain [38]. See Table 3 for details on domain 5.

Domain 6: Editorial Independence

This domain examines whether the formulation of recommendations is biased by conflicts of interest [19]. The mean score was 62.02% (median: 75.00%, Q1: 30.21%, Q3: 91.67% and IQR = 61.45%; Fig. 2). Sixteen CPGs (61.53%) scored above 60% in this domain [28, 32, 34, 35, 37, 38, 40, 41, 43,44,45, 47,48,49, 51, 53]. See Table 3 for details on domain 6.

Overall assessment

Seven out of the 26 evaluated CPGs (26.9%) were “recommended” by the independent reviewers [38, 40, 43, 44, 47, 49, 53]. Most of the CPGs, 15 guidelines (57.7%), were “recommended with modifications” [28, 29, 31,32,33,34,35,36, 39, 41, 45, 48, 50,51,52]. Finally, 4 CPGs (15.4%) were “not recommended” (see Table 3) [30, 37, 42, 46].

Combined assessment

Finally, the radar plot shows that the domains “scope and purpose”, “stakeholder involvement”, “rigor of development”, “clarity of presentation” and “editorial independence” obtained broadly similar scores, whereas the domain “applicability” was notably deficient across the evaluated guidelines (Fig. 3).

Fig. 3
Radar chart of the mean standardized scores by domains of the 26 IBD CPGs assessed

Quality assessment over time

With respect to quality change over time, no statistically significant differences were found for the means of the standardized scores for each AGREE II domain between the guidelines published during the 2012-2017 period and those published between 2018 and 2022 (Table 4).
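The comparison between the two publication periods can be sketched as follows. This Python example computes the pooled-variance (Student's) t statistic and its degrees of freedom for one domain; the function name and scores are hypothetical, and equal variances are assumed because the text names Student's t-test. Obtaining the p-value additionally requires the t distribution, for example via `scipy.stats.ttest_ind` in Python or `t.test` in R.

```python
import math
import statistics
from typing import Sequence, Tuple


def student_t(group_a: Sequence[float], group_b: Sequence[float]) -> Tuple[float, int]:
    """Return the two-sample Student's t statistic and degrees of freedom.

    Assumes independent groups with equal variances (pooled variance).
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")

    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    if std_error == 0:
        raise ValueError("both groups are constant; t is undefined")

    t_stat = (statistics.mean(group_a) - statistics.mean(group_b)) / std_error
    return t_stat, df


# Hypothetical domain scores for two periods (not study data).
period_2012_2017 = [1.0, 2.0, 3.0, 4.0, 5.0]
period_2018_2022 = [2.0, 4.0, 6.0, 8.0, 10.0]
example_t, example_df = student_t(period_2012_2017, period_2018_2022)
```

In the study this comparison would be repeated for each of the six domains, with the groups defined by publication period.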

Table 4 Quality changes over time

Discussion

What do the findings of this study mean?

This review showed that the evaluated IBD CPGs were of acceptable quality according to the AGREE II instrument: 7 of the 26 evaluated guidelines were “recommended”, 15 were “recommended with modifications” and only 4 were “not recommended”. The domains with the highest mean scores were “clarity of presentation” (85.58%) and “scope and purpose” (84.51%), indicating that most of the assessed guidelines had well-defined general and specific objectives, clearly defined the population to which the guideline was intended to apply, and presented recommendations that were clearly described and identifiable. “Rigor of development” received the third-best score (69.95%). This domain could be argued to have the greatest effect on the quality of a clinical practice guideline, since it covers the entire process used to formulate the recommendations and comprises the most items within AGREE II [54]. We consider a score over 60% more than acceptable for “rigor of development”; the score was not higher because most guidelines were penalized for describing external expert review unclearly and for lacking an explicit updating statement.

The domains “stakeholder involvement” and “editorial independence” obtained scores slightly over 60% (60.90% and 62.02%, respectively; Fig. 3), which indicates that the views and preferences of patients still need to be considered when a CPG is drafted and that an expert methodologist or epidemiologist should be included in the guideline drafting group. In addition, both domains scored relatively low because most guidelines provided limited information on funding and its influence on the guidelines’ content, and little detail on conflicts of interest and how these were dealt with. Addressing these limitations during CPG development could contribute to improving guideline quality.

The “applicability” domain was the lowest-scoring domain in this review, with a mean score of 26.60% (Fig. 3), well below the 60% cut-off point. The main reason is that most guideline developers do not fully consider the facilitators of and barriers to guideline implementation, or the resources and tools available in a specific context. We also noted that most of the guidelines did not consider the economic impact of their recommendations on resources and health budgets; for example, most did not include health economists in the guideline development group or perform a cost-benefit analysis. These limitations and omissions restrict the translation of the guidelines into clinical practice and thus hinder their usability.

Regarding quality change over time, this study did not find statistically significant differences between guidelines published during the 2012-2017 period and those published between 2018 and 2022 (see Table 4) for any domain covered by AGREE II. This finding may be due to the small sample size, which results from the specific inclusion criteria applied in the selection of CPGs and from the large variety of CPGs for IBD (clinical, surgical, preventive, etc.) that we encountered when screening. In addition, the periods we compared may have been too short, since the development of IBD guidelines meeting our study’s criteria is a relatively recent activity. One point to highlight, however, is the implementation and dissemination of the GRADE methodology in guideline development, especially in guidelines produced in the last 4 years; our study found that 18 of the 26 included CPGs used this methodology as a framework for grading the evidence and formulating their recommendations.

The context of this review with other literature

While this review is not the first to evaluate clinical practice guidelines on inflammatory bowel disease, it is the first to evaluate a large sample of CPGs, because its search had no date restriction; this gave us a much broader picture of what has been produced in the past and the present. In line with other reviews of CPGs for IBD, conducted by other investigators and addressing different contexts of inflammatory bowel disease, the domains with the highest scores were “clarity of presentation” and “scope and purpose” and the domains with the lowest scores were “stakeholder involvement” and “applicability” [55,56,57]. These results are also similar to previous CPG evaluations in other clinical and surgical areas such as interventional radiology, pediatrics and dermatology [58,59,60].

In addition, other studies that investigated quality changes over time for clinical practice guidelines in other specialties did not find evidence of significant changes in quality between the periods evaluated [61,62,63]. These results are consistent with the findings of this study. However, the studies by Bhatt et al. [64] on CPGs for pediatric type 2 diabetes and by Acuña-Izcaray et al. [65] on CPGs for asthma found statistically significant differences in quality over time for individual domains, although not for all domains at the same time.

Strengths and limitations

Although a strength of our systematic review was the broad and exhaustive approach of our search – carried out in databases, compiling entities and guideline developers, with a sensitive strategy designed for this purpose – it is possible that our review may have missed some CPGs that were not adequately indexed or that dealt with other contexts related to inflammatory bowel disease. Likewise, our study only included CPGs published in English or Spanish, factors that could have contributed to a potential selection bias.

In addition, because we selected CPGs using well-defined inclusion criteria, the included guidelines probably score higher than the entire universe of CPGs for IBD, so our results may overestimate overall quality. Our conclusions are therefore most relevant to guidelines of the type we evaluated.

On the other hand, the degree of agreement reached by the reviewers was moderate (ICC = 0.74). This may be because the AGREE II instrument rates each item on a 7-point Likert-type scale in which only the extreme values are well defined, leaving the intermediate values (3, 4 and 5) prone to subjectivity. Because our study had a large number of reviewers (six), reaching a higher ICC was difficult. However, we consider that the value achieved provides adequate reliability [66].

In addition, although the AGREE II tool has become the most widely used and popular resource for assessing the quality of CPGs since its introduction in 2010, choosing a cut-off point above which a guideline can be defined as having good quality is subjective and depends on the context in which the review is performed. As Brouwers et al. [67] noted, “there is no evidence that if a guideline exceeds a certain score, the recommendations are easier to adopt, or improve processes of care, or lead to better patient outcomes than guidelines that do not achieve that score” ([67], p. 195). That is, the validity of the overall assessment may be limited, as there are not yet clear rules on how to weigh the different domain scores when deciding whether to recommend a guideline.

What is new and conclusion

Overall, this study determined that the quality of clinical practice guidelines for the diagnosis and treatment of inflammatory bowel disease is acceptable and that there is still room for improvement, especially in terms of stakeholder participation (inclusion of patients, expert methodologists/epidemiologists) and applicability (enablers, barriers, optimization of resources, external review). It is desirable that guideline developers consider these shortcomings in the future for the overall improvement of guidelines’ quality to reduce clinical practice heterogeneity in IBD.