Introduction

Low back pain (LBP) is a major contributor to years lived with disability and a leading cause of limited activity and absence from work [1, 2]. In response to the global burden of LBP, major medical societies and specialized working groups have developed clinical practice guidelines (CPGs) for its diagnosis and management [3, 4]. The principles of CPG design are well established, but the proliferation of CPGs has cast doubt on their quality [5]. The current gold standard for appraising CPG quality is the Appraisal of Guidelines for REsearch & Evaluation (AGREE) instrument, developed by the AGREE Collaboration in 2003 [6,7,8]. The updated version, AGREE II, consists of 23 appraisal criteria (items) grouped into six independent quality domains, plus two overall assessment items: one to evaluate overall CPG quality (overall assessment 1) and one to judge whether a CPG should be recommended for use in practice (overall assessment 2) [9]. Substantial time and resources go into developing CPGs ex novo, so it may be more efficient to adapt a high-quality CPG (or selected recommendations), when available, for local use [10,11,12]. Systematic review authors can apply AGREE II in their critical appraisal of CPGs for LBP [13,14,15,16,17], but stakeholders, clinicians, and policy makers may find it difficult to discern the highest-quality CPG when appraisals assign different quality ratings to overlapping CPGs. With this study, we aimed to determine the proportion of CPGs evaluated in more than one appraisal (i.e., overlapping CPGs) and to measure the inter-rater reliability (IRR) and variability of AGREE II scores for overlapping CPGs.

Materials and methods

Meta-epidemiological study

The study was conducted according to the guidelines for reporting meta-epidemiological methodology research [18] since the specific reporting checklist for methods research studies is currently under development (MethodologIcal STudy reporting Checklist [MISTIC]) [19]. The protocol is available on the public Open Science Framework (OSF) repository at https://osf.io/rz7nh/

Search strategy and study selection

We summarized the findings of systematic reviews that applied the AGREE II tool to appraise the quality of CPGs for LBP. We defined these systematic reviews as “appraisals”. For details about the AGREE II instrument, see https://www.agreetrust.org/resource-centre/agree-ii/.

We systematically searched six databases (PubMed, EMBASE, CINAHL, Web of Science, PsycINFO, PEDro) from January 1, 2010 (the year AGREE II was published [6]) through March 3, 2021. The full search strategy is presented in Additional file 1.

Eligibility criteria

Two independent reviewers screened titles and abstracts against the eligibility criteria: 1) systematic reviews (i.e., CPG appraisals) that used the AGREE II tool to evaluate CPG quality; 2) CPGs for LBP prevention, diagnosis, management, and treatment irrespective of cause (e.g., non-specific LBP, spondylolisthesis, lumbar stenosis, radiculopathy); 3) AGREE II ratings were reported. Appraisals of mixed populations (e.g., neck and back pain) were included when the data on back pain were reported separately. A third reviewer was consulted to resolve disagreements. Rayyan software [20] was used to manage screening and selection.

Data extraction

Data were entered on a pre-defined data extraction form (Excel spreadsheet). Two authors extracted the data for: study author, year of publication, protocol registration, number of raters, training in use of the AGREE II tool, population, intervention, exclusion criteria for each appraisal, references of CPGs, AGREE II items/domain scores and two overall assessments: overall assessment 1 (overall CPG quality [measured on a 1-7 scale]) and overall assessment 2 (recommendation for use [yes, yes with modifications, no]). When reported by the appraisers, quality ratings (high, moderate, low) were also extracted.

The reporting of overall assessment varied across appraisals. For overall assessment 2, we collected information about the number of raters who selected the categories “yes”, “yes with modification” or “no” (e.g., 75% raters judged “yes”; 25% “yes with modifications” and 0% “no”) labeling this “raw recommendation for use”. In appraisals that reported only a single recommendation (such as yes) without the percentage for all three categories, we assigned this category by default, labeling this “final recommendation for use” [21].

The corresponding authors were contacted when AGREE II domain scores and overall assessments were not reported. When no response was received, we calculated the domain scores based on AGREE II item scores according to the AGREE II formulas [9].
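The recalculation above follows the standard AGREE II scaling: a domain score is the sum of the item ratings across raters, rescaled between the minimum and maximum scores possible for that domain. The sketch below is a minimal illustration of that formula; the example ratings are hypothetical, not data from the study.

```python
def agree_ii_domain_score(item_scores):
    """Scaled AGREE II domain score (0-100).

    item_scores: one list per rater, each holding that rater's
    1-7 rating for every item in the domain.
    """
    n_raters = len(item_scores)
    n_items = len(item_scores[0])
    obtained = sum(sum(rater) for rater in item_scores)
    minimum = 1 * n_items * n_raters   # every rater scores every item 1
    maximum = 7 * n_items * n_raters   # every rater scores every item 7
    return 100 * (obtained - minimum) / (maximum - minimum)

# Hypothetical example: a 2-item domain rated by 3 raters
scores = [[5, 6], [4, 5], [6, 7]]
print(round(agree_ii_domain_score(scores), 1))  # 75.0
```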

Data synthesis and analysis

The characteristics of the appraisals eligible for inclusion were summarized using descriptive statistics. Overlapping was defined as the number of times a CPG was re-assessed for quality in different appraisals using the AGREE II tool. We measured IRR and variability of the AGREE II domain scores for CPGs that were assessed by at least three appraisals. We used the average intraclass correlation coefficient (ICC) with 95% confidence interval (CI) of the six domain scores to estimate agreement among appraisals of overlapping CPGs [22]. The degree of agreement was graded according to Landis and Koch [23]: slight (0.01-0.2); fair (0.21-0.4); moderate (0.41-0.6); substantial (0.61-0.8); and almost perfect (0.81-1). For quantitative variables (AGREE II domain scores and overall assessment 1), we measured variability by calculating the interquartile range (IQR) as the difference between the first and the third quartile (Q3-Q1). We measured variability in qualitative variables (overall assessment 2 and quality ratings) as agreement/disagreement of judgments. We defined “perfect agreement” as all appraisals giving the same judgment for the same category (e.g., all judged “high quality” for the same CPGs, IRR=1). Variability of each of the six domain scores for the overlapping CPGs (assessed by at least three appraisals) is reported as mean IQR. Statistical significance was set at P < 0.05. All tests were two-sided. Data analysis was performed using STATA [24].
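The two summary measures described above can be sketched in a few lines. This is an illustrative sketch, not the study's analysis code: `iqr` computes Q3-Q1 as defined, and `landis_koch` maps an ICC onto the Landis and Koch labels quoted in the text; the example domain scores are hypothetical.

```python
import statistics

def iqr(values):
    """Interquartile range Q3 - Q1, the study's variability measure."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

def landis_koch(icc):
    """Grade an ICC on the Landis & Koch scale used in the study."""
    for cutoff, label in [(0.2, "slight"), (0.4, "fair"),
                          (0.6, "moderate"), (0.8, "substantial"),
                          (1.0, "almost perfect")]:
        if icc <= cutoff:
            return label

# Hypothetical Domain 6 scores for one CPG across four appraisals
print(iqr([10, 25, 60, 75]))   # 42.5
print(landis_koch(0.85))       # almost perfect
```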

Results

Search results

The systematic search retrieved 254 records. After duplicates were removed, 192 records were obtained, 163 of which were discarded. The full text of the remaining 29 was examined; 12 did not meet the inclusion criteria (Fig. 1). Finally, 17 appraisals that applied the AGREE II tool were included in the analysis [17, 25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40].

Fig. 1
figure 1

Study selection flow chart

Characteristics of CPG appraisals

Table 1 presents the general characteristics of the 17 appraisals. The median year of publication was 2020 (range, 2015-2021). Eleven appraisals assessed CPGs for LBP and six assessed CPGs not restricted to LBP alone (e.g., chronic musculoskeletal pain). Seven appraisals (41.2%) reported a protocol registration in PROSPERO and two (11.8%) a protocol registration in other online registries or repositories. Three appraisals (17.6%) involved four AGREE II raters and the remaining involved two or three. Six appraisals (35.3%) stated that the raters had received training for using the tool. The rating of all six domains was reported in 14 appraisals (82.4%) [17, 25,26,27,28,29,30,31,32,33,34,35, 37, 40] and the rating of 23 item scores in two [36, 39]. Overall assessment 1 (overall CPG quality) was reported in 11 appraisals (64.7%) [25,26,27,28,29,30, 32,33,34, 37, 40] and overall assessment 2 (recommendation for use) in four (23.5%) [25, 27, 28, 32]. A quality rating (not part of the AGREE II tool) was given in nine appraisals (53%) [17, 26, 28, 29, 33, 34, 36, 37, 39]. One appraisal reported AGREE II ratings in supplementary materials that were unavailable [38]. Four authors [29, 35,36,37] supplied missing data as requested.

Table 1 General characteristics of appraisals

Overlapping CPGs

A total of 43/106 CPGs (40.6%) were overlapping in 17 appraisals (i.e., assessed by at least two appraisals) and 23 CPGs [42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65] had been assessed by at least three appraisals. The six CPGs that most often overlapped were issued by: the National Institute for Health and Care Excellence (NICE) 2016 [63] (9 appraisals), the American College of Physicians (ACP) 2017 [47] (8 appraisals), the American Physical Therapy Association (APTA) 2012 [56] (8 appraisals), the Belgian Health Care Knowledge Centre (KCE) 2017 [43] (6 appraisals), the American Pain Society (APS) 2009 [65] (5 appraisals), and the Council on Chiropractic Guidelines and Practice Parameters (CCGPP) 2016 [55] (5 appraisals). Table 2 presents the overlapping CPGs.

Table 2 Overlapping CPGs for LBP

Inter-rater reliability

Table 3 presents the ICC averages of the overlapping CPGs assessed by at least three appraisals. IRR was almost perfect in 13 CPG ratings (56.6%), substantial in five (21.7%), moderate in two (8.7%), fair in one (4.3%), and slight in two (8.7%). The highest agreement was reached in the ACP 2017 [47], the APTA 2012 [56], and the APS 2009 [65] and the lowest in the NICE 2009 [64], the Toward Optimized Practice Low Back Pain Working Group (TOP) 2017 [53], and the American Society of Interventional Pain Physicians (ASIPP) 2013 [49]. In the most often overlapping CPGs (at least five appraisals), the IRR was almost perfect in all, except the KCE 2017 [43] (substantial).

Table 3 ICC of overlapping CPGs assessed by at least three appraisals

Variability in domain scores

The most variable domains of overlapping CPGs (assessed by at least three appraisals) were Domain 6 - Editorial Independence (mean IQR 38.6), Domain 5 - Applicability (mean IQR 28.9), and Domain 2 - Stakeholder Involvement (mean IQR 27.7). Among all domains, the most variable CPG was issued by TOP 2017 [53] (mean IQR 51.4) and the least was issued by the Institute for Clinical Systems Improvement (ICSI) 2018 [44] (mean IQR 11) (Table 4). Domain 6 – Editorial Independence was the most variable domain of the CPGs that most often overlapped (assessed by at least five appraisals) (Fig. 2).

Table 4 Domain score variability of overlapping CPGs assessed by at least three appraisals
Fig. 2
figure 2

Variability of six domains of AGREE II applied to the most often overlapping CPGs (assessed by at least five appraisals). The vertical axis represents AGREE II domain scores (0-100), the horizontal axis represents six AGREE II Domains. Legend. Domain 1: Scope and Purpose, Domain 2: Stakeholder involvement, Domain 3: Rigour of Development, Domain 4: Clarity of presentation, Domain 5: Applicability, Domain 6: Editorial Independence. ACP: American College of Physicians; APS: American Pain Society; APTA: American Physical Therapy Association; CCGPP: Council on Chiropractic Guidelines and Practice Parameters; KCE: Belgian Health Care Knowledge Centre; NICE: National Institute for Health and Care Excellence. * NICE 2016 was assessed by nine appraisals but the domain scores were available for eight; ACP 2017 was assessed by eight appraisals but available for seven

Variability of overall assessments 1 and 2

Because of missing data and heterogeneity of reporting (e.g., 0-100 scale or 1-7 scale for overall assessment 1; raw recommendation for use or final recommendation for use for overall assessment 2), we transparently reported the judgments of the two overall assessments of overlapping CPGs assessed by at least three appraisals in Table 5. For overall assessment 2, perfect agreement was achieved in 5/20 CPG assessments (25%), heterogeneity of reporting was found in 8/20 (40%), and complete agreement was not reached in 7/20 (35%). For quality ratings (high, moderate, low), perfect agreement was achieved in 10/19 (53%) while the remaining 9/19 (47%) did not completely agree.

Table 5 AGREE II overall assessment of CPGs assessed by at least three appraisals

Table 6 presents the variability in the most often overlapping CPGs (assessed by at least five appraisals). Overall assessment 1 varied the most in the KCE 2017 [43] (IQR 23 on a 0-100 scale) and the least in the NICE 2016 [63] (IQR 9.4 on a 0-100 scale). Agreement in quality ratings was perfect in the NICE 2016 [63] (3/3 high quality), the APTA 2012 [56] (3/3 low quality), and the CCGPP 2016 [55] (2/2 high quality).

Table 6 AGREE II overall assessment of the most often overlapping CPGs (assessed by at least five appraisals)

Recommended CPGs

Additional file 3 lists the CPGs that can be recommended for clinicians based on: overall assessment 2 (i.e., yes recommendation for use); quality rating (i.e., high); agreement of appraisals that overlapped for the same CPG (i.e., perfect agreement as measured by the ICC); and updated status of publication. Overall, NICE 2016 [63] and CCGPP 2016 [55] ranked first and second, respectively.

Discussion

More than one third of CPGs for LBP have been re-assessed by different appraisals over the last six years. This implies a potential waste of time and resources, since many appraisals assessed the same CPGs. Researchers contemplating an AGREE II appraisal of CPGs for LBP should think carefully before embarking on a new systematic review, and editors should bear in mind that much has already been published. Although the PRISMA [86] and the PROSPERO [87, 88] initiatives have been around for more than 10 years, only half (53%) of the appraisals were registered as systematic reviews. Nonetheless, almost perfect or substantial agreement in 78% of AGREE II ratings confirmed the CPG quality. Agreement was highest in the ACP 2017 [47], the APTA 2012 [56], and the APS 2009 [65], and lowest in the NICE 2009 [64], the TOP 2017 [53], and the ASIPP 2013 [49].

Here we compared similarities and differences across appraisals. A plausible explanation for the discrepancy in the degree of agreement on CPGs is that the AGREE II tool includes different information within a single item. Raters may focus their attention on some aspects more than others because there is no composite weight of judgement [55]. In addition, discordances may stem from the availability and ease of access to supplementary contents to better address domain judgment. AGREE II does, however, recommend that raters read the clinical CPG document in full, as well as any accompanying documents [9].

Analysis of variability within domains of the appraisals that assessed the same CPG showed that the most variable domains were Domain 6 – Editorial Independence, Domain 5 – Applicability, and Domain 2 – Stakeholder Involvement. Reporting of Domain 6 item scores was poor for some CPGs, leaving potential financial conflicts of interest between CPG developers, stakeholders, and industry unclear [89]. Conflict of interest can arise for anyone involved in CPG development (funders, systematic review authors, panel members, patients or their representatives, peer reviewers, researchers) [90] and can bias recommendations, with consequences for patients [91, 92]. Affiliation, member role, and management of potential conflicts of interest in the recommendation process must be transparently reported to improve judgment consistency. There is an important difference between declaring an interest and determining and managing a potential conflict of interest [93, 94]. While not all interests constitute a potential cause for conflict, the assessment must be fully described before a decision is taken [95]. Furthermore, inadequate information results in an unclear conflict of interest statement, which can open the way to subjective judgment and variation in the scores for this domain. One solution would be a document that identifies explicit links between interests and conflicts of interest for each CPG recommendation, so as to support a transparent judgment.

Unsurprisingly, Domain 2 – Stakeholder Involvement also varied widely because it shares the same issue regarding the description of CPG development groups. This domain allows a broad assessment of patient values, preferences, and experiences (e.g., patient/public participation in a CPG development group, external review, interviews, or literature review), which could be perceived as valid alternative strategies rather than a combination of actions. For example, one would expect patient involvement on an LBP CPG development panel rather than consultation of the literature on patient values. This choice reflects genuine patient involvement because it influences guideline development, implementation, and dissemination. CPGs developed without patient involvement may ultimately not be acceptable for use [96].

Domain 5 – Applicability has also been found to be poorly and heterogeneously reported for other conditions [5, 97]. One reason for this domain's variability is that its items often rely on information supplementary to the main guideline document. Supplementary documents may no longer be retrievable, especially if the CPG is outdated. Implementation of CPGs is not always considered an integral activity of CPG development. Without an assessment of CPG uptake (e.g., monitoring/audit, facilitators, barriers to application), recommendations may not be fully and adequately translated into clinical practice [5]. In some cases, monitoring is not enough without indications or solutions to overcome barriers. Balancing these judgments is difficult and may result in variability for this domain.

Finally, due to missing data (overall assessment 1 not reported in 35% of appraisals; overall assessment 2 not reported in 76% of appraisals) and heterogeneity of reporting (1-7 point or 0-100 point scales; final recommendation for use or raw recommendation for use), we found it difficult to synthesize agreements and draw implications for clinical practice. Though not mandatory in the AGREE II tool, a quality rating (high, moderate, low) was reported in 53% of appraisals, but agreement was perfect in only half of them. Our findings are consistent with a previous study on CPG appraisals in rehabilitation, in which reporting of the two overall assessments was poor and the quality ratings differed from low to high in more than one fourth of appraisals when different cut-offs were applied to rate the same CPG [21].

In general, variability can be partly explained by the different number of items in each domain, the number of raters, and the subjective rating of AGREE II items, which can be weighted differently owing to leniency and strictness bias [98].

Another factor that could explain variability is suboptimal use of the AGREE II tool: 65% of the CPG appraisals in our sample did not state whether the raters had received training in use of the AGREE II tool [99], and only 18% involved at least four raters, as recommended in the AGREE II manual [9]. Clinical and methodological competences should always be well balanced among raters, and reported, to ensure adherence to high standards. We strongly suggest that appraisals report whether raters have received AGREE II training [99]. Issues with AGREE II validity may arise when the training resources (e.g., the AGREE II video tutorials; the “My AGREE PLUS” platform) are not consistently updated [100].

Strengths and limitations

This is the first meta-epidemiological study to examine the overlap of appraisals applying the AGREE II tool to CPGs for LBP. The sizeable sample of appraisals encompassing CPGs for LBP prevention, diagnosis, and treatment supports the external validity of our findings. Nevertheless, some limitations must be noted. We used the overlapping CPGs assessed by at least three appraisals as the unit of analysis, including CPGs assessed by up to eight appraisals, which may have increased judgment variability. On the conservative side, however, when we restricted our analysis to CPGs assessed by at least five appraisals, the results showed patterns similar to the larger primary sample. We assessed the variability of overall assessments and quality ratings reported by appraisals only when the data were available and homogeneously reported. We did not standardize or convert judgments when the data were reported heterogeneously (e.g., 1-7 point scale or 0-100 scale; final recommendation or raw recommendation for use). This cautious strategy meant that we could not measure the variability of overall assessments for the whole sample, since the data were missing from 35% (overall assessment 1) to 76% (overall assessment 2) of appraisals. Such percentages of poor reporting are known [97, 101, 102], and similar findings were documented for a large sample of CPGs on rehabilitation (35% overall assessment 1 and 58% overall assessment 2) [21].

Implications

We suggest that time and resources for conducting LBP appraisals can be optimized when appraisal raters follow the AGREE II manual recommendations for conduct (e.g., number of raters; AGREE II training) and reporting (e.g., overall assessment 2). Before starting a new appraisal, researchers should check academic databases and systematic review registers (e.g., PROSPERO) for published appraisals. Journal editors could also help reduce redundancy by checking manuscript submissions for compliance with the AGREE II manual and high-quality standards of reporting. Finally, the AGREE Enterprise should invest effort in promoting more transparent and detailed reporting (i.e., support of judgment for AGREE II domains and overall assessments). Considering a wide evaluation including overall assessment 2 (i.e., a yes recommendation for use), quality rating (i.e., high), agreement of appraisals that overlapped for the same CPG (i.e., perfect agreement), and updated status of publication, we found that NICE 2016 [63] and CCGPP 2016 [55] would be of value and benefit to clinicians in their practice with LBP patients.

We are aware that a CPG has a limited life span between the systematic search conducted to answer the clinical questions and the year of publication of the guideline itself [27]. The validity of recommendations more than three years old is potentially questionable [103].

Conclusion

We found that more than one third of the CPGs in our sample had been re-assessed for quality by multiple appraisals during the last six years. We found poor and heterogeneous reporting of recommendations for use (i.e., overall assessment 2), which generates unclear information about their application in clinical practice. Clinicians need to be able to rely on high quality CPGs based on updated evidence with perfect agreement by multiple appraisals.