Background

The Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach provides guidance on rating the quality of research evidence in health care [1]. This approach has been widely implemented by organisations such as the World Health Organization, the Cochrane Collaboration, the Agency for Healthcare Research and Quality (USA) and the National Institute for Health and Care Excellence (UK).

The GRADE approach is comprehensively described in an online manual (freely available for download with the GRADEpro software at http://tech.cochrane.org/revman/gradepro). It is summarised in a series of papers published in the BMJ in 2008 [2] and explained in more detail in a collection of papers in the Journal of Clinical Epidemiology (JCE) beginning in 2011 [1, 3–15]. The GRADE approach is used to assess the quality of evidence for a specific outcome across studies. It applies most directly to a meta-analysis undertaken in the context of a systematic review but can also be applied to individual studies or non-quantitative syntheses when meta-analyses are not available. Evidence from randomised controlled trials (RCTs) begins as high-quality evidence but can be downgraded according to five factors: risk of bias, inconsistency, indirectness, imprecision and publication bias. Evidence from non-randomised studies begins as low-quality evidence, but its rating can be upgraded (provided no other limitations have been identified according to the five factors). Upgrading occurs for three primary reasons: a large magnitude of effect, evidence of a dose-response effect and all plausible confounding having been taken into account. After downgrading or upgrading, the quality of the evidence for each outcome is rated as high, moderate, low or very low. For the purposes of this paper, we focus on the application of GRADE to meta-analyses of RCTs.

A major advantage of GRADE is that it leads to more transparent judgements about the quality of evidence [16]. The five key specific factors that may lead to downgrading the quality of evidence (and the need to state reasons for each downgrade) help to provide a clear rationale for such judgements. In addition, there is detailed written guidance from the GRADE working group for making these judgements. A further advantage of the broader GRADE framework is its ability to indicate the strength of recommendations based on the evidence.

However, this extensive guidance brings with it the challenge of complexity. There are substantial conceptual challenges when an organisation moves from previous methods of assessing overall study quality to implementing GRADE [16]. For example, it may be difficult for a researcher (particularly one with limited experience of systematic reviews and/or GRADE) to remember and apply this guidance consistently (both within their own assessments and between researchers in the same team) in the often time-pressured environment of conducting systematic reviews and guideline development.

Two recent empirical studies have examined the consistency of GRADE assessments. Hartling et al. [17] found poor-to-moderate agreement (based on kappa values) between researchers experienced in systematic review methods when rating quality of evidence. The GRADE working group [18] recently conducted an evaluation of consistency stratified by the raters' level of experience (members of the GRADE working group versus students on the health research methodology programme at McMaster University). Inter-rater agreement was moderate when comparing individual ratings conducted independently by two researchers, both for experienced members of the GRADE working group and for the student group. For members of the GRADE working group, agreement was higher when two individuals rated independently and then reached a consensus judgement, which was then compared with the consensus judgement of another pair using the same method. Both studies suggest that GRADE ratings may not achieve sufficient inter-rater agreement if conducted by only one assessor. Two experienced reviewers conducting a GRADE assessment in duplicate appears to result in acceptable agreement, although this is less clear for inexperienced users of GRADE [18]. Therefore, further resources may improve the consistency of GRADE assessments conducted by reviewers with less experience, or by those with insufficient time and resources for two reviewers to assess independently.

We developed a checklist for evaluating meta-analyses of RCTs for the purpose of informing a GRADE assessment. The checklist covers the main determinants for each of the five factors (risk of bias, inconsistency, indirectness, imprecision, publication bias) that can lead to a downgrading of quality in the GRADE system. In this paper, we describe the development of the checklist and report on agreement between independent raters on each of the constituent items.

Methods

Developing the checklist

Two authors (GS, NM) drew a logic model representing the GRADE assessment process (see Figure 1). The logic model comprises nodes for the various characteristics and properties of the evidence, connected by arrows indicating how these attributes affect each other. The model was populated on the basis of articles written by the GRADE working group, with particular emphasis on the most recent series of articles in the JCE [1, 3–15].

Figure 1

Logic model for developing the checklist based on GRADE criteria. A logic model illustrating how the checklist items relate to the risk of bias, inconsistency, indirectness, imprecision and publication bias domains in GRADE.

From this logic model, we developed a checklist of questions to extract the data required for conducting a GRADE assessment (see Figure 1). The full checklist appears in Additional file 1. All questions were derived as directly as possible from literature produced by the GRADE working group. However, where specific guidance was lacking, some questions relied on our value judgements (e.g. what constitutes a low, medium or high value of I²; the extent to which confidence intervals overlap; or what constitutes a high dropout rate).

Questions about study limitations (risk of bias) were based on items in the Cochrane risk of bias tool [19], as suggested by the GRADE working group [6]. They are answered in relation to the majority of the aggregated evidence in the meta-analysis rather than to individual studies. Questions about inconsistency were based primarily on visual assessment of forest plots and on the statistical quantification of heterogeneity using the I² and Q statistics. This includes assessment of subgroups (particularly those defined a priori) that appear to explain the inconsistency.
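For readers unfamiliar with these statistics, the quantification can be sketched as follows. This is an illustrative computation only (not code from the study): Cochran's Q and the derived I² statistic for a set of per-study effect estimates and standard errors, using inverse-variance weights.

```python
# Illustrative sketch (not from the paper): Cochran's Q and I-squared
# computed from per-study effect estimates and their standard errors.

def q_and_i_squared(effects, std_errors):
    """Return (Q, I^2 as a percentage) for a set of study estimates."""
    weights = [1.0 / se ** 2 for se in std_errors]          # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1                                   # degrees of freedom
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i_squared

# Hypothetical example: five studies with similar effects
effects = [0.30, 0.25, 0.35, 0.28, 0.32]
std_errors = [0.10, 0.12, 0.11, 0.09, 0.10]
q, i2 = q_and_i_squared(effects, std_errors)
print(f"Q = {q:.2f}, I^2 = {i2:.1f}%")  # Q falls well below df = 4 here, so I^2 = 0%
```

An I² near 0% suggests little heterogeneity beyond chance, whereas high values would prompt the checklist's inconsistency questions about subgroups and overlap of confidence intervals.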

Indirectness items included questions on the applicability of the population, intervention, comparator and outcome (whether a surrogate or not, and whether follow-up time was sufficient) based on the majority of the aggregated evidence in the meta-analysis. The checklist does not specifically address reviews using network meta-analysis (as there is not currently any published guidance on this). However, the checklist did consider indirectness in terms of informal indirect comparisons made between interventions based on pairwise meta-analysis.

Imprecision was addressed through items relating to the width of the confidence interval and sample size. When judging the width of the confidence interval, GRADE recommends that reviewers use a clinical decision threshold to assess whether the imprecision is clinically meaningful [8]. We did not explicitly build this into the checklist, although the initial question about whether the estimate is consistent with benefit and harm might be interpreted in relation to a minimally important difference.
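As a rough illustration of this kind of judgement, the following sketch flags imprecision when a 95% confidence interval around a pooled estimate is consistent with both appreciable benefit and appreciable harm. The decision thresholds here are hypothetical values chosen for the example, not figures mandated by GRADE or used in the checklist.

```python
# Hypothetical sketch: does a 95% CI cross both an illustrative benefit
# threshold and an illustrative harm threshold (e.g. on a log-odds scale)?

def ci_spans_benefit_and_harm(pooled, se, benefit=-0.25, harm=0.25):
    """True if the 95% CI crosses both illustrative decision thresholds."""
    lower = pooled - 1.96 * se
    upper = pooled + 1.96 * se
    return lower < benefit and upper > harm

print(ci_spans_benefit_and_harm(0.0, 0.20))   # wide CI (-0.39, 0.39): imprecise
print(ci_spans_benefit_and_harm(0.10, 0.05))  # narrow CI (0.00, 0.20): not flagged
```

In practice, GRADE asks reviewers to set such thresholds from a clinical decision perspective; the checklist's initial question about consistency with both benefit and harm plays an analogous role.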

Questions about publication bias addressed comprehensiveness of the search strategy, whether included studies had industry influence and funnel plot asymmetry and whether there was evidence of discrepancies between published and unpublished trials.

We focused specifically on meta-analyses of RCTs. Therefore, we did not include factors to upgrade the evidence (magnitude of effect, dose-response relationship, adjusting for known confounding) as these are recommended for use only in the context of systematic reviews of non-randomised studies where downgrading was not required for any of the five potential factors.

Initial piloting of the checklist

We conducted two phases of initial piloting of these questions. First, the wider project team applied the draft checklist independently to five systematic reviews and then met to discuss their assessments and any difficulties they experienced. Second, a set of 15 systematic reviews was assessed by two reviewers independently to identify any further potential difficulties. Both phases led to some refining of the instrument. One of the challenges identified by the team in the second phase was that the response ‘yes’ referred to something positive (e.g. was the sequence generation randomised?) in some items and negative in others (e.g. was there selective reporting?). Several reviewers suggested it would be more straightforward to use the checklist if all questions were worded so that ‘yes’ always referred to a positive feature of the quality of evidence and ‘no’ to something negative (where yes and no responses were required). Therefore, we included this modification in the main validation of the checklist.

External validation of the checklist

Following the development and initial piloting of the checklist, we conducted a more formal evaluation of its inter-rater agreement. We examined 29 systematic reviews containing a meta-analysis of RCTs included in the Database of Abstracts of Reviews of Effects (DARE) [20–48] (Table 1). One author (NM) selected the reviews to ensure variety both in terms of reporting and reliability based on an informal assessment of the systematic reviews. Papers were selected to ensure a diverse range of reviews from a variety of disease areas including cardiology, diabetes, chronic obstructive pulmonary disease, neurological conditions (dementia, stroke, Parkinson's disease), oncology (non-small cell lung cancer, ovarian cancer, metastatic colorectal cancer) and prevention of pancreatitis, osteoarthritis, restless legs syndrome and sciatica. In addition, reviews on substance misuse (opioid detoxification, psychostimulant dependence), mental health (depression, anxiety, obsessive compulsive disorder, attention deficit hyperactivity disorder) and HIV prevention were also included. Pharmacological and non-pharmacological interventions were included.

Table 1 Brief summary of systematic reviews included in pilot evaluation

The checklist was used by two reviewers independently for all selected reviews. For each review, one author (NM) developed a review question in terms of the population, intervention, comparison and outcome (PICO) of interest. One critical outcome (the primary outcome) was selected for assessment for each review. GRADE recommends a maximum of nine outcomes, but we limited it to one outcome for the purposes of this study to enable reviewers to assess a wider sample of reviews.

Reviewers based their assessments on the information reported in the systematic review paper and did not seek information from the original reports of the included RCTs, with one exception: evidence of industry involvement was rarely reported in systematic reviews, so the original papers of the largest studies in each meta-analysis were checked for this item.

We calculated a weighted kappa statistic with a 95% confidence interval (CI) for each item of the checklist. We interpreted the coefficients according to the following guidelines: agreement below chance was considered poor; 0.01 to 0.20, slight; 0.21 to 0.40, fair; 0.41 to 0.60, moderate; 0.61 to 0.80, substantial; and 0.81 to 1, almost perfect [49]. Each reviewer recorded the time spent conducting each assessment, and we summarised these data to estimate the likely resource implications of using this approach.
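For illustration, a linearly weighted kappa and the interpretation bands above can be sketched as follows. This is not the analysis code used in the study, and the rating categories shown are hypothetical; it simply shows how disagreement between two raters on an ordinal scale is penalised in proportion to its distance.

```python
# Illustrative sketch (not the study's analysis code): linearly weighted
# kappa for two raters' ordinal ratings, with the interpretation bands
# of Landis and Koch used in the paper.
from collections import Counter

def weighted_kappa(rater1, rater2, categories):
    """Linearly weighted kappa; `categories` gives the ordinal ordering."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(rater1)
    obs = Counter((idx[a], idx[b]) for a, b in zip(rater1, rater2))
    m1 = Counter(idx[a] for a in rater1)   # marginal counts, rater 1
    m2 = Counter(idx[b] for b in rater2)   # marginal counts, rater 2
    w = lambda i, j: abs(i - j) / (k - 1)  # linear disagreement weight
    d_obs = sum(w(i, j) * obs[(i, j)] / n for i in range(k) for j in range(k))
    d_exp = sum(w(i, j) * m1[i] * m2[j] / n ** 2 for i in range(k) for j in range(k))
    return 1 - d_obs / d_exp

def interpret(kappa):
    """Map a kappa value onto the bands quoted in the Methods."""
    bands = [(0.81, "almost perfect"), (0.61, "substantial"), (0.41, "moderate"),
             (0.21, "fair"), (0.01, "slight")]
    return next((label for lo, label in bands if kappa >= lo), "poor")

# Hypothetical ratings on a three-category ordinal item
r1 = ["yes", "no", "unclear", "yes", "yes", "no"]
r2 = ["yes", "no", "unclear", "yes", "yes", "no"]
print(interpret(weighted_kappa(r1, r2, ["no", "unclear", "yes"])))
```

Because disagreements are weighted by their ordinal distance, a "yes" versus "unclear" discrepancy is penalised less than "yes" versus "no", which is why a weighted rather than simple kappa suits checklist items with ordered response options.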

Experience of reviewers

Seven reviewers from the Centre for Reviews and Dissemination (CRD) participated in the checklist assessments. CRD specialises in conducting, critically appraising and developing methods for evidence syntheses and systematic reviews. Years of systematic review experience among participants ranged from 2 to 10. Although six of the reviewers had no formal training or experience in using GRADE, all were experienced in critical appraisal and validity assessment of both primary studies and systematic reviews, for example, using the Cochrane risk of bias tool and the quality assessment criteria developed by CRD.

Two authors (GS, NM) provided informal GRADE training during the course of the study and provided guidance on completing the checklist.

Results

Judgements for each of the checklist items

Measures of agreement for all specific items in the checklist are given in Table 2. For most of the items designed to examine risk of bias, agreement was found to be either almost perfect or substantial. For one item, designed to measure attrition bias, agreement was moderate. For the items concerning 'no other bias' and selective reporting, the level of agreement was relatively low.

Table 2 Agreement for all checklist items

For all items on imprecision, there was almost perfect agreement, reflecting the fact that less judgement is needed for these items; a high level of agreement would therefore be expected.

Four of the five items on inconsistency had almost perfect or substantial agreement. The item on the extent of overlap of confidence intervals was associated with moderate agreement.

Two of the items on indirectness (applicability of population and applicability of intervention) did not perform well; agreement was below chance. For two items (use of a surrogate outcome and direct comparisons), agreement was almost perfect, and for one item (sufficiency of follow-up time for the outcome) it was moderate.

For five out of six items on publication bias, there was substantial or almost perfect agreement. Only fair agreement was found on the item for searching grey literature.

Time required for conducting assessments

The median time to conduct an assessment was 30 min (range = 15–90 min). Informal assessment suggested that the reviewers' years of experience and the complexity of the review were the main factors affecting the variability in time taken to complete an assessment.

Reviewers' feedback suggested that it was relatively straightforward to use the checklist despite almost all reviewers having no prior formal training or experience using GRADE.

Discussion

Some of the main difficulties in applying the GRADE system are the complexity and cognitive demands of the approach and the potential lack of reproducibility of the judgements made. Our proposed checklist helps researchers to identify and extract the information needed for a GRADE assessment in a repeatable manner. In addition, it increases transparency because the items used to make the assessment are clearly identified.

For most checklist items (70%), there was substantial agreement or better, suggesting that most of the data required for a GRADE assessment can be extracted in a repeatable manner.

However, some problems were identified for checklist items relating to indirectness. There was a lack of agreement for most of these items, although there was good agreement on surrogate outcomes. This reflects GRADE guidance which acknowledges that considerable judgement is needed when considering the applicability with regard to populations and interventions [10]. In addition, it is likely that given the wide scope of the reviews being quality assessed, there may have been a lack of clarity about the target population and intervention. As assessments were conducted independently, it was not possible to discuss how to apply criteria for assessing indirectness (or consult with advisors with expertise in the subject area), which is common practice when working in a team of reviewers.

A potential limitation of this study is that we conducted the GRADE assessments on published reviews, therefore basing our judgement on the data reported in these reviews rather than as part of the process of conducting a systematic review. It is possible agreement would either increase or decrease if we used the latter approach. However, our approach reflects the reality of guideline development and of conducting overviews of reviews where evidence from existing systematic reviews will often be utilised to inform conclusions and/or recommendations. It is also consistent with the methods used in a previous evaluation of the inter-rater agreement for GRADE assessments [18].

A further potential limitation is that the systematic reviews we assessed were not randomly selected. However, for the purposes of this study, we were not aiming to provide a random sample of all systematic reviews. Instead, we wanted to ensure that reviews from a variety of disease areas and using different types of interventions were examined during the assessment of the performance of the checklist.

The checklist only includes criteria relating to systematic reviews that include meta-analyses of RCTs. Of course, the GRADE approach can be applied to non-randomised studies and to synthesis techniques other than meta-analysis. Further work is needed to adapt and validate this checklist for such reviews. A further possible limitation is that the effort to make all 'yes' responses reflect low bias may have resulted in awkward wording (e.g. double negatives) for some items. This may have reduced the user-friendliness of the checklist and potentially reduced inter-rater agreement.

While aiming to reduce some of the complexity of conducting GRADE assessments, it is important to note that reviewers still need to develop clearly defined review questions (including the selection of critical or important outcomes) before completing the checklist. In addition, using the checklist still requires the reviewer to make important value judgements. For example, as noted above, considerable judgement is needed to assess the applicability of populations, interventions and comparisons for a particular review question. Similarly, reviewers will need to develop thresholds to assess, for example, whether the width of the confidence interval for a pooled estimate (for questions on imprecision) or the variability in study estimates (for questions on inconsistency) constitutes a clinically meaningful threat to validity.

In addition, the issue of improving repeatability of GRADE assessments that we address in this paper is one among many challenges in the use of GRADE. One of these other challenges is that of how to decide whether to downgrade by none, one or two levels for issues identified in one of the five GRADE domains. Considerable guidance is available from GRADE on how these different factors should be weighted, and scenarios provided where a combination of factors should likely lead to a downgrade by one or two levels. Our checklist provides added structure for the information to inform this decision but does not necessarily highlight when there is sufficient evidence to justify any particular degree of downgrading.

To this end, we have developed a semi-automated quality assessment tool (SAQAT) which is a Bayesian network model that uses responses to the checklist questions to produce GRADE assessments. Although semi-automated approaches have not been widely used in critical appraisal of systematic reviews, they may offer an alternative for reviewers who struggle with the complexity of GRADE. Manuscripts about the SAQAT and its performance in practice are currently in development.

Conclusions

In conclusion, systematic reviewers who are experienced in review methods but have little or no experience of conducting GRADE assessments appear able to answer our checklist questions in a broadly consistent and reproducible manner when assessing the quality of evidence for meta-analyses of RCTs. Further work is needed to improve agreement on judgements relating to the applicability of interventions and populations (as factors to consider within indirectness).

Use of our checklist is most likely to benefit those with limited experience of using GRADE. However, given the complexity of GRADE, we think the checklist may also act as a helpful reminder for experienced users of the key factors to consider for each of the five GRADE domains (risk of bias, imprecision, inconsistency, indirectness and publication bias). Our checklist may also improve efficiency and save time, and may therefore be beneficial in the context of a rapid review. For example, we found that inexperienced users of GRADE were able to complete an assessment in a median of 30 min.

Next steps include the need to pilot the use of this checklist when conducting systematic reviews and/or during guideline development and assess whether this results in more consistent judgements when conducting a GRADE assessment. It would also be important to assess further the utility of using the checklist for different review questions and disease areas, and the extent to which it might need to be adapted. We particularly encourage researchers and health technology assessment organisations currently using or considering using GRADE to pilot our tool and provide us with feedback to inform further development.