Rating the quality of evidence and the strength of recommendations using GRADE
- Cite this article as:
- Canfield, S.E. & Dahm, P. World J Urol (2011) 29: 311. doi:10.1007/s00345-011-0667-2
Urologists can benefit from a standardized system for guideline development and presentation. This article introduces the GRADE system and explains how it may be useful to urologists, both in their practice and in their healthcare systems.
The GRADE system is reviewed. Specific aspects of how GRADE rates the quality of the evidence and the strength of recommendations are explored.
GRADE can provide explicit and structured guidance, which separates the quality of evidence from the strength of recommendations. This information can be used by consumers of guidelines, including patients, physicians, and policy makers.
Urologists can benefit from a more transparent and rigorous framework when formulating recommendations. GRADE is an emergent proposal with broader implications for healthcare policy as well.
Keywords: Guidelines, Levels of evidence, Urology
Urologic guidelines can be difficult to compare due to wide variations in how they rate the quality of evidence and present their recommendations. Often, the authors cannot offer an actual recommendation due to perceived gaps in the evidence, or when a recommendation is offered, it may be unclear what the ranking of the recommendation implies. Unfortunately, such guidelines may not provide much actual guidance and may introduce confusion. Grading of Recommendations Assessment, Development, and Evaluation (GRADE) is a system that provides clear and concise information on both the quality of the evidence and the strength of the recommendation. The system can be used when developing systematic reviews and when formulating recommendations in the context of guidelines. Information on patient-important outcomes is presented in a systematic and explicit fashion that can be used by physicians, patients, and policy makers.
Why is a better system needed?
Examples of current guideline rating and grading formats
A—randomized controlled trials (RCTs) or systematic reviews of RCTs
1—high-quality evidence along with uniform level of consensus
A—clinical studies of good quality and consistency addressing the specific recommendations and including at least one randomized controlled trial
B—non-randomized trials and other observational studies
2a and 2b—lower-quality evidence with uniform or non-uniform consensus
B—well-conducted clinical studies, but without randomized clinical trials
3—any quality evidence but with major disagreement among panelists
C—recommendations made despite the absence of directly applicable clinical studies of good quality
These three examples illustrate the diverse approaches taken by just a few of the many important organizations that seek to provide practical guidance to urologists. Yet an attempt to compare the information provided by these different groups would prove difficult. The methodologies are too different, and there is no cross-walk from the guidelines of one organization to those of another. In many cases, it also remains unclear which considerations led guideline developers from a given level of evidence to a certain recommendation. In other cases, no specific recommendation is made. A structured format for guideline development and presentation that rates the quality of available evidence and defines the factors involved when grading the strength of the recommendations would be a powerful tool to synthesize information and could provide better guidance to practitioners.
What is GRADE and how was it developed?
The GRADE system was created in response to the need for a more unified and transparent approach to guideline creation and reporting. Individuals from all over the world, many from leading organizations involved in defining levels of evidence, including NICE, AHRQ, and the National Health and Medical Research Council (NHMRC), formed the GRADE working group in 2000 and have since been working on the development of the GRADE system. This framework has now been adopted as the standard for guideline development by over 50 international organizations, including the World Health Organization (WHO), the Cochrane Collaboration, SIGN, AHRQ, and the Centers for Disease Control and Prevention (CDC). Current resources on the GRADE methodology include the original series for guideline developers published in the British Medical Journal [7–11] and the GRADE working group website.
How does GRADE actually work?
Determining the evidence quality
Example of a theoretical evidence profile for medical expulsive therapy with alpha-blockers 
Summary of findings
Estimate of effect
Stone free rate
Moderate heterogeneity detected*
No important imprecision
Suggestion of mild publication bias
RR = 1.45; 95% CI: 1.34–1.57
++ , low
No important inconsistency
No important imprecision
Suggestion of mild publication bias
Analgesic requirements were lower in all studies compared to placebo
++ , low
Very serious (−2)
No important inconsistency
Publication bias likely
Sparse reporting, range: 0–12%
+, very low
Definitions for GRADE quality ratings and strength of recommendations
Quality of evidence
Strength of recommendation
Strong recommendation for using an intervention
Weak recommendation for using an intervention
Weak recommendation against using an intervention
Very low quality
Strong recommendation against using an intervention
Factors that influence quality of evidence
Downgrading the evidence. Quality is lowered by:
- Limitations of study design
- Inconsistency of results
- Indirectness of evidence
- Reporting or publication bias
Upgrading the evidence. Quality is raised by:
- Large magnitude of effect
- Confounding which would reduce the effect
Factors which can lower quality
Limitations that may increase bias within study results include lack of allocation concealment; lack of blinding, especially for subjective outcomes; failure to adhere to intention-to-treat principles; large loss to follow-up; failure to report on all patient-important outcomes, especially ones that would logically have been measured; and stopping early for apparent benefit. Inconsistent results (termed heterogeneity in systematic reviews), meaning results that differ widely across studies, raise concerns that there may be true differences in treatment effect that cannot be explained. This becomes problematic when attempting to generalize a treatment recommendation. Indirect evidence arises when different studies assess separate components of a question, such as results with open or laparoscopic surgical procedures, but not via a direct comparison. The results can then be compared indirectly, but this may lower the quality of that evidence. Commonly, this applies when different medication classes have been studied for the same outcome, but independently of one another, such as calcium channel blockers and alpha-blockers as medical expulsive therapy (MET) for ureteral stone passage. Indirectness can also apply to any of the components that inform a clinical question: different patient populations, different interventions, different comparators, and different outcomes. Imprecision occurs when studies report few events and therefore yield wide confidence intervals. Even for a well-designed trial, imprecision will lower the quality rating because we are less confident in the results. Finally, the suggestion of reporting or publication bias clouds the existing evidence with uncertainty about unpublished results or studies, which also lowers the rating.
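The link between few events and wide confidence intervals can be illustrated with a short calculation. The sketch below is not part of the GRADE methodology itself; it uses the standard log-method approximation for the confidence interval of a relative risk, and the function name and all event counts are hypothetical values chosen only for illustration.

```python
import math

def relative_risk_ci(events_tx, n_tx, events_ctrl, n_ctrl, z=1.96):
    """Relative risk with an approximate 95% CI (log method).

    All inputs are hypothetical counts: events and group sizes for the
    treatment and control arms.
    """
    rr = (events_tx / n_tx) / (events_ctrl / n_ctrl)
    # Standard error of log(RR)
    se_log = math.sqrt(1 / events_tx - 1 / n_tx + 1 / events_ctrl - 1 / n_ctrl)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper

# Same relative risk (1.5) in a small and a large hypothetical trial
small = relative_risk_ci(15, 20, 10, 20)
large = relative_risk_ci(750, 1000, 500, 1000)

for label, (rr, lo, hi) in (("small", small), ("large", large)):
    print(f"{label}: RR = {rr:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

Both hypothetical trials give the same point estimate, but the interval from the small trial is much wider and includes 1 (no effect), which is the kind of result that would prompt a rating down for imprecision.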
Factors which can raise quality
Observational studies cannot overcome certain biases inherent in non-randomized designs and therefore default to low-quality evidence. However, certain features can increase the quality of such studies. When methodologically strong observational studies yield large or very large and consistent estimates of the magnitude of a treatment effect, we may be confident about the results. In those situations, although the observational studies are likely to have provided an overestimate of the true effect, the weak study design is unlikely to explain all of the apparent benefit. Dose–response gradients, where increased medication doses correspond directly to increased effects, also provide stronger evidence. Finally, when potential bias would tend to oppose the effect seen, rather than enhance it, we can surmise the true effect may be even greater than what is reported. This can raise the quality rating. For example, investigators assessed treatment intensity for early-stage bladder cancer, out of concern that such intensity was neither evidence based nor cost-effective. Logically, more intense treatments might be expected to correlate with improved outcomes. The bias should be toward improved survival in this group, due to confounders such as more motivated and healthier patients who were willing and able to undergo more intense treatment regimens. Yet the study showed no benefit to more intense treatment. Because the expected bias favors intense treatment, removing it would make a benefit to intense treatment even less likely to be found. Of note, all these determinations of study quality are typically made about an entire body of evidence as represented in a systematic review and meta-analysis, rather than an individual study.
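The rating logic described above can be summarized in a minimal sketch. It assumes the usual GRADE convention that randomized trials start as high-quality evidence and observational studies as low, that each concern lowers the rating by one or two levels, and that each upgrading factor raises it by one or two levels. The function name, the dictionary layout, and the example judgments are hypothetical; in practice each step is a panel judgment about a body of evidence, not a mechanical score.

```python
# Quality levels ordered from lowest to highest
LEVELS = ["very low", "low", "moderate", "high"]

def rate_quality(randomized, downgrades=None, upgrades=None):
    """Return a GRADE-style quality level for one outcome.

    randomized: True for a body of randomized trials, False for
                observational studies (assumed starting points).
    downgrades: dict mapping a concern to the levels removed (1 or 2).
    upgrades:   dict mapping an upgrading factor to the levels added.
    """
    downgrades = downgrades or {}
    upgrades = upgrades or {}
    start = 3 if randomized else 1  # "high" vs "low"
    score = start - sum(downgrades.values()) + sum(upgrades.values())
    score = max(0, min(score, len(LEVELS) - 1))  # stay within the scale
    return LEVELS[score]

# Hypothetical judgments for three bodies of evidence
stone_free = rate_quality(True, {"inconsistency": 1, "publication bias": 1})
adverse_events = rate_quality(True, {"study limitations": 2, "publication bias": 1})
large_effect = rate_quality(False, upgrades={"large magnitude of effect": 1})

print(stone_free)      # low
print(adverse_events)  # very low
print(large_effect)    # moderate
```

Keeping the concerns in a named dictionary, rather than passing a bare number, preserves the transparency that GRADE aims for: a reader can see which factors drove the final rating.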
Determining the strength of a recommendation
Recommendations must always balance the desirable effects of an intervention with the undesirable effects. It is often unclear how individual patients may feel about this balance. This complexity is where GRADE attempts to provide consistency for developers and transparency for consumers. The strength of a recommendation will reflect the level of confidence we have that implementing a recommendation will do more good than harm. Many guideline systems grade their recommendations, but these may be cumbersome to interpret. The GRADE system strives for simplicity. To that end, it allows developers to weigh the quality of the evidence together with three other equally important factors to derive a yes or no recommendation, with the ability to explain why the recommendation is either “strong” or “weak” (Table 3). GRADE also discourages guideline developers from making judgments such as “no recommendations can be made,” since clinicians and patients still have to make a decision. Rating the strength of the recommendation allows developers to use judgment within the rules of the GRADE framework, and the transparent presentation allows users to decide if they agree with those judgments.
GRADE provides specific definitions of what its recommendations should signal to different individuals. A strong recommendation implies that most patients would want the intervention, that physicians should routinely offer the intervention, and that policy makers may adopt the practice in most situations. A weak recommendation implies that although the majority of fully informed patients would still want the intervention, a substantial proportion would not, that physicians should therefore offer a discussion of alternatives, and that policies may reflect the uncertainty. Best practice policy based on weak recommendations may, for example, be to ensure that a discussion of options occurs.
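These definitions amount to a small lookup from recommendation strength and audience to an implication. The sketch below encodes only what is stated in the paragraph above; the dictionary and function names are hypothetical.

```python
# Implications of each recommendation strength, by audience,
# paraphrased from the definitions in the text
IMPLICATIONS = {
    "strong": {
        "patients": "Most patients would want the intervention.",
        "physicians": "Routinely offer the intervention.",
        "policy makers": "The practice may be adopted in most situations.",
    },
    "weak": {
        "patients": "Most informed patients would want the intervention, "
                    "but a substantial proportion would not.",
        "physicians": "Offer a discussion of alternatives.",
        "policy makers": "Policies may reflect the uncertainty, for example "
                         "by ensuring a discussion of options occurs.",
    },
}

def implication(strength, audience):
    """Look up what a recommendation signals to a given audience."""
    return IMPLICATIONS[strength][audience]

print(implication("weak", "physicians"))
```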
GRADE proposes four factors to determine the strength of a recommendation. The first, quality of the evidence, is a key component. This rating refers to the overall quality of evidence across outcomes. The lower the quality of the evidence, the more likely a weak recommendation may be warranted. The second factor in determining recommendation strength is the level of certainty regarding the balance of advantages and disadvantages of an intervention. For an intervention with clear benefit and few side effects, such as a single dose of antibiotic to prevent urinary infections associated with certain urologic procedures, the decision for a strong recommendation may be clear. Even when disadvantages are significant, such as with high-dose chemotherapy for testicular cancer, a strong recommendation may reflect the fact that despite the toxicity of chemotherapy, most young men would choose to undergo this therapy for the survival advantage it affords. There is little uncertainty about this balance. When uncertainty exists about the balance of benefits and harms, a weak recommendation may be appropriate. For example, older generation anticholinergics to treat overactive bladder often cause significant dry mouth and constipation. The relative balance of these effects would likely differ greatly among patients, leading to large uncertainty for a recommendation.
A critical component of evidence-based medicine is incorporating patient values and preferences into clinical care. This step informs the third factor in determining the strength of a recommendation in the GRADE system. When there are large variations in how different patients may value aspects of the treatment or outcomes, it is more likely that a weak recommendation may be appropriate. Such may be the case with treatments for localized prostate cancer which, for example, may have different outcomes for erectile function. It may be inappropriate to make a strong recommendation for one specific treatment over another because each treatment has a complex set of associated quality of life outcomes, which will be weighted differently by different patients. A weak recommendation in this setting will prompt a discussion of options. Alternatively, patients who wish to improve the likelihood of regaining continence after radical prostatectomy may be encouraged to perform pelvic floor exercises. This intervention may merit a strong recommendation despite lower quality evidence because of the strong preference men in this treatment group are likely to have in favor of it and its low likelihood of harm. Any strong recommendation would also hinge on a determination that the benefits outweigh the potential harms (as outlined above).
The fourth component the GRADE system attempts to incorporate in determining the strength of a recommendation is cost. Resource allocation is becoming a vital component in healthcare policies around the world, and guideline developers may wish to take this component into careful consideration. Guidelines will be used to guide policy as well as individual decision making. Cost may be more difficult to assess than other factors, especially when trying to determine broader costs like those incurred by society as a whole. Therefore, cost is best viewed with perspective and context. It is not as simple as stating that higher cost requires a weak recommendation. For example, new vaccine therapy for metastatic castrate-resistant prostate cancer is extremely expensive. Policy makers would likely wish guideline developers to assume a societal perspective and factor cost into their recommendations, while individual patients with appropriate healthcare insurance may not care to be limited by cost. The extent to which guideline developers consider costs varies greatly between countries; for example, the United States and the United Kingdom are at opposite extremes. While guidelines in the United States typically do not formally consider costs in the guideline development process, NICE guidelines are always accompanied by a formal cost-effectiveness analysis.
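One way to keep the four factors explicit is to record the panel's judgment on each and report them alongside the recommendation. The sketch below is a hypothetical illustration, not a GRADE tool: the class, its field names, and the example values are assumptions. It deliberately does not compute the strength, because, as the pelvic floor exercise example shows, a panel may still issue a strong recommendation despite lower quality evidence.

```python
from dataclasses import dataclass

@dataclass
class StrengthJudgment:
    """A panel's judgments on the four factors (hypothetical structure)."""
    evidence_quality: str    # "high", "moderate", "low", or "very low"
    balance_uncertain: bool  # uncertain balance of benefits and harms
    values_vary: bool        # large variation in patient values and preferences
    cost_concern: bool       # resource use weighs against the intervention

    def concerns(self):
        """List the factors that point toward a weak recommendation."""
        found = []
        if self.evidence_quality in ("low", "very low"):
            found.append(f"{self.evidence_quality}-quality evidence")
        if self.balance_uncertain:
            found.append("uncertain balance of benefits and harms")
        if self.values_vary:
            found.append("variable patient values and preferences")
        if self.cost_concern:
            found.append("resource use concerns")
        return found

def describe(direction, strength, judgment):
    """Format a recommendation chosen by the panel with its recorded concerns."""
    reasons = judgment.concerns()
    note = "; ".join(reasons) if reasons else "no major concerns recorded"
    return f"{strength.capitalize()} recommendation {direction} using the intervention ({note})"

# Hypothetical judgments for medical expulsive therapy
met = StrengthJudgment("low", balance_uncertain=True, values_vary=False, cost_concern=False)
print(describe("for", "weak", met))
```

The strength and direction remain inputs supplied by the panel; the code only makes the reasoning behind them visible to the reader of the guideline.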
How can GRADE help urologists?
Having clear, concise, and transparent guidance on urological issues may be of benefit to any busy clinician. A physician treating a patient with a small ureteral calculus who is uncertain about the use of MET might consult the EAU/AUA Nephrolithiasis Panel’s Clinical Guideline on Management of Ureteral Calculi. There the physician will find this “Option”: “A patient who has a newly diagnosed ureteral stone <10 mm and whose symptoms are controlled may be offered an appropriate medical therapy to facilitate stone passage during the observation period” and this “Standard”: “Patients should be counseled on the attendant risks of MET including associated drug side effects and should be informed that it is administered for an “off label” use.” One interpretation of these recommendations might be that the disadvantages of these medications seem to outweigh the benefits, as risk discussion is given a standard rank while usage is merely an option. Yet there is no statement that this therapy is not recommended. What is missing is a clear way to assess the potential effect of the therapy, the quality of the evidence on which it is based, and the decisional factors that result in the recommendation being an option only. A separate assessment of the quality of evidence and an explanation of the strength of recommendation might guide the clinician in a more practical manner. For example, such an assessment performed utilizing the GRADE methodology would reference an evidence profile (Table 2) and might provide the following statement: In patients being observed for spontaneous passage of a ureteral calculus, a weak recommendation can be made for medical expulsive therapy in facilitating stone passage and reducing analgesic needs while limiting exposure to adverse events, based on low-quality evidence for passage rates and analgesic needs, and very low-quality evidence for adverse events.
How can GRADE shape our future?
The GRADE system is well designed to meet the challenges that are facing the urological community in the current era of healthcare. Comparative effectiveness research (CER) has become a distinct entity which will be relied on to inform the spectrum from clinical decision making to national health policy. A well-defined, reproducible, and consistent system that focuses on patient-important outcomes but also factors in patient values and system and societal costs is ideally suited to address these challenges and provides real guidance at all levels. Quality healthcare is becoming an integral component in many systems and should be defined as a measurable outcome. The GRADE system can help identify how to measure quality for policy makers, even within the most complex decisions. For example, treatments for localized prostate cancer cannot be boiled down to a single recommendation for every patient. How can quality be measured when many complex options exist? The GRADE system can help make explicit how patient values and preferences influence the decision-making process in this disease. A recommendation for a balanced discussion of options and outcomes in this setting also identifies that process as a quality indicator. The GRADE system also provides the option to incorporate cost into the recommendations. In the United States, end-of-life care is becoming a large part of healthcare spending. With many new and expensive life-extending medications on the horizon for prostate cancer, for example, urologists will critically need a system with the ability to assess resource allocation. A structured and explicit guideline system that can be used in our globalized world would lend clarity and provide guidance. The adoption of GRADE by so many current organizations suggests it may be such a common framework.
This article relies heavily on the landmark series published in the British Medical Journal by the GRADE working group.
Conflict of interest
Dr. Dahm is a member of the GRADE working group.