Background

Prognosis research examines the progression of a health condition over time in order to identify risk and protective factors that alter the likelihood of a future event during the course of that condition [1]. The evidence about the progression of a health condition derived from prognosis research is crucial for making informed decisions: it can help identify individuals at risk of poor outcomes, facilitate early intervention, and guide the development of preventive interventions that target modifiable prognostic factors [2]. Synthesizing results from prognosis research for these potential uses is, however, often challenging, as primary study results are frequently inconsistent and difficult to interpret [3]. Differences between individual studies may be due to, for example, small sample sizes, adjustment for different variables in the analyses, or consideration of different subsets within the same population [4-7]. Standards have been proposed to guide the design, procedures, analysis and reporting of this research in an attempt to minimize variability and improve the quality of the primary studies [4-7]. However, these standards are often not followed, causing confusion about the prognostic value of individual factors [8] and limiting research application [9]. Given the important implications of this research for improving health outcomes, guidelines are needed on how to judge the evidence from this widely varied research. This manuscript presents a system that can help reviewers judge the evidence derived from prognostic factor research [10]. We suggest that the Grading of Recommendations Assessment, Development and Evaluation (GRADE) framework [11] can be adapted for the assessment of evidence derived from prognostic factor research.

GRADE: a framework to guide the judgment about the quality of evidence in a systematic review

GRADE was first developed to provide methodological guidance in reviewing intervention research; specifically, how to rate the quality associated with estimated effects of an intervention on a specific outcome, and how to grade the strength of recommendations regarding the intervention as part of a guideline development process [12]. When making judgments about the quality of evidence, the GRADE approach considers five factors that can decrease our confidence in estimates of effects: (1) study design and limitations in study design, (2) inconsistency of results across studies, (3) indirectness of the evidence, (4) imprecision and (5) publication bias; and three factors that can increase our confidence in estimates of effects from observational studies: (1) large estimates of treatment effect, (2) a dose-response gradient and (3) plausible confounding that would increase confidence in an estimate. The GRADE framework, widely used by researchers working on reviews and guidelines, and by groups providing recommendations for health care professionals such as NICE [13], has also been formally adapted for grading the quality of evidence and strength of recommendations for diagnostic research [14]. Recently, Goldsmith and colleagues have also proposed using GRADE as a framework for prognostic studies [15]. They used the framework and definitions of GRADE to rate the quality of evidence for prognostic studies evaluating cold hyperalgesia as a prognostic factor in whiplash-associated disorders. They described the general framework that was followed; however, they did not address all of GRADE's factors or provide enough information to allow replication of their GRADEing process for prognostic studies. The following sections outline specific aspects of the modified GRADE framework (for example, how risk of bias in primary prognosis studies was assessed), the process used to modify the GRADE framework, and suggestions for how to apply the new adapted framework to systematic reviews where meta-analyses are lacking.

Applying the GRADE framework to prognosis research

Consistent with the GRADE principles, when synthesizing the evidence from prognosis research it is important to estimate the effect of a factor on an outcome and also to report the level of confidence in these findings. Therefore, in a systematic review of prognostic factors it is recommended to assess the quality of evidence for each outcome of interest across studies. Our team has applied the GRADE framework and concepts to classify the quality of a body of evidence from prognosis studies into four quality categories (high, moderate, low, and very low), in line with the traditional GRADE framework (Table 1). Along with our descriptions of how the GRADE framework can be adapted and used in this kind of research, we provide examples from a recent systematic review of prognostic studies conducted in the field of recurrent pain in children and adolescents (Table 2) (Huguet A, et al., in preparation). Since it is sometimes not appropriate or possible to conduct a meta-analysis of prognostic evidence, because of diversity within the studies included in the review, poor methodological quality, or both, we also describe variations on how to implement the GRADE systematic review framework when conducting a narrative synthesis. We are not presenting a formalized guideline. Our recommendations are based on discussions among the co-authors and our experience conducting systematic reviews of prognostic research on pain.

Table 1 Definitions of the four quality categories according to the original Grading of Recommendations Assessment, Development and Evaluation (GRADE) [16], applicable to the modified GRADE
Table 2 Summary of a systematic review of prognostic studies in the field of recurrent pain in children and adolescents

GRADE framework for prognosis

We think most of the factors taken into consideration by the GRADE framework to rate the quality of evidence from intervention research are conceptually applicable when applying GRADE to judge the quality of evidence from prognostic research. However, our proposed approach to determining how much each of these factors influences the quality of evidence differs from the approach originally suggested for intervention research. Table 3 compares the factors that may lead to rating down or up the quality of evidence from intervention research with the factors that may lead to rating down or up the quality of evidence from prognosis research. Notable differences when rating the quality of evidence from prognostic research include the following. (1) Study design is not considered when judging the quality of prognostic evidence, since longitudinal research designs are the only acceptable designs for providing prognostic evidence (design characteristics are instead considered in the risk of bias assessment). (2) Using the GRADE framework to evaluate the quality of evidence from prognosis research should begin with the phase of investigation. (3) Plausible confounding does not need to be considered as an additional factor to rate up the quality of evidence. Two main reasons led us to this decision. First, the potential effects of confounding are not interpreted in the same direction in intervention and prognosis research. In intervention research, the assumption is that lack of control for confounding can inflate the reported effect sizes, so that the intervention appears more effective than it actually is when the evidence comes from studies that do not adequately control for confounding. This assumption is not made in prognostic research, where it is often unclear how confounders alter effect sizes. Second, plausible confounding is indirectly considered when assessing risk of bias. When assessing risk of bias, we are not evaluating how confounding will influence the strength of the effect; rather, we are evaluating the internal validity of the studies. As we describe below in more detail, when considering the potential impact of study limitations on the quality of evidence from prognostic research, reviewers should consider downgrading the quality of evidence for methodological limitations when analyses are not adequately adjusted for confounders (which may be responsible for spurious or attenuated relations between the factor and the outcome).

Table 3 Factors that may increase and decrease the quality level of evidence

Next we describe the assessment approach for considering the potential effect of each of these factors on the quality of evidence.

Phase of investigation

When evaluating the overall quality of evidence, we suggest that researchers consider the phase of investigation. A high level of evidence for prognosis is derived from a cohort study design that seeks to generate understanding of the underlying processes for the prognosis of a health condition, called a phase 3 explanatory study, or a cohort study design that seeks to confirm independent associations between the prognostic factor and the outcome, called a phase 2 explanatory study [4, 7]. Prospective or retrospective cohort studies that test a fully developed hypothesis and conceptual framework without serious study limitations, and confirmatory studies without serious limitations, constitute high-quality evidence on prognosis. These studies should be prioritized as primary studies. In emerging areas of research, primary studies that meet this criterion may not be available. In that case, predictive modeling studies or explanatory studies conducted in the earlier, hypothesis-generating phase of investigation (phase 1 explanatory studies) may be included. These studies should be judged as providing weaker evidence. Therefore, we propose that the starting point for the quality level of the evidence should be based on the phase of investigation (see Table 4 and the sketch below). Table 5 illustrates an example of the effect of phase of investigation as a starting point for judging the quality of evidence.
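For illustration only, a minimal sketch in Python of this starting point. The mapping for phase 2 and 3 explanatory studies ('high') follows the description above; the starting level assigned to phase 1 and predictive modeling studies is an assumed placeholder, and reviewers should consult Table 4 for the actual guide:

```python
# Hypothetical starting quality level based on phase of investigation.
# Phase 2/3 explanatory studies constitute high-quality evidence (see text);
# the 'moderate' level for earlier-phase studies is an assumption for
# illustration, not the authors' Table 4 guide.
def starting_quality(phase: str) -> str:
    if phase in ("phase 2 explanatory", "phase 3 explanatory"):
        return "high"
    if phase in ("phase 1 explanatory", "predictive modeling"):
        return "moderate"  # assumed illustrative starting level
    raise ValueError(f"unknown phase of investigation: {phase!r}")

print(starting_quality("phase 3 explanatory"))  # -> "high"
```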

Table 4 Guide to judge the quality of evidence for prognosis
Table 5 The effect of phase of investigation when judging quality of evidence

Following phase of investigation considerations, the evidence can be upgraded or downgraded according to the following additional criteria.

Reasons for downgrading the quality of evidence

Study limitations

The findings derived from individual prognostic studies are often limited by their methodological shortcomings. There are several tools available for assessing methodological limitations [18-20]. We recommend using the Quality in Prognosis Studies (QUIPS) tool [19, 20], which rates individual studies according to the potential risk of bias associated with six domains: (1) study participation, (2) study attrition, (3) prognostic factor measurement, (4) outcome measurement, (5) confounding measurement and account, and (6) analysis. This tool, designed for use in prognostic factor studies to comprehensively assess risk of bias based on epidemiological principles, has demonstrated acceptable reliability [20]. The level of risk of bias associated with each domain can be rated as 'low', 'moderate' or 'high' based on the responses that reviewers give to each item.

We suggest that, when assessing the risk of bias of a prognostic factor across studies for a specific outcome, reviewers should rate the evidence as having: (1) no serious limitations when most evidence is from studies at low risk of bias for most of the bias domains; (2) serious limitations when most evidence is from studies at moderate or unclear risk of bias for most of the bias domains; or (3) very serious limitations when most information is from studies at high risk of bias with respect to almost all of the domains.
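As a rough illustration, a minimal sketch in Python of how these across-studies judgments might be operationalized. The domain keys and the majority-based thresholds are our own assumptions for the sketch, not part of QUIPS or the original GRADE guidance:

```python
# Aggregate per-study QUIPS domain ratings into an across-studies judgment:
# "no serious" / "serious" / "very serious" limitations, following the
# majority-based wording in the text ("most evidence", "most domains").
from collections import Counter

def overall_limitations(studies):
    """studies: list of dicts mapping QUIPS domains to 'low',
    'moderate', 'unclear' or 'high' risk of bias."""
    def dominant(ratings):
        # Risk level assigned to most of a study's domains.
        return Counter(ratings.values()).most_common(1)[0][0]

    study_levels = [dominant(s) for s in studies]
    counts = Counter(study_levels)
    if counts.get("high", 0) > len(studies) / 2:
        return "very serious limitations"
    if counts.get("moderate", 0) + counts.get("unclear", 0) > len(studies) / 2:
        return "serious limitations"
    return "no serious limitations"

example = [  # hypothetical ratings for two studies
    {"participation": "low", "attrition": "low", "factor": "low",
     "outcome": "moderate", "confounding": "low", "analysis": "low"},
    {"participation": "low", "attrition": "moderate", "factor": "low",
     "outcome": "low", "confounding": "low", "analysis": "low"},
]
print(overall_limitations(example))  # -> "no serious limitations"
```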

Table 6 illustrates an example. As with GRADE for intervention research, study limitations are outcome and factor specific [21]. Consequently, a single study may carry a higher risk of bias for one outcome or prognostic factor than for another. We advise conducting subgroup analyses to explore the impact of studies at high risk of bias on specific domains and on the overall review (that is, those with serious or very serious limitations). Sensitivity analyses may be conducted to restrict the synthesis to studies with lower risk of bias. These steps will show how the study limitations influence the size of the effect. If studies with 'serious limitations' or 'very serious limitations' are included in the body of evidence to be evaluated, we suggest providing specific justification in a footnote to the relevant tables.

Table 6 The effect of considering study limitations when judging quality of evidence

Inconsistency

Inconsistency occurs when there is unexplained heterogeneity or variability in results across studies. When this happens, the quality of evidence decreases. Different approaches to assess inconsistency can be applied if the systematic review incorporates narrative synthesis or meta-analysis.

To evaluate whether inconsistency exists, we recommend that reviewers employing meta-analysis base their decisions on their judgment of whether a clinically meaningful difference exists between the point estimates and confidence intervals of the primary studies in the context of the review, as well as on statistical parameters derived from their meta-analyses. Reviewers may consider downgrading the quality of evidence for inconsistency when they observe the following statistical parameters, provided the differences are clinically meaningful: (1) estimates of the effect of the prognostic factor on the outcome vary across studies, with point estimates on either side of the line of no effect and confidence intervals showing minimal or no overlap; (2) the statistical test for heterogeneity, which tests the null hypothesis that all studies in a meta-analysis share the same underlying magnitude of effect, shows a low P value; and (3) the I² statistic, which quantifies the proportion of the variation in point estimates due to true differences in effect size from one study to the next, is substantial (that is, 50% or greater [25]).
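For concreteness, a minimal sketch (not part of the original guidance) of the two statistics mentioned in points (2) and (3), computed under a standard inverse-variance fixed-effect model with hypothetical study data:

```python
# Cochran's Q and Higgins' I^2 from per-study effect estimates and
# standard errors, under an inverse-variance fixed-effect model.
import numpy as np
from scipy.stats import chi2

def heterogeneity(estimates, std_errors):
    """Return Cochran's Q, its P value, and I^2 (as a percentage)."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    w = 1.0 / se**2                        # inverse-variance weights
    pooled = np.sum(w * est) / np.sum(w)   # fixed-effect pooled estimate
    q = np.sum(w * (est - pooled) ** 2)    # Cochran's Q
    df = len(est) - 1
    p_value = chi2.sf(q, df)               # low P suggests heterogeneity
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, p_value, i2

# Hypothetical log odds ratios from four primary studies:
q, p, i2 = heterogeneity([0.40, 0.10, 0.95, -0.05], [0.15, 0.20, 0.18, 0.25])
print(f"Q = {q:.2f}, P = {p:.4f}, I^2 = {i2:.0f}%")  # I^2 >= 50%: consider downgrading
```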

Potential inconsistency related to differences in the magnitude of effects, as described above, should ideally be explored with a priori defined subgroup analyses. Differences across studies in the population, the duration of follow-up, the outcome, the prognostic factor, or the study methods may explain the variation; in this case, we propose estimating separate effects accordingly. If, after running separate subgroup meta-analyses, this hypothesis is supported by the data, we suggest that reviewers consider presenting separate pooled estimates instead of estimating an overall combined effect.

If a meta-analysis is not conducted, we recommend that reviewers consider downgrading for inconsistency when estimates of the prognostic factor's association with the outcome vary in direction (for example, some effects appear protective whereas others show risk) and the confidence intervals show no or minimal overlap. See an example in Table 7.
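A minimal sketch, under our own simplifying assumptions, of this narrative-synthesis check; a real review would apply it per outcome and temper it with clinical judgment:

```python
# Flag inconsistency when estimates fall on both sides of the null and
# the confidence intervals overlap minimally or not at all.
def inconsistent(results, null=0.0):
    """results: list of (estimate, ci_low, ci_high) tuples on a scale
    where `null` marks no effect (0 for log ratios or differences)."""
    signs = {est > null for est, _, _ in results}
    both_directions = len(signs) > 1
    # Shared overlap of all CIs: highest lower bound vs. lowest upper bound.
    max_low = max(lo for _, lo, _ in results)
    min_high = min(hi for _, _, hi in results)
    no_shared_overlap = max_low >= min_high
    return both_directions and no_shared_overlap

# Hypothetical log odds ratios: one study shows risk, the other protection.
studies = [(0.60, 0.25, 0.95), (-0.40, -0.70, -0.10)]
print(inconsistent(studies))  # -> True: consider downgrading
```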

Table 7 The effect of considering inconsistency when judging quality of evidence

Inconsistency cannot be assessed when only a single study within the existing body of literature has estimated the effect. In these cases, this criterion may be considered 'not applicable'. However, we still recommend that reviewers downgrade the quality of evidence, since a single study indicates that the literature in the area is not well established. If observed inconsistency is unexplained, reviewers should decide whether the inconsistency is serious or very serious and justify this decision in a footnote.

Indirectness

Indirectness exists when the participant population, prognostic factor(s) and/or outcomes considered by researchers in the primary studies do not fully represent the review question defined in the systematic review. The judged quality of evidence decreases because the results derived from the primary studies are less generalizable for the purpose of the systematic review.

Regardless of whether reviewers perform a meta-analysis, downgrading the quality of evidence for indirectness is appropriate when: (1) the final sample represents only a subset of the population of interest (an example of indirectness in population is displayed in Table 8); (2) the complete breadth of the prognostic factor being considered in the review question is not well represented in the available studies (an example is illustrated in Table 8); or (3) the outcome being considered in the review question is not broadly represented (an example is displayed in Table 8).

Table 8 The effect of considering indirectness when judging quality of evidence

Downgrading the quality of the evidence with respect to indirectness depends on how extreme the differences are and how much these differences can influence the magnitude of effect. Reviewers should make this judgment based on the purpose of their review.

Imprecision

Random error or imprecision exists when the evidence is uncertain, leading to different interpretations about the relationship between the prognostic factor and its associated risk or protective value.

To judge whether the results of a meta-analysis have sufficient precision, a reviewer should first consider whether the number of participants included in the meta-analysis is adequate, using a sample size estimation similar to that for a single study but accounting for between-study heterogeneity (see Bull [33] for a discussion of the need for an adequate sample size in a meta-analysis). Second, if the number of participants included in the meta-analysis is adequate, reviewers should, based on their best judgment, consider the results precise when the confidence interval around the estimated effect is not so wide that it includes values implying both that the prognostic factor is protective and that it increases risk.
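By way of illustration, a minimal sketch (our own, with hypothetical numbers) of this check on the log odds ratio scale; a fixed-effect inverse-variance pool is assumed for simplicity:

```python
# Pool study estimates with inverse-variance weights and check whether the
# 95% CI spans both protective (< 0) and harmful (> 0) log odds ratios.
import math

def pooled_ci(estimates, std_errors):
    """Fixed-effect pooled estimate and 95% CI on the log scale."""
    weights = [1.0 / se**2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    se_pooled = math.sqrt(1.0 / sum(weights))
    return pooled, pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled

# Hypothetical log odds ratios and standard errors from three studies:
est, lo, hi = pooled_ci([0.35, 0.20, 0.50], [0.10, 0.12, 0.15])
spans_null = lo < 0.0 < hi   # CI includes both protection and risk
print(f"pooled log-OR = {est:.2f} (95% CI {lo:.2f} to {hi:.2f}); "
      f"spans the null: {spans_null}")
```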

Evaluating imprecision when a meta-analysis is not possible is challenging. The best approach is for reviewers to judge the overall precision based on the precision of results within each study, while taking into account the number of studies and participants involved. If the majority of studies included in the review are precise, regardless of the number of studies and sample size, reviewers should not downgrade the quality of evidence for imprecision.

To determine whether the primary studies included in the review are underpowered, reviewers should take into account the sample size that each primary study used. To do this, they should explore whether the authors of the primary studies provided any rationale for the sample size. Authors of prognostic studies should estimate an effect size and select the desired power beforehand, and then calculate the sample size needed to achieve that power and detect the specified effect size [5, 33]. If authors do not provide such a rationale, reviewers can consider the sample size appropriate for studies using dichotomous outcomes by the 'rule of thumb' that there should be at least 10 outcome events for each potential prognostic variable considered in the analysis [4, 34]. If insufficient information is available to determine the appropriateness of the sample size, or if studies use continuous outcomes, the sample size can be considered appropriate when at least 100 cases reached the endpoint [35].

If the sample sizes are large enough, reviewers should then evaluate the width of the confidence intervals. If most of the confidence intervals reported in a study include both no effect and appreciable risk and protective values, the evidence derived from that study is imprecise. At that stage, to evaluate the overall imprecision for the explored prognostic factor association, we recommend that reviewers also consider the number of studies and the number of participants across studies, because imprecision is likely to be greater with fewer studies and/or participants. Consequently, we recommend downgrading the quality of evidence if the evidence is generated by a few studies involving a small number of participants and most of the studies provide imprecise results. See an example in Table 9.
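A minimal sketch of the two sample-size rules of thumb cited above; the function name and interface are our own illustration:

```python
# Sample-size adequacy checks for primary prognostic studies:
# at least 10 outcome events per candidate prognostic variable for
# dichotomous outcomes [4, 34], or at least 100 cases reaching the
# endpoint when that rule cannot be applied [35].
def sample_size_adequate(n_events, n_factors=None, dichotomous=True):
    if dichotomous and n_factors is not None:
        return n_events >= 10 * n_factors   # events-per-variable rule
    return n_events >= 100                  # fallback / continuous outcomes

print(sample_size_adequate(n_events=84, n_factors=6))        # -> True (84 >= 60)
print(sample_size_adequate(n_events=84, dichotomous=False))  # -> False (< 100)
```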

Table 9 The effect of considering imprecision when judging quality of evidence

Publication bias

Publication bias is a particularly important factor for prognostic evidence because investigators often fail to report relationships that show no effect between potential prognostic factors and outcomes. Publication bias arises when the published evidence represents only a portion of the studies conducted on a topic [36]. The current lack of a register for prognostic research studies prevents reviewers from making an informed judgment about whether publication bias is a potential problem. Consequently, a prudent default position at this moment is to assume that prognosis research is seriously affected by publication bias until there is evidence to the contrary [6]. Reviewers should assume that publication bias exists for all factors, except when a given prognostic factor has been investigated in a large number of cohort studies. Ideally, most of these studies should have been designed to purposefully confirm the hypothesized independent effect of the factor on the outcome (phase 2 studies) or to test a conceptual model explaining its underlying mechanisms (phase 3 studies). However, since phase of investigation is already taken into account as a factor that can downgrade the overall quality of evidence, we do not recommend downgrading again for publication bias based solely on phase of investigation. In these cases, reviewers conducting a systematic review with or without meta-analysis may judge that publication bias is less likely (see Table 10).

Table 10 The effect of considering publication bias when judging quality of evidence

Reasons for rating up the quality of evidence

Moderate or large effect size

Multiple prognostic factors often contribute to the prognosis of health conditions. Therefore, finding a moderate or large effect size is one of the key criteria for rating up the quality of evidence. A moderate or large effect size increases the likelihood that a relationship between the prognostic factor and the outcome does in fact exist.

Reviewers should rate up the quality of evidence when they find a moderate or large pooled effect in the meta-analysis. 'Rules of thumb' have been proposed for judging an effect moderate or large (for example, a standardized mean difference of around 0.5 for a moderate effect or around 0.8 or larger for a large effect [40]; an odds ratio of around 2.5 for a moderate effect or 4.25 or greater for a large effect [41, 42]). However, because these are arbitrary guidelines, reviewers should make and justify a sensible decision that takes the study context into account (for example, background risk and unit of measurement).
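A minimal sketch of these rules of thumb in code form; treating a protective odds ratio symmetrically via its reciprocal is our own assumption, and the thresholds remain the arbitrary guides cited above:

```python
# Classify an effect as small / moderate / large using the cited rules of
# thumb: SMD ~0.5 / ~0.8 [40]; OR ~2.5 / ~4.25 [41, 42]. Context should
# always be allowed to override these cutoffs.
def effect_magnitude(value, metric="smd"):
    # For odds ratios, a protective OR of 0.2 is treated like an OR of 5
    # (its reciprocal) so that direction does not affect the magnitude.
    v = abs(value) if metric == "smd" else max(value, 1.0 / value)
    moderate, large = (0.5, 0.8) if metric == "smd" else (2.5, 4.25)
    if v >= large:
        return "large"
    if v >= moderate:
        return "moderate"
    return "small"

print(effect_magnitude(0.62, metric="smd"))  # -> "moderate"
print(effect_magnitude(0.20, metric="or"))   # 1/0.20 = 5.0 -> "large"
```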

If it is not possible to conduct a meta-analysis, reviewers should rate up the quality of evidence when they find moderate or large similar effects reported by most of the primary studies (see an example in Table 11).

Table 11 The effect of considering moderate or large effect size when judging quality of evidence

Exposure-response gradient

An exposure-response gradient exists when elevated levels of the prognostic factor (for example, a larger amount, longer duration, or higher intensity) are associated with a larger effect size than lower levels of the factor. The presence of such a gradient increases confidence in the finding that the factor is associated with an increased risk or protective value, and therefore raises our rating of the quality of evidence.

When conducting systematic reviews with meta-analysis, reviewers should observe whether an exposure-response gradient is present across subgroup analyses for factors measured at different doses.

If a meta-analysis or subgroup analyses are not conducted, or only one meta-analysis is conducted for the relationship between a prognostic factor and an outcome, reviewers should observe whether a possible exposure-response gradient consistently exists within and between primary studies. Using the same measures to evaluate prognostic factors and outcomes across studies is required to appropriately evaluate the possible existence of this gradient between studies. Table 12 displays an example.
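A minimal sketch of this gradient check under our own simplifying assumption of strictly increasing effects across exposure-ordered subgroups; real gradients may plateau or only need a broadly monotonic trend:

```python
# Check whether effect sizes increase monotonically across subgroups
# ordered from lowest to highest level of the prognostic factor.
def monotonic_gradient(subgroup_effects):
    """subgroup_effects: effect estimates ordered by increasing exposure."""
    return all(a < b for a, b in zip(subgroup_effects, subgroup_effects[1:]))

# Hypothetical pooled odds ratios for low / medium / high exposure subgroups:
print(monotonic_gradient([1.3, 1.9, 2.8]))  # -> True: gradient supports rating up
```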

Table 12 The effect of considering exposure gradient when judging quality of evidence

The judgments derived from applying the GRADE framework to evidence from prognostic studies can be presented in the proposed summary of findings tables (see Tables 13 and 14).

Table 13 Example of an adapted Grading of Recommendations Assessment, Development and Evaluation (GRADE) table for systematic reviews with meta-analysis of prognostic studies
Table 14 Example of an adapted Grading of Recommendations Assessment, Development and Evaluation (GRADE) table for narrative systematic reviews of prognostic studies (filled in with examples of our own review illustrated in the boxes throughout this manuscript)

Conclusions

This article is a first attempt to outline how GRADE may be adapted to assess the quality of evidence from prognostic research studies in a systematic review of the literature. To date, a formal system to guide researchers in assessing the evidence from prognostic studies has been lacking. Our adaptation is a first step in developing a systematic approach to evaluating the quality of evidence from prognostic research. We encourage further development and testing of these recommendations for GRADEing the quality of evidence when synthesizing findings from prognostic research studies. For instance, reviewers need further guidance on when each of these GRADE factors should cause the evidence to be down- or upgraded by one versus two levels. At this stage, we leave it to the judgment of review teams to decide how much these factors affect the overall quality of evidence. Empirical research is also needed to explore our hypothesis that risk of bias detrimentally affects estimates of the factor-outcome relationship, and to determine the strength of this influence.