Introduction

With increasing frequency, multiple objective measures of normal or pathologic biological processes, as well as measures of social, psychological, behavioral, and demographic features, are being associated with important patient outcomes. Some of these measures, singly or in combination as a prediction model, can be clinically useful. The plethora of potential new prognostic tests and prediction models, like treatments and diagnostic tests, is an appropriate topic for systematic review. Such reviews can summarize the available evidence and guide further research on the usefulness of a test. The questions most salient for clinical practice, and hence for a systematic review, concern the accuracy of predictions derived from a test or prediction model and how the results affect patient management and outcomes.

This paper is meant to complement the Evidence-based Practice Center Methods Guide for Comparative Effectiveness Reviews, and is not a comprehensive or detailed review of methods that could be used to conduct a systematic review of a prognostic test. Generally speaking, the steps for reviewing evidence for prognostic tests are similar to those used in the review of a diagnostic test and discussed in other papers in this Medical Test Methods Guide. These steps include: 1) using the population, intervention, comparator, outcomes, timing and setting (PICOTS) typology and an analytic framework to develop the topic and focus the review on the most important key questions, 2) conducting a thorough literature search, 3) assessing the quality of reported studies, 4) extracting and summarizing various types of statistics from clinical trials and observational studies, and 5) meta-analyzing study results. However, important differences between diagnostic and prognostic tests highlighted here should be considered when planning and conducting a review.

Step 1: Developing the Review Topic and Framework

Developing the review topic, including the framework for thinking about the relationship between the test and patient outcomes, as well as the key questions, can be fundamentally different for diagnostic and prognostic tests. A diagnostic test is used to help determine whether a patient has a disease at the time the test is performed. Evaluations of diagnostic tests often use a categorical reference test (gold standard) to determine the true presence or absence of the disease. Typically, patients are classified as diagnostic test positive or negative to estimate the test’s accuracy as sensitivity (true positive fraction) and specificity (true negative fraction). In contrast, a prognostic test is used to predict a patient’s likelihood of developing a disease or experiencing a medical event. Therefore, the “reference test” for a prognostic test is the observed proportion of patients who develop what is being predicted.

For practical purposes, it is often useful to group the results of a prognostic test into parsimonious categories corresponding to the implications for decision making. For example, if the actions that might follow a prognostic test are no further evaluation or treatment of “low” risk cases, initiation of treatment or prevention in “high” risk cases, or further tests or monitoring for “intermediate” risk cases, then it would be useful to structure the review according to these prognostic test categories (low, intermediate, and high risk) and clearly define each group, including its outcome probabilities. If a decision model is used as the framework for a systematic review and meta-analysis of a prognostic test, the precision and accuracy of estimates of outcome probabilities within these different prognostic groups may be the primary focus. These considerations, among others, are summarized in Table 1, which provides a general PICOTS framework for systematically reviewing prognostic tests.
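The grouping described above can be sketched in a few lines of Python. This is only an illustration: the cut-points of 0.10 and 0.20, the function names, and the three-group scheme are assumptions chosen for the example, and a real review would take its thresholds from guidelines or decision analyses.

```python
from collections import Counter

# Hypothetical thresholds on the predicted outcome probability (assumptions).
LOW_MAX = 0.10   # below this: "low" risk, no further evaluation or treatment
HIGH_MIN = 0.20  # above this: "high" risk, start treatment or prevention


def risk_group(predicted_probability):
    """Map one predicted probability to a prognostic group."""
    if not 0.0 <= predicted_probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if predicted_probability < LOW_MAX:
        return "low"
    if predicted_probability > HIGH_MIN:
        return "high"
    return "intermediate"


def observed_risk_by_group(predictions, outcomes):
    """Return {group: (number of subjects, observed event proportion)}."""
    n = Counter()
    events = Counter()
    for p, y in zip(predictions, outcomes):
        group = risk_group(p)
        n[group] += 1
        events[group] += y
    return {group: (n[group], events[group] / n[group]) for group in n}
```

Reporting the group size next to the observed proportion keeps the precision of each estimate visible, which matters when the groups feed a decision model.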

Table 1 General PICOTS Typology for Review of Prognostic Tests

In some contexts, it may be informative to categorize subjects as those who did or did not experience the predicted outcome within a specified time interval and then look back to categorize the results of the prognostic test. Much as for a diagnostic test, a systematic review of a prognostic test could then assess the accuracy of the prognostic test by calculating the sensitivity, specificity, and predictive values for that point in time. An essential factor to consider in a review is which follow-up times are especially informative to patients, clinicians, or policymakers.
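As a rough illustration of this look-back calculation, the sketch below computes sensitivity and specificity at a chosen follow-up time. The cohort is invented, the function name is hypothetical, and the code assumes that every subject was followed at least to the chosen time, so it ignores censoring.

```python
def sensitivity_specificity(test_positive, event_time, horizon):
    """Accuracy of a dichotomized prognostic test at one follow-up time.

    test_positive: True if the baseline test result was "positive"
    event_time: time of the outcome event, or None if no event was observed
    horizon: follow-up time of interest, in the same units as event_time
    Assumes every subject was followed at least to the horizon (no censoring).
    """
    tp = fn = fp = tn = 0
    for positive, t in zip(test_positive, event_time):
        had_event = t is not None and t <= horizon
        if had_event and positive:
            tp += 1
        elif had_event:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one event and one non-event")
    return tp / (tp + fn), tn / (fp + tn)


# Hypothetical cohort of eight subjects, times in years.
test_positive = [True, True, True, False, False, False, False, False]
event_time = [1, 2, None, 4, None, None, None, None]

for horizon in (3, 5):
    sens, spec = sensitivity_specificity(test_positive, event_time, horizon)
    print(f"{horizon} years: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this made-up cohort the estimates differ between the two horizons, which is why the choice of follow-up time needs to be stated explicitly.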

A distinct category of prognostic tests comprises those used to predict beneficial or adverse responses to a treatment; these are commonly known as predictive tests. Evidence about the value of a predictive test typically is presented as separate estimates of the treatment effect in subgroups defined by the predictive test, along with a statistical test for interaction. Systematic reviews of predictive test/treatment interactions are not specifically discussed in this paper. Interested readers are referred to publications on this topic1.

Step 2: Searching for Studies

When developing the literature search strategy, it is important to recognize that studies can relate to one or more of the following categories2.

  1. Proof of concept: Is the test result associated with a clinically important outcome?
  2. Prospective clinical validation: How accurately does the test predict outcomes in different cohorts of patients, clinical practices and prognostic groups?
  3. Incremental predictive value: How much does the new prognostic test change predicted probabilities and increase the discrimination of patients who did or did not experience the outcome of interest within a specific time period?
  4. Clinical utility: Does the new prognostic assessment change predicted probabilities enough to reclassify many patients into different prognostic groups that would be managed differently?
  5. Clinical outcomes: Would use of the prognostic test improve patient outcomes?
  6. Cost effectiveness: Do the improvements in patient outcomes justify the additional costs of testing and subsequent medical care?

Each phase of development is focused on different types of questions, research designs, and statistical methods, although a single study might address several of these questions. Large cohort studies and secondary analyses of clinical trials may be the most readily available evidence to answer the first four types of questions. For the latter two types of questions, randomized controlled trials of prognostic tests are preferred. However, they can be costly and time consuming, and thus are rarely done by stakeholders3. Before embarking on a review focused on the last two types of key questions, reviewers need to think about what they would do, if anything, in the absence of randomized controlled studies of the effect of a prognostic test on patient outcomes. One option is to use a decision model to frame the review and focus on providing the best estimates of outcome probabilities.

Reliable and validated methods to exhaustively search the literature for information about prognostic tests have not been established, and the best bibliographic indexes and search strategies have yet to be determined. Some search strategies have been based on variations of key words in titles or abstracts and index terms that appear in publications meeting the study selection criteria4. Others have used search terms such as “cohort,” “incidence,” “mortality,” “follow-up studies,” “course,” or the word roots “prognos-” and “predict-” to identify relevant studies5. The range of terms used to describe the prognostic test(s) and the clinical condition or medical event to be predicted should be used as well. The “find similar” or “related article” functions available in some indexes may be helpful. A manual search of reference lists will need to be done. If a prognostic test has been submitted for review by regulatory agencies such as the Food and Drug Administration, the records that are available for public review should be searched. The website of the test producer could also provide useful information.

In contrast to diagnostic tests, many prognostic tests are incorporated into multivariable regression models or algorithms for prediction. Many reports in the literature only provide support for an independent association of a particular variable with the patient outcome that might be useful as a prognostic test6,7. The converse, that a test variable did not add significantly to a multivariable regression model, is difficult to find, particularly via an electronic search or abstract reviews when the focus is often on positive findings8. Failing to uncover evidence of the lack of a strong association, and hence of predictive value, can bias a review. Therefore, if a review is going to focus on proof-of-concept questions, all studies that included the test variable should be sought out, reviewed, and discussed, even when the study merely mentions that the outcome was not independently related to the potential prognostic test or a component of a multivariable prediction model9.

Whenever a systematic review focuses on key questions about prognostic groups that are defined by predicted outcome probabilities, reviewers should search for decision analyses, guidelines, or expert opinions that help support the outcome probability thresholds used to define clinically meaningful prognostic groups, that is, groups that would be treated differently in practice because of their predicted outcome. Ideally, randomized controlled clinical trials of medical interventions in patients selected based on the prognostic test would help establish the rationale for using the prognostic test to classify patients into the prognostic groups (although this is not always sufficient to evaluate this use of a prognostic test)1,3.

Step 3: Selecting Studies and Assessing Quality

Previous reviews of prognostic indicators have demonstrated substantial variation in study design, subject inclusion criteria, methods of measuring key variables, follow-up time, methods of analysis (including definition of prognostic groups), adjustments for covariates, and presentation of results10-12. Some of these difficulties could be overcome if reviewers were given access to the individual patient-level data from studies, which would allow them to conduct their own analyses in a more uniform manner. Lacking such data, several suggestions have been made for assessing studies to make judgments about the quality of reports and whether to include or exclude them from a review5,13,14. Table 2 lists questions that should be considered. At this time, reviewers will need to decide which of these general criteria or others are appropriate for judging studies for their particular review. As always, reviewers should be explicit about any criteria that were used to exclude or include studies from a review. Validated methods to use criteria to score the quality of studies of prognostic tests need to be developed.

Table 2 Outline of Questions for Judging the Quality of Individual Studies of Prognostic Tests

Comparisons of prognostic tests should use data from the same cohort of subjects to minimize confounding of the comparison. Within a study, the prognostic tests being compared should be conducted at the same time to ensure a common starting point with respect to the patient outcome being predicted. Reviewers should also note the starting point of each study reviewed. All of the prognostic test results and interpretation should be ascertained without knowledge of the outcome to avoid ascertainment bias. Investigators should be blinded to the results of the prognostic test to avoid selective changes in treatment that could affect the outcome being predicted. Reviewers need to be aware of any previously established prognostic indicators that should be included in a comparative analysis of potential new prognostic tests, and pay close attention to the comparator against which a new prognostic test is evaluated. Any adjustments for covariates that could make studies more or less comparable also need to be noted15.

If the investigators fit a new prognostic test or prediction equation to the sample data (test development sample) by using the data to define cut-off levels or model its relationships to the outcome and estimate regression coefficient(s), the estimated predictive performance can be overly optimistic. In addition, the fitting might bias the comparison to an established prognostic method that was not fit to the same sample.

Step 4: Extracting Statistics to Evaluate Test Performance

The summary statistics reported in the selected articles need to be appropriate for the key question(s) the review is trying to address. For example, investigators commonly report estimated hazard ratios from Cox regression analyses or odds ratios from logistic regression analyses to test for associations between a potential prognostic test and the patient outcome. These measures of association address only early phases in the development of a potential prognostic test: proof of concept, perhaps validation of a potentially predictive relationship to an outcome in different patient cohorts, and, to a very limited extent, the potential to provide incremental predictive value. Potential predictors that exhibit statistically significant associations with an outcome often do not substantially discriminate between subjects who eventually do or do not experience the outcome event, because the distributions of the test result in the two outcome groups often overlap substantially even when the means are highly significantly different16,17. Statistically significant associations (hazard ratios, relative risks, or odds ratios) merely indicate that more definitive evaluation of a new predictor is warranted18,19. Nevertheless, for reviewers who are interested in these associations, there are well-established methods for summarizing estimates of hazard ratios, relative risks, or odds ratios20-23. However, the questions a systematic review could answer about the use of a prognostic test by summarizing its association with an outcome are quite limited and not likely to impact practice. More relevant are the estimates of absolute risk in different groups.
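The gap between association and discrimination can be made concrete under one simplifying assumption that is not part of this paper: the test result is normally distributed with the same standard deviation in subjects who do and do not experience the outcome. In that case the log odds ratio per standard deviation equals the standardized mean difference d, and the area under the ROC curve is Φ(d/√2). The function names below are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def auc_from_odds_ratio_per_sd(odds_ratio):
    """AUC implied by an odds ratio per standard deviation of the test.

    Assumes an equal-variance binormal model, in which the log odds ratio
    per standard deviation equals the standardized mean difference d and
    AUC = Phi(d / sqrt(2)).
    """
    d = math.log(odds_ratio)
    return normal_cdf(d / math.sqrt(2.0))


for odds_ratio in (1.5, 2.0, 3.0):
    auc = auc_from_odds_ratio_per_sd(odds_ratio)
    print(f"odds ratio per SD = {odds_ratio}: AUC = {auc:.2f}")
```

Under this model an odds ratio of 2 per standard deviation, which a large study would detect easily, corresponds to an area under the curve of only about 0.69.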

Discrimination statistics. The predictive performance of prognostic tests is often reported in a manner similar to diagnostic tests using estimates of sensitivity, specificity, and the area under the receiver operating characteristic (ROC) curve at one particular follow-up time. These indices of discrimination can be calculated retrospectively and compared when a new prognostic indicator is added to a predictive model or a prognostic test is compared to predictions made by other methods, including experienced clinicians24-27. However, these backward-looking measures of discrimination do not summarize the predicted outcome probabilities and do not directly address questions about the predictions based on a new prognostic test or its impact on patient outcomes28-30. The next section on reclassification tables describes other measures of test discrimination that can help reviewers assess, in part, the clinical impact of prognostic tests. If reviewers elect to use the more familiar and often reported discrimination statistics, then they must recognize that these statistics change over time as more patients develop the outcome being predicted. Time-dependent measures of sensitivity, specificity, and the ROC curve have been developed31. Harrell’s C-statistic is conceptually similar to an area under an ROC curve and can be derived from time-to-event analyses32,33. Examples of systematic reviews and meta-analyses of prognostic tests that used these time-dependent measures of discrimination were not found.
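For readers who have not seen it computed, the following is a minimal sketch of a concordance statistic of the kind Harrell described, written for clarity rather than speed. The function name and data layout are assumptions. Pairs with tied follow-up times are simply skipped here, which is a simplification that published implementations handle more carefully.

```python
def harrell_c(times, events, risk_scores):
    """Concordance between predicted risk and observed time to event.

    times: follow-up time for each subject
    events: 1 if the outcome event was observed, 0 if censored
    risk_scores: higher values mean higher predicted risk
    A pair is usable when the subject with the shorter time had an observed
    event. Pairs with tied times are skipped in this simplified sketch.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue
        for j in range(n):
            if times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable
```

A value of 0.5 indicates predictions no better than chance at ordering subjects, and 1.0 indicates that every usable pair is ordered correctly.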

Reclassification tables. The clinical usefulness of a prognostic test depends largely on its ability to place patients into different prognostic groups and provide accurate predictions about their future health. For example, expert guidelines use prognostic groups defined by the estimated 10-year risk of developing cardiovascular disease (<10%, 10 to 20%, and >20%) based on the Framingham cardiovascular risk score to help determine whether to recommend interventions to prevent future cardiovascular events34. Analyses of reclassification tables are now being reported to determine how adding a prognostic test reclassifies patients into the prognostic groups35-38. Table 3 shows a hypothetical example of a reclassification table. Ideally, the classification of outcome probabilities into prognostic groups (arbitrarily set at an individual predicted probability >0.10 in the example) should be based on outcome probabilities that will lead to different courses of action. If not, the reviewer needs to take note, because the observed reclassifications could be clinically meaningless in the sense that they might not be of sufficient magnitude to alter the course of action; that is to say, some reclassification of patients by a prognostic test might not make any difference in patient care. In the example, adding the new prognostic test reclassified 10% of the 1000 people originally in the lower risk group and 25% of the 400 people in the higher risk group.
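The percentages quoted for the example can be reproduced from a two-by-two cross-classification. In the sketch below the cell counts are not copied from Table 3; they are the counts implied by the figures in the text (10% of 1000 and 25% of 400), and the function name is hypothetical.

```python
# Keys are (group under the original model, group after adding the new test).
# Counts are implied by the percentages in the text, not copied from Table 3.
table = {
    ("lower", "lower"): 900,
    ("lower", "higher"): 100,
    ("higher", "lower"): 100,
    ("higher", "higher"): 300,
}


def fraction_reclassified(table, original_group=None):
    """Share of subjects whose group changed.

    If original_group is given, only subjects who started in that group
    are counted.
    """
    total = moved = 0
    for (old, new), count in table.items():
        if original_group is not None and old != original_group:
            continue
        total += count
        if old != new:
            moved += count
    return moved / total


for group in ("lower", "higher", None):
    label = group if group is not None else "overall"
    print(label, round(fraction_reclassified(table, group), 3))
```

The calculation says nothing about whether the moves were in the right direction; that requires the observed outcomes in each cell.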

Table 3 Example Reclassification Table Based on Predicted Outcome Probabilities

Reclassification tables typically provide information about the observed outcome probabilities in each prognostic group (summarized as percentages in the example) and the predicted probabilities. However, this information is often limited to a single follow-up time, and the precision of the estimates might not be reported. The differences between the estimated probabilities and observed outcomes for each prognostic group might be analyzed by a chi-square goodness-of-fit test39. However, these results will not help the reviewer determine if the differences in predicted and observed probabilities are substantially better when the new prognostic test is added. In the example depicted in Table 3, the differences between predicted and observed values for each prognostic test shown in the column and row totals are small, as expected whenever prognostic groups have a narrow range of individual predicted probabilities and the prediction models are fit to the data rather than applied to a new sample.
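One common form of such a goodness-of-fit statistic compares observed and expected event counts within each prognostic group. The sketch below uses a Hosmer-Lemeshow-style sum, which is an assumption about the form of the test rather than something specified in this paper, and the group sizes, mean predicted probabilities, and event counts are invented for illustration.

```python
def calibration_chi_square(groups):
    """Sum of squared standardized differences between observed and expected events.

    groups: iterable of (n_subjects, mean_predicted_probability, observed_events)
    Uses the Hosmer-Lemeshow-style term (O - E)^2 / (n * p * (1 - p)),
    where E = n * p is the expected number of events in the group.
    """
    statistic = 0.0
    for n, p, observed in groups:
        expected = n * p
        statistic += (observed - expected) ** 2 / (n * p * (1.0 - p))
    return statistic


# Hypothetical prognostic groups: (size, mean predicted probability, events).
groups = [(1000, 0.05, 45), (400, 0.20, 90)]
print(round(calibration_chi_square(groups), 3))
```

The statistic summarizes calibration for one test at a time, so two such values do not by themselves show whether adding a new test improved the predictions.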

Reviewers might also encounter articles that report separate reclassification tables for patients who did or did not experience the outcome event within a specific period of time, along with a summary statistic known as the net reclassification improvement (NRI)40. In the group that developed the outcome event within the specified period of time, the net improvement is the proportion of patients who were reclassified by a prognostic test into a higher probability subgroup minus the proportion who were reclassified into a lower probability subgroup. In a 2-by-2 reclassification table of only subjects who experienced the outcome event (e.g., those who died), this net difference is the estimated change in test sensitivity. In the group who did not experience the outcome event, the net improvement is the proportion of patients who were reclassified into a lower probability subgroup minus the proportion who were reclassified into a higher probability subgroup. In a 2-by-2 reclassification table of only subjects who did not experience the event within the follow-up period (e.g., those who survived), this net difference is the estimated change in specificity. The NRI is the simple sum of the net improvement in classification of patients who did and did not experience the outcome.
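The arithmetic of the NRI as described above can be sketched as follows. The function name, the coding of prognostic groups as ordered integers, and the eight-subject data set are all assumptions made for illustration.

```python
def net_reclassification_improvement(old_group, new_group, had_event):
    """Return (net improvement in events, net improvement in non-events, NRI).

    old_group, new_group: ordered group codes, higher code = higher risk
    had_event: True or 1 if the outcome occurred within the follow-up period
    """
    # counts[status] = [moved up, moved down, total]
    counts = {True: [0, 0, 0], False: [0, 0, 0]}
    for old, new, event in zip(old_group, new_group, had_event):
        c = counts[bool(event)]
        c[2] += 1
        if new > old:
            c[0] += 1
        elif new < old:
            c[1] += 1
    up_e, down_e, n_e = counts[True]
    up_n, down_n, n_n = counts[False]
    if n_e == 0 or n_n == 0:
        raise ValueError("need at least one event and one non-event")
    net_events = (up_e - down_e) / n_e        # moving up is an improvement
    net_nonevents = (down_n - up_n) / n_n     # moving down is an improvement
    return net_events, net_nonevents, net_events + net_nonevents


# Hypothetical data: 0 = lower risk group, 1 = higher risk group.
old_group = [0, 0, 1, 1, 0, 0, 1, 1]
new_group = [1, 0, 1, 1, 0, 0, 0, 1]
had_event = [1, 1, 1, 1, 0, 0, 0, 0]
print(net_reclassification_improvement(old_group, new_group, had_event))
```

Returning the two components separately is deliberate, because the same NRI can arise from gains among events, gains among non-events, or a mix of the two.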

If these calculations use the mean changes in individual predicted probabilities in the patients who did or did not experience the outcome, the result is known as the integrated discrimination index (IDI). Another formulation of the NRI calculates the probabilities of the predicted event among those who have an increase in their predicted probability given the results of a new prognostic test, the probabilities of the predicted event among those who have a decrease in their predicted probability, and the event probability in the overall sample41. These three probabilities can be estimated by time-to-event analysis but still represent only a single point of follow-up. This so-called continuous formulation of the NRI does not require one to define clinically meaningful prognostic categories. Rather, it focuses on subjects who have, to any degree, a higher or lower predicted outcome probability when a new prognostic test is employed. Not all increases or decreases in predicted probabilities would be clinically meaningful in the sense that they would prompt a change in patient management.
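Both quantities can be illustrated with individual predicted probabilities. In the sketch below the function names and data are hypothetical, and the continuous NRI is computed by direct counting at a single follow-up time, which is a simplification and not the time-to-event estimator mentioned in the text.

```python
from statistics import mean


def integrated_discrimination_index(p_old, p_new, had_event):
    """Mean change in predicted probability among events minus that among non-events."""
    change_events = [n - o for o, n, e in zip(p_old, p_new, had_event) if e]
    change_nonevents = [n - o for o, n, e in zip(p_old, p_new, had_event) if not e]
    return mean(change_events) - mean(change_nonevents)


def continuous_nri(p_old, p_new, had_event):
    """Category-free NRI: any increase or decrease in predicted probability counts."""
    def net_up(changes):
        up = sum(1 for c in changes if c > 0)
        down = sum(1 for c in changes if c < 0)
        return (up - down) / len(changes)

    change_events = [n - o for o, n, e in zip(p_old, p_new, had_event) if e]
    change_nonevents = [n - o for o, n, e in zip(p_old, p_new, had_event) if not e]
    return net_up(change_events) - net_up(change_nonevents)


# Hypothetical predicted probabilities without and with the new test.
p_old = [0.20, 0.40, 0.10, 0.30]
p_new = [0.30, 0.50, 0.05, 0.25]
had_event = [1, 1, 0, 0]
print(integrated_discrimination_index(p_old, p_new, had_event))
print(continuous_nri(p_old, p_new, had_event))
```

In this toy data set every change is in the favorable direction, so the continuous NRI reaches its maximum even though the shifts in probability are small, which illustrates the caution in the preceding paragraph.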

Estimates of the NRI or IDI from different studies could be gleaned from the literature comparing prognostic tests. Several issues need to be examined before trying to pool estimates from different studies. Reviewers should make sure the characteristics of prognostic groups, definition of the outcome event, overall probability of the event and the follow-up time did not vary substantially between studies.

Predictive values. Treatment decisions based on outcome probabilities are often dichotomous, for example, “treat those at high risk” and “don’t treat those at low risk.” If patients would be treated because a prognostic test indicates they are “high risk,” then the observed time-dependent percentages of patients developing the outcome without treatment are essentially positive predictive values (i.e., the proportion of those with a ‘positive’ prognostic test who have the event). If clinicians would not treat patients in the lower risk group, then one minus the observed time-dependent outcome probabilities are the negative predictive values (i.e., the proportion of those with a ‘negative’ prognostic test who do not have the event). For a single point of follow-up, these positive and negative predictive values can be compared using methods devised for comparing predictive values of diagnostic tests. Most likely, the ratios of positive and negative predictive values of two prognostic tests will be summarized in a report, along with a confidence interval42. The regression model proposed by Leisenring and colleagues might be used to determine how patient characteristics relate to the relative predictive values43. Methods to compare predictive values of two prognostic tests that are in the form of time-to-event curves are available if encountered during a review44-47.
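The point estimates involved are simple to compute. The sketch below uses invented counts for two hypothetical tests at a single follow-up time and deliberately omits confidence intervals, because both tests are usually measured in the same patients and the interval methods cited above account for that pairing.

```python
def predictive_values(events_high, n_high, events_low, n_low):
    """Positive and negative predictive values at one follow-up time.

    events_high / n_high: events and subjects in the "high risk" group
    events_low / n_low: events and subjects in the "low risk" group
    """
    ppv = events_high / n_high
    npv = 1.0 - events_low / n_low
    return ppv, npv


def relative_predictive_values(test_a, test_b):
    """Ratios of PPV and NPV for test A relative to test B (point estimates only)."""
    ppv_a, npv_a = predictive_values(*test_a)
    ppv_b, npv_b = predictive_values(*test_b)
    return ppv_a / ppv_b, npv_a / npv_b


# Hypothetical counts: (events_high, n_high, events_low, n_low).
test_a = (100, 400, 50, 1000)
test_b = (80, 400, 70, 1000)
r_ppv, r_npv = relative_predictive_values(test_a, test_b)
print(f"relative PPV = {r_ppv:.3f}, relative NPV = {r_npv:.3f}")
```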

Step 5: Meta-Analysis of Estimates of Outcome Probabilities

The most definitive level of evidence to answer the most important questions about a prognostic test or comparison of prognostic tests would come from randomized controlled trials designed to demonstrate a net improvement in patient outcomes and cost-effectiveness. Many studies of prognostic tests do not provide this ultimate evidence. However, a systematic review could provide estimates of outcome probabilities for decision models48. Estimates could come from either randomized controlled trials or observational studies as long as the prognostic groups they represent are well-characterized and similar. A meta-analysis could provide more precise estimates of outcome probabilities. In addition, meta-analysis of estimated outcome probabilities in a prognostic group extracted from several studies may provide some insights into the stability of the estimates and whether variation in the estimates is related to characteristics of the prognostic groups.

Methods have been developed to combine estimates of outcome probabilities from different studies20. Dear’s method uses a fixed effects regression model, while Arends’ method is similar to a DerSimonian–Laird random-effects model when there is only one common follow-up time for all studies/prognostic groups in the analysis49,50. Readers interested in this type of meta-analysis should consult these references.
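To give a sense of the simplest case, the sketch below pools outcome proportions from several studies at one common follow-up time with a generic DerSimonian–Laird random-effects model on the logit scale. It is not an implementation of either Dear’s or Arends’ method, which handle multiple follow-up times; the function name, the logit transformation, and the 1.96 multiplier for an approximate 95% interval are choices made for this illustration.

```python
import math


def pool_proportions_dl(events, totals):
    """DerSimonian-Laird random-effects pooling of proportions on the logit scale.

    events, totals: per-study event counts and group sizes for one prognostic
    group at one common follow-up time. Requires 0 < events < total in every
    study. Returns (pooled proportion, lower limit, upper limit, tau squared).
    """
    y, v = [], []
    for e, n in zip(events, totals):
        if e <= 0 or e >= n:
            raise ValueError("this sketch requires 0 < events < total")
        y.append(math.log(e / (n - e)))    # logit of the observed proportion
        v.append(1.0 / e + 1.0 / (n - e))  # approximate variance of the logit

    # Fixed-effect (inverse-variance) estimate and Cochran's Q.
    w = [1.0 / vi for vi in v]
    sum_w = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))

    # Method-of-moments estimate of the between-study variance.
    k = len(y)
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0

    # Random-effects weights, pooled logit, and its standard error.
    w_star = [1.0 / (vi + tau2) for vi in v]
    sum_w_star = sum(w_star)
    y_random = sum(wi * yi for wi, yi in zip(w_star, y)) / sum_w_star
    se = math.sqrt(1.0 / sum_w_star)

    def inv_logit(x):
        return 1.0 / (1.0 + math.exp(-x))

    return (inv_logit(y_random), inv_logit(y_random - 1.96 * se),
            inv_logit(y_random + 1.96 * se), tau2)
```

A large estimate of tau squared is a signal to look at whether the prognostic groups, outcome definitions, or follow-up times differ across studies before relying on the pooled value.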

Conclusion

There is a large and rapidly growing literature about prognostic tests. A systematic review can determine what is known and what still needs to be determined to support use of a prognostic test by decision makers. We hope this guidance will be helpful to reviewers who want to conduct an informative review of a prognostic test, and that it will spur efforts to establish consensus methods for reporting studies of prognostic tests and conducting reviews of them.

Key Points

  • Methods to conduct a clinically oriented systematic review of a prognostic test are not well established. Several issues discussed herein will need to be addressed when planning and conducting a review.

  • The intended use of the prognostic test under review needs to be specified, and predicted probabilities need to be classified into clinically meaningful prognostic groups, i.e. those that would be treated differently. The resultant prognostic groups need to be described in detail including their outcome probabilities.

  • A large number of published reports focus on the associations between prognostic indicators and patient outcomes, the first stage of development of prognostic tests. A review of these types of studies would have limited clinical value.

  • Criteria to evaluate and score the quality of studies of prognostic tests have not been firmly established. Reviewers can adapt criteria that have been developed for judging studies of diagnostic tests and cohort studies with some modifications for differences inherent in studies of prognostic tests. Suggestions are listed in Table 2.

  • Given the fundamental difference between diagnostic tests that determine the current health state of disease and prognostic tests that predict a future state of disease, some of the most commonly used statistics for evaluating diagnostic tests, such as point estimates of test sensitivity and specificity and receiver operating characteristic curves, are not as informative for prognostic tests. The most pertinent summary statistics for prognostic tests are the time-dependent observed outcome probabilities within clearly defined prognostic groups, the closeness of each group’s predicted probabilities to the observed outcomes, and how use of a new prognostic test reclassifies patients into different prognostic groups and improves predictive accuracy and overall patient outcomes.

  • Methods to compare and summarize the predictive performance of prognostic tests need further development and widespread use to facilitate systematic reviews.