Chapter 8: Meta-analysis of Test Performance When There is a “Gold Standard”
DOI: 10.1007/s11606-012-2029-1
Cite this article as:
Trikalinos, T.A., Balion, C.M., Coleman, C.I. et al. J GEN INTERN MED (2012) 27: 56. doi:10.1007/s11606-012-2029-1
Abstract
Synthesizing information on test performance metrics such as sensitivity, specificity, predictive values and likelihood ratios is often an important part of a systematic review of a medical test. Because many metrics of test performance are of interest, the meta-analysis of medical tests is more complex than the meta-analysis of interventions or associations. Sometimes, a helpful way to summarize medical test studies is to provide a “summary point”, that is, a summary sensitivity and a summary specificity. Other times, when the sensitivity or specificity estimates vary widely or when the test threshold varies, it is more helpful to synthesize data using a “summary line” that describes how the average sensitivity changes with the average specificity. Choosing the most helpful summary is subjective, and in some cases both summaries provide meaningful and complementary information. Because sensitivity and specificity are not independent across studies, the meta-analysis of medical tests is fundamentally a multivariate problem, and should be addressed with multivariate methods. More complex analyses are needed if studies report results at multiple thresholds for positive tests. At the same time, quantitative analyses are used to explore and explain any observed dissimilarity (heterogeneity) in the results of the examined studies. This can be performed in the context of proper (multivariate) meta-regressions.
KEY WORDS
gold standard; test performance; meta-analysis
INTRODUCTION
The series of papers in this supplement of the journal highlights common challenges in systematic reviews of medical tests and outlines their mitigation, as perceived by researchers partaking in the Agency for Healthcare Research and Quality (AHRQ) Effective Healthcare Program. Generic by their very nature, these challenges and their discussion apply to the larger set of systematic reviews of medical tests, and are not specific to AHRQ’s program.
This paper focuses on choosing strategies for meta-analysis of test “accuracy”, or more preferably, test performance. Meta-analysis is not required for a systematic review, but when appropriate, it should be undertaken with a dual goal: to provide summary estimates for key quantities, and to explore and explain any observed dissimilarity (heterogeneity) in the results of the examined studies.
“Summing-up” information on test performance metrics such as sensitivity, specificity, and predictive values is rarely the most informative part of a systematic review of a medical test.1–4 Key clinical questions driving the evidence synthesis (e.g., is this test alone or in combination with a test-and-treat strategy likely to improve decision-making and patient outcomes?) are only indirectly related to test performance per se. Formulating an effective evaluation approach requires careful consideration of the context in which the test will be used. These framing issues are addressed in other papers in this issue of the journal.5–7 Further, in this paper we assume that medical test performance has been measured against a “gold standard”, that is, a reference standard that is considered adequate in defining the presence or absence of the condition of interest. Another paper in this supplement discusses ways to summarize medical tests when such a reference standard does not exist.8
1) Briefly, it may be helpful to use a “summary point” (a summary sensitivity and summary specificity pair) to obtain summary test performance when sensitivity and specificity estimates do not vary widely across studies. This could happen in meta-analyses where all studies have the same explicit test positivity threshold (a threshold for categorizing the results of testing as positive or negative); if studies have different explicit thresholds, the clinical interpretation of a summary point is less obvious, and perhaps less helpful. However, an explicit common threshold is neither sufficient nor necessary for opting to synthesize data with a “summary point”; a summary point can be appropriate whenever sensitivity and specificity estimates do not vary widely across studies.
2) When the sensitivity and specificity of various studies vary over a large range, rather than using a “summary point”, it may be more helpful to describe how the average sensitivity and average specificity relate by means of a “summary line”. This oft-encountered situation can be secondary to explicit or implicit variation in the threshold for a “positive” test result; heterogeneity in populations, reference standards, or the index tests; study design; chance; or bias.
Of note, in many applications it may be informative to present syntheses in both ways, as they convey complementary information.
Deciding whether a “summary point” or a “summary line” is more helpful as a synthesis is subjective, and no hard-and-fast rules exist. We briefly outline common approaches for meta-analyzing medical tests, and discuss principles for choosing between them. However, a detailed presentation of methods or their practical application is outside the scope of this work. In addition, it is expected that readers are versed in clinical research methodology, and familiar with methodological issues pertinent to the study of medical tests. We also assume familiarity with the common measures of medical test performance (reviewed in the Appendix, and in excellent introductory papers).9 For example, we do not review challenges posed by methodological or reporting shortcomings of test performance studies.10 The Standards for Reporting of Diagnostic accuracy (STARD) initiative published a 25-item checklist that aims to improve reporting of medical tests studies.10 We refer readers to other papers in this issue11 and to several methodological and empirical explorations of bias and heterogeneity in medical test studies.12–14
Non-independence of sensitivity and specificity across studies and why it matters for meta-analysis
Negative correlation between sensitivity and specificity across studies may be expected for reasons unrelated to thresholds for positive tests. For example, in a meta-analysis evaluating the ability of serial creatine kinase-MB (CK-MB) measurements to diagnose acute cardiac ischemia in the emergency department,16, 17 the time interval from the onset of symptoms to serial CK-MB measurements (rather than the actual threshold for CK-MB) could explain the relationship between sensitivity and specificity across studies. The larger the time interval, the more CK-MB is released into the bloodstream, affecting the estimated sensitivity and specificity. Unfortunately, the term “threshold effect” is often used rather loosely to describe the relationship between sensitivity and specificity across studies, even when, strictly speaking, there is no direct evidence of variability in study thresholds for positive tests.
Because of the above, the current thinking is that in general, the study estimates of sensitivity and specificity do not vary independently, but jointly, and likely with a negative correlation. Summarizing the two correlated quantities is a multivariate problem, and multivariate methods should be used to address it, as they are more theoretically motivated.18, 19 At the same time there are situations when a multivariate approach is not practically different from separate univariate analyses. We will expand on some of these issues.
PRINCIPLES FOR ADDRESSING THE CHALLENGES

Principle 1: Favor the most informative way to summarize the data. Here we refer mainly to choosing between a summary point and a summary line, or both.

Principle 2: Explore the variability in study results with graphs and suitable analyses, rather than relying exclusively on “grand means”.
RECOMMENDED APPROACHES
Which metrics to meta-analyze
Why it does make sense to directly meta-analyze sensitivity and specificity
Summarizing studies with respect to sensitivity and specificity aligns well with our understanding of the effect of positivity thresholds for diagnostic tests. Further, sensitivity and specificity are often considered independent of the prevalence of the condition under study (though this is an oversimplification that merits deeper discussion).20 The summary sensitivity and specificity obtained by a direct meta-analysis will always be between zero and one. Because these two metrics do not have as intuitive an interpretation as likelihood ratios or predictive values,9 we can use formulas in the Appendix to back-calculate “summary” (overall) predictive values and likelihood ratios that correspond to the summary sensitivity and specificity for a range of plausible prevalence values.
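The back-calculation described above can be sketched in a few lines of Python. This is a minimal illustration using the standard definitions of likelihood ratios and of predictive values via Bayes' theorem; the function names and the summary estimates (0.90 and 0.80) are hypothetical and chosen only for illustration.

```python
def likelihood_ratios(sens, spec):
    """Return (LR+, LR-) implied by a sensitivity/specificity pair."""
    lr_pos = sens / (1.0 - spec)
    lr_neg = (1.0 - sens) / spec
    return lr_pos, lr_neg


def predictive_values(sens, spec, prevalence):
    """Return (PPV, NPV) from Bayes' theorem at a given prevalence."""
    tp = sens * prevalence                    # true positives per person tested
    fp = (1.0 - spec) * (1.0 - prevalence)    # false positives
    fn = (1.0 - sens) * prevalence            # false negatives
    tn = spec * (1.0 - prevalence)            # true negatives
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical summary estimates, chosen only for illustration.
summary_sens, summary_spec = 0.90, 0.80

lr_pos, lr_neg = likelihood_ratios(summary_sens, summary_spec)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}")

# Predictive values over a range of plausible prevalence values.
for prev in (0.05, 0.20, 0.50):
    ppv, npv = predictive_values(summary_sens, summary_spec, prev)
    print(f"prevalence={prev:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

The same summary sensitivity and specificity yield quite different predictive values as the assumed prevalence changes, which is why a range of prevalence values is reported rather than a single one.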
Why it does not make sense to directly meta-analyze positive and negative predictive values or prevalence
Predictive values are dependent on prevalence estimates. Because prevalence is often wide ranging, and because many medical test studies have a case-control design (where prevalence cannot be estimated), it is rarely meaningful to directly combine these across studies. Instead, predictive values can be calculated as mentioned above from the summary sensitivity and specificity for a range of plausible prevalence values.
Why directly meta-analyzing likelihood ratios could be problematic
Positive and negative likelihood ratios could also be combined in the absence of threshold variation, and in fact, many authors give explicit guidance to that effect.21 However, this practice does not guarantee that the summary positive and negative likelihood ratios are “internally consistent”. Specifically, it is possible to get summary likelihood ratios that correspond to impossible “summary” sensitivities or specificities (outside the zero to one interval).22 Back-calculating the “summary” likelihood ratios from summary sensitivities and specificities avoids this complication. Nevertheless, these aberrant cases are not common,23 and calculating summary likelihood ratios by directly meta-analyzing them or by back-calculation from the summary sensitivity and specificity rarely results in different conclusions.23
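To make the notion of “internal consistency” concrete, the definitions LR+ = Se/(1 − Sp) and LR− = (1 − Se)/Sp can be solved for sensitivity and specificity. The sketch below does this and flags pairs that imply values outside the zero to one interval. The function names and the example pairs are hypothetical; the second pair is an artificial illustration, not a result from any review.

```python
def sens_spec_from_lrs(lr_pos, lr_neg):
    """Solve LR+ = Se/(1-Sp) and LR- = (1-Se)/Sp for (Se, Sp)."""
    if lr_pos == lr_neg:
        raise ValueError("LR+ and LR- must differ to identify Se and Sp")
    spec = (lr_pos - 1.0) / (lr_pos - lr_neg)
    sens = lr_pos * (1.0 - lr_neg) / (lr_pos - lr_neg)
    return sens, spec


def is_internally_consistent(lr_pos, lr_neg):
    """True if the implied sensitivity and specificity both lie in [0, 1]."""
    sens, spec = sens_spec_from_lrs(lr_pos, lr_neg)
    return 0.0 <= sens <= 1.0 and 0.0 <= spec <= 1.0


# A pair that corresponds to a valid sensitivity/specificity pair.
print(sens_spec_from_lrs(4.5, 0.125), is_internally_consistent(4.5, 0.125))

# An artificial pair that no sensitivity/specificity pair can produce.
print(sens_spec_from_lrs(2.0, 1.2), is_internally_consistent(2.0, 1.2))
```

Under these formulas, a pair with LR+ above one and LR− below one always maps back to a valid sensitivity and specificity; the inconsistency arises when separately pooled ratios fall on the same side of one.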
Directly meta-analyzing diagnostic odds ratios
The synthesis of diagnostic odds ratios is straightforward and follows standard meta-analysis methods.24, 25 The diagnostic odds ratio is closely linked to sensitivity, specificity, and likelihood ratios, and it can be easily included in meta-regression models to explore the impact of explanatory variables on between-study heterogeneity. Apart from challenges in interpreting diagnostic odds ratios, a disadvantage is that it is impossible to weight the true positive and false positive rates separately.
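As a minimal sketch of the quantities involved, the code below computes the diagnostic odds ratio and the standard error of its logarithm from a single 2 × 2 table, and then shows why the metric cannot distinguish between errors in the two directions. The counts are hypothetical, and the 0.5 continuity correction for zero cells is a common convention assumed here rather than something prescribed by the text.

```python
import math


def diagnostic_odds_ratio(tp, fp, fn, tn, correction=0.5):
    """Return (DOR, standard error of ln DOR) for one 2 x 2 table.

    If any cell is zero, `correction` is added to every cell (a common
    convention, assumed here) so the ratio and its variance stay finite.
    """
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + correction for x in (tp, fp, fn, tn))
    dor = (tp * tn) / (fp * fn)
    se_log_dor = math.sqrt(1.0 / tp + 1.0 / fp + 1.0 / fn + 1.0 / tn)
    return dor, se_log_dor


def dor_from_rates(sens, spec):
    """DOR written in terms of sensitivity and specificity (equals LR+ / LR-)."""
    return (sens / (1.0 - sens)) * (spec / (1.0 - spec))


# Hypothetical counts: 100 people with the condition, 100 without.
dor, se = diagnostic_odds_ratio(tp=90, fp=20, fn=10, tn=80)
print(f"DOR = {dor:.1f}, SE(ln DOR) = {se:.3f}")

# Swapping sensitivity and specificity leaves the DOR unchanged.
print(dor_from_rates(0.90, 0.80), dor_from_rates(0.80, 0.90))
```

Two tests with mirrored sensitivity and specificity share the same diagnostic odds ratio, which illustrates the disadvantage noted above: the single number carries no information about the balance between true positive and false positive rates.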
Desired characteristics of meta-analysis methods
Table 1. Commonly Used Methods for Meta-Analysis of Medical Test Performance

| Method | Description or comment | Does it have desired characteristics? |
| Summary point | | |
| Independent meta-analysis of sensitivity and specificity | Separate meta-analyses per metric. Within-study variability preferably modeled by the binomial distribution.44 | Ignores correlation between sensitivity and specificity. Underestimates summary sensitivity and specificity and yields incorrect confidence intervals26 |
| Joint (multivariate) meta-analysis of sensitivity and specificity based on hierarchical modeling | Based on multivariate (joint) modeling of sensitivity and specificity. Two families of models26, 30 (see text), equivalent when there are no covariates18. Modeling preferably using binomial likelihood rather than normal approximations30, 37, 45, 46 | The generally preferred method |
| Summary line | | |
| Moses and Littenberg model | Summary line based on a simple regression of the difference of logit-transformed true and false positive rates versus their average.32–34 | Ignores unexplained variation between studies (fixed effects). Does not account for correlation between sensitivity and specificity. Does not account for variability in the independent variable. Inability to weight studies optimally; yields wrong inferences when covariates are used |
| Random intercept augmentation of the Moses-Littenberg model | Regression of the difference of logit-transformed true and false positive rates versus their average with random effects to allow for variability across studies35, 36 | Does not account for correlation between sensitivity and specificity. Does not account for variability in the independent variable |
| Summary ROC based on hierarchical modeling | Same as for multivariate meta-analysis to obtain a summary point (hierarchical modeling).26, 30 Many ways to obtain a (hierarchical) summary ROC: Rutter-Gatsonis (most common)30 | Most theoretically motivated method. Rutter-Gatsonis HSROC recommended in the Cochrane handbook,47 as it is the method with which there is most experience |
We will focus on the case where each study reports a single pair of sensitivity and specificity at a given threshold (although thresholds can differ across studies). Another, more complex situation arises when multiple sensitivity and specificity pairs (at different thresholds) are reported in each study. Statistical models for the latter case exist, but there is less empirical evidence on their use. These will be described briefly, as a special case.
Preferred methods for obtaining a “summary point” (summary sensitivity and specificity): two families of hierarchical models
When a “summary point” is deemed a helpful summary of a collection of studies, one should ideally perform a multivariate meta-analysis of sensitivity and specificity, i.e., a joint analysis of both quantities, rather than separate univariate meta-analyses. This is not only theoretically motivated,26–28 but also corroborated by simulation analyses.1, 27, 29
Multivariate meta-analyses require advanced hierarchical modeling. We can group the commonly used hierarchical models in two families: the so-called “bivariate model”26 and the “hierarchical summary ROC” (HSROC) model.30 Both use two levels to model the statistical distributions of data. At the first level, they model the counts of the 2 × 2 table within each study, which accounts for within-study variability. At the second level, they model the between-study variability (heterogeneity), allowing for the theoretically expected non-independence of sensitivity and specificity across studies. The two families differ in their parameterization at this second level: the bivariate model uses parameters that are transformations of the average sensitivity and specificity, while the HSROC model uses a scale parameter and an accuracy parameter, which are functions of sensitivity and specificity, and define an underlying hierarchical summary ROC curve.
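For readers who prefer notation, the two-level structure can be written out as follows. This is a sketch in commonly used notation; the symbols are assumptions introduced here for illustration and do not appear in the text. In study i, y_{1i} of n_{1i} people with the condition and y_{0i} of n_{0i} people without it are correctly classified; in the HSROC form, y_{ij} is the number of positive tests among the n_{ij} people in group j, π_{ij} is the probability of a positive test, and X_{ij} is −1/2 for people without the condition and +1/2 for people with it.

```latex
% Bivariate model: binomial within-study level, bivariate normal between-study level
\begin{align*}
y_{1i} &\sim \mathrm{Binomial}(n_{1i}, \mathrm{Se}_i), \qquad
y_{0i} \sim \mathrm{Binomial}(n_{0i}, \mathrm{Sp}_i) \\
\begin{pmatrix} \operatorname{logit}\mathrm{Se}_i \\ \operatorname{logit}\mathrm{Sp}_i \end{pmatrix}
&\sim N\!\left( \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix},
\begin{pmatrix} \sigma_A^2 & \sigma_{AB} \\ \sigma_{AB} & \sigma_B^2 \end{pmatrix} \right)
\end{align*}

% HSROC model (Rutter-Gatsonis form): positivity theta_i, accuracy alpha_i, shape beta
\begin{align*}
y_{ij} &\sim \mathrm{Binomial}(n_{ij}, \pi_{ij}), \qquad
\operatorname{logit}\pi_{ij} = (\theta_i + \alpha_i X_{ij}) \exp(-\beta X_{ij}) \\
\theta_i &\sim N(\Theta, \sigma_\theta^2), \qquad
\alpha_i \sim N(\Lambda, \sigma_\alpha^2)
\end{align*}
```

In the bivariate form the covariance σ_{AB} carries the non-independence of sensitivity and specificity across studies; in the HSROC form the same information is carried by the study-specific positivity and accuracy parameters together with the shape parameter β.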
In the absence of covariates, the two families of hierarchical models are mathematically equivalent; one can use simple formulas to relate the fitted parameters of the bivariate model to the HSROC model and vice versa, rendering choices between the two approaches moot.18 The importance of choosing between the two families becomes evident in meta-regression analyses, when covariates are used to explore between-study heterogeneity. The differences in design and conduct of the included diagnostic accuracy studies may affect the choice of the model.18 For example, “spectrum effects,” where the subjects included in a study are not representative of the patients who will receive the test in practice,31 “might be expected to impact test accuracy rather than the threshold, and might therefore be most appropriately investigated using the HSROC approach. Conversely, between-study variation in disease severity will (likely) affect sensitivity but not specificity, leading to a preference for the bivariate approach.”18 When there are covariates in the model, the HSROC model allows direct evaluation of the difference in accuracy or threshold parameters or both, which affect the degree of asymmetry of the SROC curve, and how far above the diagonal (the line of no diagnostic information) it lies.18 Bivariate models, on the other hand, allow for direct evaluation of covariates on sensitivity or specificity or both. Systematic reviewers are encouraged to look at study characteristics and think through how study characteristics could affect the diagnostic accuracy, which in turn might affect the choice of the meta-regression model.
Preferred methods for obtaining a “summary line”
When a summary line is deemed more helpful in summarizing the available studies, we recommend summary lines obtained from hierarchical modeling, instead of several simpler approaches (Table 1).32–36 As mentioned above, when there are no covariates, the parameters of hierarchical summary lines can be calculated from the parameters of the bivariate random effects models using formulas.18, 30, 37 In fact, a whole range of HSROC lines can be constructed using parameters from the fitted bivariate model;37, 38 one proposed by Rutter and Gatsonis30 is an example. The various HSROC curves represent alternative characterizations of the bivariate distribution of sensitivity and specificity, and can thus have different shapes. Briefly, apart from the commonly used Rutter-Gatsonis HSROC curve, alternative curves include those obtained from a regression of logit-transformed true positive rate on logit-transformed false positive rate; logit false positive rate on logit true positive rate; or the major axis regression between logit true and false positive rates.37, 38
When the estimated correlation between sensitivity and specificity is positive (as opposed to the typical negative correlation) the latter three alternative models can generate curves that follow a downward slope from left to right. This is not as rare as once thought:37 a downward slope (from left to right) was observed in approximately one out of three meta-analyses in a large empirical exploration of 308 meta-analyses (report under review, Tufts Evidence-based Practice Center). Chappell et al. argued that in meta-analyses with evidence of positive estimated correlation between sensitivity and specificity (e.g., based on the correlation estimate and confidence interval or its posterior distribution) it is meaningless to use an HSROC line to summarize the studies,38 as a “threshold effect” explanation is not possible. Yet, even if the estimated correlation between sensitivity and specificity is positive (i.e., not in the “expected” direction), an HSROC still represents how the summary sensitivity changes with the summary specificity. The difference is that the explanation for the pattern of the studies cannot involve a “threshold effect”; rather, it is likely that an important covariate has not been included in the analysis (see the proposed algorithm below).38
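The behavior of the alternative summary lines can be illustrated with a short sketch. It assumes a fitted bivariate model with between-study variances var_a (logit sensitivity) and var_b (logit specificity) and covariance cov_ab; these names and the numeric values are hypothetical. The slopes are expressed in logit ROC space, with x = logit(false positive rate) = −logit(specificity) and y = logit(true positive rate). The Rutter-Gatsonis slope is taken here to be the ratio of the two standard deviations, which is our understanding of how that curve relates to the bivariate parameters; treat it as an assumption of the sketch.

```python
import math


def summary_line_slopes(var_a, var_b, cov_ab):
    """Slopes of candidate summary lines in logit ROC space.

    var_a  : between-study variance of logit(sensitivity)
    var_b  : between-study variance of logit(specificity)
    cov_ab : covariance of logit(sensitivity) and logit(specificity)
    """
    # Moving from specificity to false positive rate flips the sign of the covariance.
    s_xx, s_yy, s_xy = var_b, var_a, -cov_ab
    if s_xy == 0:
        raise ValueError("zero covariance: regression-based slopes are undefined")
    diff = s_yy - s_xx
    return {
        "tpr_on_fpr": s_xy / s_xx,          # regress logit TPR on logit FPR
        "fpr_on_tpr": s_yy / s_xy,          # regress logit FPR on logit TPR
        "major_axis": (diff + math.sqrt(diff**2 + 4.0 * s_xy**2)) / (2.0 * s_xy),
        "rutter_gatsonis": math.sqrt(s_yy / s_xx),  # assumed form, always positive
    }


# Typical case: sensitivity and specificity negatively correlated across studies.
print(summary_line_slopes(var_a=0.5, var_b=0.4, cov_ab=-0.2))

# Atypical case: positive correlation between sensitivity and specificity.
print(summary_line_slopes(var_a=0.5, var_b=0.4, cov_ab=0.2))
```

With a negative covariance all four slopes are positive, so every curve rises from left to right. With a positive covariance the three regression-based lines slope downward, while the Rutter-Gatsonis line keeps a positive slope. Because the logit transformation is monotone, the direction of each line carries over to the untransformed ROC space.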
A special case: joint analysis of sensitivity and specificity when studies report multiple thresholds
It is not uncommon for some studies to report multiple sensitivity and specificity pairs at several thresholds for positive tests. One option is to decide on a single threshold from each study and apply the aforementioned methods. To some extent, the setting in which the test is used can guide the selection of the threshold. For example, in some cases, the threshold which gives the highest sensitivity may be appropriate for medical tests used to rule out disease. Another option is to use all available thresholds per study. Specifically, Dukic and Gatsonis extended the HSROC model discussed above to analyze sensitivity and specificity data reported at more than one threshold.39 Further, if each study reports enough data on sensitivity and specificity to construct a ROC curve, Kester and Buntinx40 proposed a little-used method to combine whole ROC curves.
A WORKABLE ALGORITHM

A summary point may be less helpful or interpretable when the studies have different explicit thresholds for positive tests, and when the estimates of sensitivity vary widely along different specificities. In such cases, a summary line may be more informative.

A summary line may not be well estimated when the sensitivities and specificities of the various studies show little variability or when their estimated correlation across studies is small. Further, if there is evidence that the estimated correlation of sensitivity and specificity across studies is positive (rather than negative, which would be more typical), a “threshold effect” is not a plausible explanation for the observed pattern across studies. Rather, it is likely that an important covariate has not been taken into account.

In many applications, a reasonable case can be made for summarizing studies both with a summary point and with a summary line, as these provide alternative perspectives.
Step 1: Start by considering sensitivity and specificity independently
This step is probably self-explanatory; it encourages reviewers to familiarize themselves with the pattern of study-level sensitivities and specificities. It is very instructive to create side-by-side forest plots of sensitivity and specificity in which studies are ordered by either sensitivity or specificity. The point of the graphical assessment is to obtain a visual impression of the variability of sensitivity and specificity across studies, as well as an impression of any relationship between sensitivity and specificity across studies, particularly if such a relationship is prominent (Fig. 1 and illustrative examples).
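The data behind such side-by-side forest plots can be prepared with a short sketch like the one below. It computes study-level sensitivity and specificity with Wilson score intervals and orders the studies by sensitivity. The choice of the Wilson interval, the function names, and the 2 × 2 counts are all assumptions made for illustration; the counts are not taken from any real review.

```python
import math


def wilson_interval(events, total, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    p = events / total
    denom = 1.0 + z**2 / total
    centre = (p + z**2 / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z**2 / (4.0 * total**2)) / denom
    return centre - half, centre + half


def study_rows(studies):
    """Per-study sensitivity and specificity with intervals, ordered by sensitivity."""
    rows = []
    for name, tp, fp, fn, tn in studies:
        rows.append({
            "study": name,
            "sens": tp / (tp + fn),
            "sens_ci": wilson_interval(tp, tp + fn),
            "spec": tn / (tn + fp),
            "spec_ci": wilson_interval(tn, tn + fp),
        })
    return sorted(rows, key=lambda r: r["sens"])


# Hypothetical 2 x 2 counts (study, TP, FP, FN, TN).
studies = [
    ("Study A", 45, 10, 5, 90),
    ("Study B", 30, 25, 20, 75),
    ("Study C", 48, 30, 2, 70),
]

for r in study_rows(studies):
    print(f"{r['study']}: "
          f"sens {r['sens']:.2f} ({r['sens_ci'][0]:.2f} to {r['sens_ci'][1]:.2f}), "
          f"spec {r['spec']:.2f} ({r['spec_ci'][0]:.2f} to {r['spec_ci'][1]:.2f})")
```

Reading the two columns together, with studies ordered by one metric, gives a first impression of whether the other metric tends to move in the opposite direction.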
If a summary point is deemed a helpful summary of the data, it is reasonable to first perform separate meta-analyses of sensitivity and specificity. The differences in the point estimates of summary sensitivity and specificity with univariate (separate) versus bivariate (joint) meta-analyses are often small. In an empirical exploration of 308 meta-analyses, differences in the estimates of summary sensitivity and specificity were rarely larger than 5% (report under review, Tufts Evidence-based Practice Center). The width of the confidence intervals for the summary sensitivity and specificity is also similar between univariate and bivariate analyses. This suggests that practically, univariate and multivariate analyses may yield comparable results. However, our recommendation is to prefer reporting the results from the hierarchical (multivariate) meta-analysis methods because of their better theoretical motivation and because of their natural symmetry with the multivariate methods that yield summary lines.
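A separate (univariate) meta-analysis of this kind can be sketched as follows. The example pools proportions on the logit scale with a DerSimonian-Laird random-effects estimator. This is a swapped-in, simplified technique based on a normal approximation; it is not the binomial-likelihood or bivariate approach the text prefers, and it ignores the correlation between sensitivity and specificity. The function name, the continuity correction, and the counts are hypothetical.

```python
import math


def pool_logit_proportions(events, totals, correction=0.5):
    """DerSimonian-Laird random-effects pooling of proportions on the logit scale.

    Returns (pooled proportion, lower 95% limit, upper 95% limit, tau squared).
    """
    y, v = [], []
    for e, n in zip(events, totals):
        if e == 0 or e == n:               # continuity correction (assumed convention)
            e, n = e + correction, n + 2.0 * correction
        y.append(math.log(e / (n - e)))    # logit of the study proportion
        v.append(1.0 / e + 1.0 / (n - e))  # approximate within-study variance

    w = [1.0 / vi for vi in v]             # inverse-variance (fixed-effect) weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sw - sum(wi**2 for wi in w) / sw
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0

    w_re = [1.0 / (vi + tau2) for vi in v]  # random-effects weights
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    def expit(x):
        return 1.0 / (1.0 + math.exp(-x))

    return expit(mu), expit(mu - 1.96 * se), expit(mu + 1.96 * se), tau2


# Hypothetical data: true positives and numbers with the condition in four studies.
tp = [45, 30, 48, 60]
diseased = [50, 50, 50, 80]
sens, lo, hi, tau2 = pool_logit_proportions(tp, diseased)
print(f"summary sensitivity {sens:.2f} ({lo:.2f} to {hi:.2f}), tau^2 = {tau2:.2f}")
```

Specificity would be pooled the same way from true negatives and the numbers without the condition. The results of such a sketch are best treated as a first look to be compared with the hierarchical (bivariate) analysis, not as a substitute for it.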
Step 2: Multivariate meta-analysis (when each study reports a single threshold)
To obtain a summary point, meta-analysts should perform bivariate meta-analyses (preferably using the exact binomial likelihood).
Meta-analysts should obtain summary lines based on multivariate meta-analysis models. The interpretation of the summary line should not automatically be that there are “threshold effects”. This is most obvious when performing meta-analyses with evidence of a positive correlation between sensitivity and specificity, which cannot be attributed to a “threshold effect”, as mentioned above.
If more than one threshold is reported per study and there is no strong a priori rationale to review only results for a specific threshold, meta-analysts should consider incorporating alternative thresholds into the appropriate analyses discussed previously. Tentatively, we encourage both qualitative analysis via graphs and quantitative analyses via one of the multivariate methods mentioned above.
Step 3: Explore between-study heterogeneity
Other than accounting for the presence of a “threshold effect”, the HSROC and bivariate models provide flexible ways to test and explore between-study heterogeneity. The HSROC model allows one to examine whether any covariates (study characteristics) explain the observed heterogeneity in the accuracy and threshold parameters. One can use the same set of covariates for both parameters, but this is not mandatory, and should be judged for the application at hand. On the other hand, bivariate models allow one to use covariates to explain heterogeneity in sensitivity or specificity or both; and again, covariates for each measure can be different. Covariates that reduce the unexplained variability across studies (heterogeneity) may represent important characteristics that should be taken into account when summarizing the studies, or they may represent spurious associations. We refer to other texts for a discussion of the premises and pitfalls of meta-regressions.24, 42 Factors reflecting differences in patient populations and methods of patient selection, methods of verification and interpretation of results, clinical setting, and disease severity are common sources of heterogeneity. Investigators are encouraged to use multivariate models to explore heterogeneity, especially when they have chosen these methods for combining studies.
Illustrations
We briefly demonstrate the above with two applied examples. The first example on D-dimer assays for the diagnosis of venous thromboembolism15 shows heterogeneity which could be attributed to a “threshold effect” as discussed by Lijmer et al.43 The second example is from an evidence report on the use of serial creatine kinase-MB measurements for the diagnosis of acute cardiac ischemia,16, 17 and shows heterogeneity for another reason.
D-dimers for diagnosis of venous thromboembolism
D-dimers are fragments specific for fibrin degradation in plasma, and can be used to diagnose venous thromboembolism. Figure 1 presents forest plots of the sensitivity and specificity and the likelihood ratios for the D-dimer example.43 Sensitivity and specificity appear more heterogeneous than the likelihood ratios (this is true by formal testing for heterogeneity). This may be due to threshold variation in these studies (from 120 to 550 ng/mL, when stated; Fig. 1), or due to other reasons.43
Second example: Serial creatine kinase-MB measurements for diagnosing acute cardiac ischemia
Meta-Regression-Based Comparison of Diagnostic Performance

| Meta-analysis metric | ≤3 hours | >3 hours | p-Value for the comparison across subgroups |
| Summary sensitivity (percent) | 80 (64 to 90) | 96 (85 to 99) | 0.036 |
| Summary specificity (percent) | 97 (94 to 98) | 97 (95 to 99) | 0.56 |

Note that properly specified bivariate meta-regressions (or HSROC-based meta-regressions) can be used to compare two or more medical tests. The specification of the meta-regression models will be different when the comparison is indirect (different medical tests are examined in independent studies) or direct (the different medical tests are applied in the same patients in each study).
OVERALL RECOMMENDATIONS

Consider presenting a “summary point” when sensitivity and specificity do not vary widely across studies, and studies use the same explicit or “implicit threshold”.

To obtain a summary sensitivity and specificity use the theoretically motivated bivariate meta-analysis models.

Back-calculate overall positive and negative predictive values from summary estimates of sensitivity and specificity, and for a plausible range of prevalence values, rather than meta-analyzing them directly.

Back-calculate overall positive and negative likelihood ratios from summary estimates of sensitivity and specificity, rather than meta-analyzing them directly.


If the sensitivity and specificity vary over a large range, it may be more helpful to use a summary line, which best describes the relationship of the average sensitivity and specificity. The summary line approach is also most helpful when different explicit thresholds are used across studies. To obtain a summary line use multivariate meta-analysis methods such as the HSROC model.

Several SROC lines can be obtained based on multivariate meta-analysis models, and they can have different shapes.

If there is evidence of a positive correlation, the variability in the studies cannot be secondary to a “threshold effect”; explore for missing important covariates. Arguably, the summary line is a valid description of how average sensitivity relates to average specificity.


If more than one threshold is reported per study, this has to be taken into account in the quantitative analyses. We encourage both qualitative analysis via graphs and quantitative analyses via proper methods.

One should explore the impact of study characteristics on summary results in the context of the primary methodology used to summarize studies, using meta-regression-based analyses or subgroup analyses.
Acknowledgment
This manuscript is based on work funded by the Agency for Healthcare Research and Quality (AHRQ). All authors are members of AHRQ-funded Evidence-based Practice Centers. The opinions expressed are those of the authors and do not reflect the official position of AHRQ or the U.S. Department of Health and Human Services.
Conflict of Interest
The authors declare that they do not have a conflict of interest.