The manner in which the accuracy of clinical tests is mathematically summarised in the biomedical literature has important implications for clinicians. Appropriate accuracy measures should convey the meaning of the study results with scientifically robust statistics, without exaggerating or underestimating the clinical significance of the findings. Failure to use appropriate measures may lead authors of primary accuracy studies to draw biased conclusions.[1] In systematic reviews of test accuracy literature, there are many ways of synthesising results from several studies, not all of which are considered to be scientifically robust. For example, measures such as sensitivity and specificity, commonly used in primary studies, are not considered suitable for pooling separately in meta-analysis.[2] Variations in reporting of summary accuracy and use of inappropriate summary statistics may increase the risk of misinterpretation of the clinical value of tests.

A recent study evaluated a small sample of meta-analytical reviews of screening tests to demonstrate the variety of approaches used to quantitatively summarise accuracy results.[3] This study confined itself to a limited Medline search. It examined only meta-analytical studies, so reviews not using quantitative synthesis were excluded. It did not look at accuracy measures used to report results of primary studies separately from those used for meta-analyses. To address these issues, we undertook a comprehensive search to survey systematic reviews (with and without meta-analysis) of test accuracy literature and assessed the measures used for reporting results of included primary studies as well as their quantitative synthesis.


We manually searched for relevant reviews in the Database of Abstracts of Reviews of Effectiveness (DARE).[4] To limit the impact of human error inherent in manual searching, we complemented it with electronic searching. DARE was searched electronically with word variants of relevant terms (diagnostic, screening, test, likelihood ratio, sensitivity, specificity, positive and negative predictive value) combined using OR. From 1994 to 2000, DARE[4] identified 1897 reviews of different types by regular electronic searching of several bibliographic databases, hand searching of key major medical journals, and scanning grey literature (search strategy and selection criteria can be found at ). The structured abstracts of these reviews were screened independently by the authors to identify systematic reviews of test accuracy. Full texts were obtained for those abstracts judged to be potentially relevant. Reviews addressing test development and diagnostic effectiveness or cost effectiveness were excluded. Any disagreements about review selection were resolved by consensus.

From each selected review, we extracted the measures of test accuracy used to report the results of the primary studies included in the review. If a meta-analysis was conducted, we also extracted the summary accuracy measures. The various accuracy measures are shown in Table 1. For the primary studies, we sought the following: sensitivity or specificity, predictive values, likelihood ratios and diagnostic odds ratio. For meta-analysis, we sought the summary measures pooling the above results and summary receiver operating characteristics (ROC) plot or values. All extracted data were double-checked. We divided the reviews into two groups arbitrarily according to time of publication: one group covering the period 1994–97 (50 reviews) and another covering 1998–2000 (40 reviews). This allowed us to assess whether there were any significant differences in the measures used to report test accuracy results between reviews published earlier and those published later. As the approaches to summarising results are not mutually exclusive, we evaluated and reported the most commonly used measures and their most common combinations. We used the chi-squared test to compare proportions between the two groups.
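The comparison of proportions between the two publication periods can be sketched as follows. This is a minimal illustration, assuming a Pearson chi-squared test on a 2×2 table without continuity correction; the text does not state which variant was used. The function name `chi_squared_2x2` and the example counts are hypothetical and are not taken from the survey.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    Rows are the two publication periods; columns are reviews that
    did / did not use a given accuracy measure:
        period 1: a used it, b did not
        period 2: c used it, d did not
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("each row and column total must be non-zero")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the upper tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts for illustration only (NOT the survey's data):
# 30 of 50 earlier reviews and 25 of 40 later reviews use some measure.
stat, p = chi_squared_2x2(30, 20, 25, 15)
print(f"chi-squared = {stat:.3f}, p = {p:.3f}")
```

With small expected cell counts, an exact test or a continuity correction may be more appropriate than this uncorrected statistic.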

Table 1 Measures of accuracy of dichotomous test results
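For readers who want the measures in computational form, the standard textbook definitions for a dichotomous test can be sketched from the four cells of a 2×2 table. This is an illustrative sketch only: the class `TwoByTwo`, the function `accuracy_measures` and the example counts are hypothetical names and numbers, not part of the survey.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    tp: int  # test positive, condition present
    fp: int  # test positive, condition absent
    fn: int  # test negative, condition present
    tn: int  # test negative, condition absent


def accuracy_measures(t):
    """Return the usual accuracy measures for a dichotomous test.

    Raises ZeroDivisionError if a required denominator is zero
    (for example, no false positives when computing LR+).
    """
    sensitivity = t.tp / (t.tp + t.fn)
    specificity = t.tn / (t.tn + t.fp)
    false_positive_rate = t.fp / (t.fp + t.tn)
    false_negative_rate = t.fn / (t.fn + t.tp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "positive_predictive_value": t.tp / (t.tp + t.fp),
        "negative_predictive_value": t.tn / (t.tn + t.fn),
        "lr_positive": sensitivity / false_positive_rate,
        "lr_negative": false_negative_rate / specificity,
        "diagnostic_odds_ratio": (t.tp * t.tn) / (t.fp * t.fn),
    }


# Hypothetical primary study, for illustration only.
for name, value in accuracy_measures(TwoByTwo(tp=90, fp=20, fn=10, tn=80)).items():
    print(f"{name}: {value:.3f}")
```

Predictive values depend on the prevalence of the condition in the study sample, whereas sensitivity, specificity and likelihood ratios are computed within the diseased and non-diseased groups separately.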


Of the abstracts available in DARE, 150 were considered to be potentially relevant. After excluding reviews that addressed test development and diagnostic effectiveness or cost, 90 reviews of test accuracy remained for inclusion in our survey. There were 45 reviews of dichotomous test results, 42 reviews of continuous results dichotomised by the original authors, and 3 reviews that contained both result types. Meta-analysis was used in 60/90 (67%) reviews. Of the 90 reviews, 50 were published in 1994–97 and 40 in 1998–2000. (See Additional File: BMC_IncludedRefList_04032002 for a complete listing of the 90 reviews included in our study.)

As shown in Table 2, sensitivity or specificity was used for reporting the results of primary studies in 65/90 (72%) reviews, predictive values in 26/90 (29%), and likelihood ratios in 20/90 (22%). For meta-analysis, independently pooled sensitivity or specificity was used in 35/60 (58%) reviews, pooled predictive values in 11/60 (18%), pooled likelihood ratios in 13/60 (22%), and pooled diagnostic odds ratio in 5/60 (8%). Summary ROC was used in 44/60 (73%) of the meta-analyses. There were no significant differences between reviews published earlier and those published later, as shown in Table 2.

Table 2 Measures of test accuracy reported in reviews of diagnostic literature (1994–2000)


Our study showed that sensitivity and specificity remained in frequent use, both for primary studies and for meta-analyses, over the time period surveyed. Sensitivity and specificity are considered inappropriate for meta-analyses, as they do not behave independently when they are pooled from various primary studies to generate separate averages.[2] In our survey, separate pooling of sensitivities or specificities was used frequently in meta-analyses where summary ROC would have been more appropriate.[5–7]
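To illustrate how a summary ROC curve treats sensitivity and specificity jointly rather than averaging them separately, the sketch below fits a curve by regressing the log diagnostic odds ratio on a threshold proxy, in the style of the Moses-Littenberg method. This is one commonly described approach and is an assumption on our part; the reviews surveyed may have used other methods. The function names `fit_sroc` and `sroc_tpr`, the 0.5 continuity correction and the study counts are all hypothetical choices for illustration.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def fit_sroc(studies):
    """Fit D = intercept + slope * S by unweighted least squares.

    studies: iterable of (tp, fp, fn, tn) counts, one tuple per study.
    D = logit(TPR) - logit(FPR) is the log diagnostic odds ratio;
    S = logit(TPR) + logit(FPR) is a proxy for the test threshold.
    A 0.5 continuity correction is added to every cell (an assumption).
    """
    d_vals, s_vals = [], []
    for tp, fp, fn, tn in studies:
        tpr = (tp + 0.5) / (tp + fn + 1)
        fpr = (fp + 0.5) / (fp + tn + 1)
        d_vals.append(logit(tpr) - logit(fpr))
        s_vals.append(logit(tpr) + logit(fpr))
    n = len(d_vals)
    mean_s, mean_d = sum(s_vals) / n, sum(d_vals) / n
    sxx = sum((s - mean_s) ** 2 for s in s_vals)
    if sxx == 0:
        raise ValueError("S must vary across studies to estimate a slope")
    sxy = sum((s - mean_s) * (d - mean_d) for s, d in zip(s_vals, d_vals))
    slope = sxy / sxx
    return mean_d - slope * mean_s, slope


def sroc_tpr(fpr, intercept, slope):
    """Sensitivity on the fitted curve at a given false positive rate."""
    if slope == 1:
        raise ValueError("curve is undefined when slope equals 1")
    # Rearranging D = a + b*S: logit(TPR) = (a + (1 + b) * logit(FPR)) / (1 - b)
    x = (intercept + (1 + slope) * logit(fpr)) / (1 - slope)
    return 1 / (1 + math.exp(-x))


# Hypothetical (tp, fp, fn, tn) counts for four primary studies.
studies = [(40, 10, 10, 40), (45, 20, 5, 30), (30, 5, 20, 45), (35, 12, 15, 38)]
a, b = fit_sroc(studies)
for fpr in (0.1, 0.2, 0.3):
    print(f"FPR {fpr:.1f} -> sensitivity on curve {sroc_tpr(fpr, a, b):.3f}")
```

Because each study contributes a paired (sensitivity, false positive rate) point, the fitted curve reflects the trade-off between the two as the threshold varies, which separate averages of sensitivity and specificity cannot capture.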

Our findings about reporting of summary accuracy measures in meta-analyses differ from those reported previously.[3] We found a higher rate of use of summary ROC, though use of independent summaries of sensitivity, specificity and predictive values was similar. These differences may be due to differences in searching strategies (databases and time frames) and selection criteria. Our search was more recent and comprehensive, using DARE[4], which has covered seven different databases (Medline, CINAHL, BIOSIS, Allied and Alternative Medicine, ERIC, Current Contents clinical medicine and PsycLIT), and hand-searched 68 peer-reviewed journals and publications from 33 health technology assessment centres around the world since February 1994. Moreover, as we did not restrict our selection to meta-analytical reviews, we were able to examine reviews summarising accuracy results of primary studies without quantitative synthesis, which constituted 33% (30/90) of our sample. Therefore, compared to the previous publication on this topic,[3] our survey provides a broader and more up-to-date overview of the state of reporting of accuracy measures in test accuracy reviews.


The use of inappropriate accuracy measures has the potential to bias judgement about the value of tests. Of the various approaches to reporting accuracy of dichotomous test results, likelihood ratios are considered to be more clinically powerful than sensitivities or specificities.[8] Crucially, it has been empirically shown that authors of primary studies may overstate the value of tests in the absence of likelihood ratios.[1] There is also evidence that readers themselves may misinterpret test accuracy measures following publication.[9] It is conceivable that the inconsistent usage of test accuracy measures in published reviews, as found in our survey, may contribute to misinterpretation by the clinical readership. The variation in reported accuracy measures may, in part, be attributed to a lack of consensus regarding the best ways to summarise test results. It is worth noting that despite authoritative publications about appropriate summary accuracy measures in the past[5, 7, 10] (of which we cite only a few), inconsistent and inappropriate use of summary measures remained prevalent in the period 1994–2000. Our paper highlights the need for consensus to support change in this field of research.