Background

Deep vein thrombosis (DVT) is an important cause of mortality and morbidity that requires accurate diagnosis. Ultrasound (US) examination has now largely replaced contrast venography as the standard test for diagnosing clinically suspected DVT [1]. Numerous studies have compared US to contrast venography in patients with clinically suspected DVT. These were most recently summarised by Kearon in 1998 who concluded that US had a sensitivity of 97% for proximal DVT, 72% for distal DVT and a specificity of 94% [2].

Meta-analytic techniques have developed rapidly in recent years. There is increasing recognition that the results of individual studies of a diagnostic test are often subject to substantial heterogeneity and that methodological factors may influence the results of studies [3, 4]. Statistical techniques, such as meta-regression, allow researchers to explore data from systematic reviews for evidence that study-level covariates may influence diagnostic accuracy. There is also an increasing recognition that systematic reviews of diagnostic test data may be subject to publication bias, [4] although solutions to this problem, such as registries of studies, have yet to be developed.

Since US is now established as a definitive diagnostic test for DVT it is unlikely that many new studies evaluating the diagnostic accuracy of US will be forthcoming. This therefore represents an opportune time to undertake a definitive systematic review, meta-analysis and meta-regression of the diagnostic accuracy of US for clinically suspected DVT. We aimed to estimate the sensitivity and specificity of US for DVT, identify study-level covariates that are associated with variation in sensitivity and specificity, and seek evidence of publication bias in diagnostic studies of US for DVT.

Methods

We sought to identify all diagnostic cohort studies of patients with clinically suspected DVT who underwent testing with US followed by a reference standard of contrast venography. We searched Medline, EMBASE, CINAHL, Web of Science, Cochrane Database of Systematic Reviews, Cochrane Controlled Trials Register, Database of Reviews of Effectiveness, and ACP Journal Club (1966 to April 2004). The bibliographies of all articles selected for the review were scanned for potentially relevant articles that were not identified by the original search.

Two reviewers (FS and SG) screened the titles and abstracts of all articles to independently identify potentially relevant articles. Full copies of all selected articles were retrieved and reviewed by the same two reviewers, who independently selected relevant articles. At both stages of selection a Kappa score was calculated and disagreements resolved by discussion. Studies published in English, French, Spanish, Italian or German were included. Studies published in other languages were excluded. Abstracts and letters were included if they reported data in sufficient detail to allow inclusion in the analysis. If not, the authors were contacted and asked to provide details of the data or any full publications.
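The agreement statistic used at each selection stage can be illustrated with a minimal sketch. It assumes the kappa in question is Cohen's kappa for two reviewers making include/exclude decisions; the function name and the counts are hypothetical and are not taken from the review.

```python
def cohen_kappa(both_yes, a_only, b_only, both_no):
    """Cohen's kappa for two reviewers making include/exclude decisions."""
    n = both_yes + a_only + b_only + both_no
    # Proportion of articles on which the two reviewers agreed.
    observed = (both_yes + both_no) / n
    # Agreement expected by chance, from each reviewer's marginal rates.
    a_yes = (both_yes + a_only) / n
    b_yes = (both_yes + b_only) / n
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return (observed - expected) / (1 - expected)


# Hypothetical screening counts, for illustration only.
print(round(cohen_kappa(40, 5, 5, 50), 3))
```

Kappa corrects raw agreement for the agreement that would be expected by chance, which matters when most screened titles are clearly irrelevant.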

We specifically excluded case-control studies, in which US results in a group of patients with DVT were compared to a control group of patients without DVT; studies that used a reference standard other than venography; studies with fewer than ten patients; and studies of patients with suspected pulmonary embolus. Although we collected data from cohorts of asymptomatic patients and mixed cohorts (symptomatic and asymptomatic), we have only reported data here from patients with clinically suspected DVT. The role of US in asymptomatic patients has recently been systematically reviewed [5].

Two independent reviewers (ST and EvB) extracted the following data from the selected studies onto a standardised proforma: the setting for patient recruitment, any exclusion criteria, population demographics, whether recruitment was consecutive and/or data collection prospective, which US technique was used, the US operator, and the number of true positives (proximal and distal), true negatives, false positives and false negatives (proximal and distal), either as reported or calculated from the reported data. The same two reviewers also independently determined whether US was interpreted by observers blind to the venogram result, and whether venography was interpreted by observers blind to the results of US. Discrepancies were checked and resolved by an independent reviewer (FS). If it was not possible to extract the necessary data from the published report we contacted the authors for clarification. We reviewed the data reported by each study and removed studies that contained duplicated data.

Statistical analysis

Random effects models were used to estimate overall sensitivity and specificity, and a Chi-square test was used to assess heterogeneity between studies. Where zero counts occurred in a study's data, a continuity correction of 0.5 was added to every value for that study so that sensitivity and specificity could be calculated. These analyses were undertaken using MetaDiSc statistical software [6] and further details of the models fitted are given elsewhere [7]. Initially all studies were analysed together and random effects meta-regression was undertaken, separately for sensitivity and specificity, to identify potential causes of heterogeneity [8] (analysis carried out in STATA). Any covariate that showed an association with sensitivity or specificity (p < 0.1) was selected, and subgroups of studies identified by such covariates were meta-analysed separately. We decided, a priori, to undertake separate analyses of different US techniques: 1) Compression US only; 2) Colour Doppler only; 3) Continuous wave Doppler only; 4) Duplex (combined compression and colour Doppler US); 5) Triplex (combined compression, colour Doppler and continuous wave Doppler US).
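As an illustration of the continuity correction described above, the following minimal sketch computes sensitivity and specificity from one cohort's two-by-two table. The function name and the example counts are hypothetical; this is not the MetaDiSc implementation.

```python
def sens_spec(tp, fp, fn, tn):
    """Return (sensitivity, specificity) for one cohort's 2x2 table.

    If any cell is zero, 0.5 is added to every cell of that study first,
    following the continuity correction described in the text.
    """
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    sensitivity = tp / (tp + fn)   # true positives among venogram-positive
    specificity = tn / (tn + fp)   # true negatives among venogram-negative
    return sensitivity, specificity


# Hypothetical cohort with no false positives, so the correction applies.
print(sens_spec(45, 0, 5, 50))
```

Without the correction, a cohort with no false positives would have a specificity of exactly 1, and its log odds would be undefined.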

Funnel plots were used to explore for evidence of publication bias. For both sensitivity and specificity the standard error of the log odds of the parameter was plotted against the log odds [9].
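The coordinates of one funnel plot point can be sketched as follows, assuming the usual large-sample approximation for the standard error of a log odds. The function name is hypothetical. For sensitivity the two counts would be true positives and false negatives; for specificity, true negatives and false positives.

```python
import math


def log_odds_and_se(events, non_events):
    """Log odds of a proportion and its approximate standard error.

    Both counts must be positive; apply any continuity correction first.
    """
    if events <= 0 or non_events <= 0:
        raise ValueError("counts must be positive")
    log_odds = math.log(events / non_events)
    se = math.sqrt(1 / events + 1 / non_events)
    return log_odds, se


# Hypothetical cohort: 90 true positives and 10 false negatives.
print(log_odds_and_se(90, 10))
```

Smaller cohorts give larger standard errors, so in a symmetrical funnel their points should scatter evenly on both sides of the pooled log odds.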

Repeat or serial US

Repeat or serial US is often used to identify distal DVT, missed by the initial scan, that extend proximally and may thus be detected by US after an appropriate time delay (usually one week). We sought to identify studies of repeat or serial US in the main systematic review. However, we realised that we were unlikely to identify many studies that fulfilled our inclusion criteria, because of the logistic and ethical difficulties of asking patients to undergo successive US examinations followed by contrast venography. We therefore recorded separately any studies that reported use of serial or repeat US with clinical follow-up of patients, but which did not perform venography in all (or any) patients. Analysis simply consisted of recording the number of positive initial and repeat scans to estimate the yield of positive repeat scans.

Results

The flow of articles is outlined in figure 1. We scanned 3992 titles/abstracts and selected 400 potentially relevant articles for retrieval (kappa = 0.85). Review of the full articles identified 151 that met the inclusion criteria (kappa = 0.90). Review of the bibliographies of the selected articles identified six additional articles for inclusion. Six articles duplicated data published elsewhere and were excluded. We were unable to extract or analyse appropriate data from a further nine articles, despite attempts to contact the authors. Some 43 articles reported asymptomatic or mixed cohorts, so 99 articles were included in the meta-analysis. One article reported two cohorts, so the meta-analysis included a total of 100 cohorts [10–108].

Figure 1 Flow diagram of studies considered for the review

Characteristics of the included cohorts

The studies reported a total of 10323 patients, with cohorts varying in size from 11 to 847 patients (median N = 72). The studies varied in the way they reported their findings: 53 reported proximal and distal DVT separately, 19 only reported proximal DVT, three only reported distal DVT, and 25 were unclear or reported proximal and distal DVT together. DVT prevalence varied from 20% to 94% (median 48%). The proportion of proximal DVT (of all DVT detected) ranged from 48% to 100% (median 78%). The mean or median age was reported by 60 studies, and ranged from 39 to 68 (median 57). The male to female ratio was reported by 65 studies, with the proportion of males ranging from 15% to 95% (median 45%).

Cohorts were recruited from the following settings: outpatient clinic (11), inpatients (12), emergency department (4), mixed (18), and not stated (55). Recruitment was reported to be consecutive in 48 cohorts, and prospective in 67. Twelve cohorts excluded patients with previous DVT, while 45 papers did not report any exclusion criteria. The following techniques were used: 22 used compression ultrasonography alone, five used colour Doppler alone, 16 used continuous wave Doppler alone, 25 used triplex, 28 used duplex, and four used other techniques. Ultrasound was interpreted blind to the results of venography in 62 cohorts and blinding was unclear in 38. Venography was interpreted blind to the ultrasound result in 56 cohorts, was interpreted by observers aware of the ultrasound result in two, and was unclear in 42.

Results of meta-analysis

Figures 2 and 3 show the forest plots of sensitivity and specificity respectively. Point estimates of sensitivity and specificity are plotted, with 95% confidence intervals, for each cohort. Pooled sensitivity (95% CI, p-value for heterogeneity) for detecting any DVT was 89.7% (88.8 to 90.5, p < 0.001). Pooled sensitivity for detecting proximal DVT was 94.2% (93.2 to 95.0, p < 0.001) and for distal DVT was 63.5% (59.8 to 67.0, p < 0.001). Pooled specificity, calculated using data from all the studies, was 93.8% (93.1 to 94.4, p < 0.001). When restricted to the 53 studies reporting full data, specificity was 94.2% (93.4 to 95.0, p < 0.001). Great care should be taken when interpreting these estimates because of the substantial heterogeneity. It may be argued that calculating summary estimates in these circumstances is inappropriate. However, it does provide a useful baseline from which to explore heterogeneity.
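To make the ideas of a pooled estimate and a heterogeneity test concrete, the sketch below pools proportions on the log odds scale with a DerSimonian-Laird random effects estimator and reports Cochran's Q. This is an assumed, generic approach chosen for illustration; it is not claimed to reproduce the models fitted in MetaDiSc, and the function name and input data are hypothetical.

```python
import math


def pool_log_odds_random_effects(studies):
    """Pool proportions across two or more studies on the log odds scale.

    studies: list of (events, non_events) pairs, all counts positive.
    Returns (pooled proportion, Cochran's Q, between-study variance tau^2).
    """
    y = [math.log(e / ne) for e, ne in studies]   # per-study log odds
    v = [1 / e + 1 / ne for e, ne in studies]     # within-study variances
    w = [1 / vi for vi in v]                      # fixed effect weights
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed effect mean.
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(studies) - 1)) / c)
    w_re = [1 / (vi + tau2) for vi in v]          # random effects weights
    y_re = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    return 1 / (1 + math.exp(-y_re)), q, tau2


# Two hypothetical cohorts: (true positives, false negatives).
print(pool_log_odds_random_effects([(90, 10), (50, 50)]))
```

Under the null hypothesis of homogeneity, Q is approximately Chi-square distributed with one fewer degrees of freedom than the number of studies, so a large Q relative to that distribution indicates heterogeneity.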

Figure 2 Forest plot of sensitivity

Figure 3 Forest plot of specificity

Results of meta-regression

We undertook random effects meta-regression to identify possible causes for the heterogeneity. The results of meta-regression are outlined in Table 1. Using a threshold of p < 0.1 for statistical significance, interpretation by a radiologist, prevalence of DVT, the proportion of proximal DVT and date of publication were all significant predictors of sensitivity. The only significant predictor of specificity was exclusion of patients with a previous history of DVT.

Table 1 Results of meta-regression

More recently published studies, those with a higher prevalence of DVT and those with a higher proportion of proximal DVT tended to have higher sensitivity. There were 33 studies in which the operator was reported as being a radiologist. Meta-analysis showed that diagnostic performance was generally slightly worse among these studies. Overall sensitivity (95% CI) was 86.1% (83.8 to 88.3), sensitivity for proximal DVT was 94.4% (92.3 to 96.1), sensitivity for distal DVT was 62.6% (55.4 to 69.4), and specificity was 92.4% (90.9 to 93.7). Twelve cohorts reported excluding patients with previous DVT. Meta-analysis showed that specificity was higher amongst these cohorts: 97.6% (96.6 to 98.3).

Table 2 shows pooled estimates of sensitivity and specificity stratified by US technique used. Optimal sensitivity is achieved by using duplex or triplex, while optimal specificity is achieved by using compression alone.

Table 2 Pooled estimates of sensitivity and specificity stratified by US technique

Funnel plots

These are shown in figure 4 (sensitivity) and figure 5 (specificity). Both plots are asymmetrical, suggesting that smaller studies tend to report higher sensitivity and specificity. One possible explanation of this is publication bias. Smaller studies reporting lower sensitivity or specificity may be less likely to be submitted or accepted for publication.

Figure 4 Funnel plot for sensitivity

Figure 5 Funnel plot for specificity

Repeat or serial US

We did not identify any studies that compared the results of repeat or serial scanning to venography in a complete cohort of patients, so none were included in the meta-analysis. However, we did identify several studies that reported the results of repeat US: five studies used repeat scanning for unselected cohorts with suspected DVT [109–113], while four used repeat scanning for selected groups, based on the results of clinical risk scoring or D-dimer measurement [114–117]. Three studies used venography in some patients to confirm the results of positive repeat scanning [109, 112, 113]. Results from these studies are summarised in Table 3.

Table 3 Studies of repeat US

In unselected cohorts repeat scanning had a positive yield of zero to 2%. Where venography was used to confirm positive findings, the positive predictive value of ultrasound was 82 to 94%. Overall, our best estimate of the positive yield of repeat scanning in unselected patients is 35/2610 (1.34%; 95% CI 0.97 to 1.86%) with a positive predictive value of 146/164 (89.0%; 95% CI 83.3 to 92.9%).
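The proportions and intervals above can be checked with a short calculation. The source does not state which interval method was used; the sketch below assumes a Wilson score interval, which gives values close to those reported. The function name is hypothetical.

```python
import math


def wilson_interval(events, n, z=1.959964):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = events / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Counts quoted in the text: yield of repeat scans, then predictive value.
for events, n in [(35, 2610), (146, 164)]:
    lo, hi = wilson_interval(events, n)
    print(f"{events}/{n}: {100 * events / n:.2f}% "
          f"({100 * lo:.2f} to {100 * hi:.2f})")
```

The Wilson interval is asymmetric around the point estimate, which is why the reported limits for the 1.34% yield are not equidistant from it.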

When repeat scanning is restricted on the basis of clinical probability or D-dimer the results suggest a higher yield of positive scans, although none of the studies used venographic confirmation. Two studies of repeat ultrasound limited to patients with a positive D-dimer produced an overall positive scan yield of 22/606 (3.63%; 95% CI 2.42 to 5.44%) [116, 117].

Discussion

The diagnostic accuracy of US for DVT varies according to the technique used. Optimal sensitivity is achieved by using duplex (proximal sensitivity 96%, distal sensitivity 71%, specificity 94%) or triplex US (proximal sensitivity 96%, distal sensitivity 75%, specificity 94%). Optimal specificity is achieved by using compression US alone (proximal sensitivity 94%, distal sensitivity 57%, specificity 98%). These findings suggest that compression US alone is probably the appropriate technique for most patients, if scanning is aimed simply at identifying proximal DVT. Most patients have a low probability of DVT, so optimal specificity is required to avoid generating excessive false positive results. However, when evaluating patients at high risk of DVT, or if scanning aims to identify distal DVT, then duplex or triplex US will probably be the appropriate technique.

Beyond US technique we identified few study-level predictors of sensitivity or specificity. Sensitivity tended to be higher in more recent studies, probably reflecting developing technology and expertise. Sensitivity was surprisingly lower in studies where scans were interpreted by a radiologist. This may be because these studies were more likely to use techniques at an earlier stage in their development. Another cause could be that compression ultrasonography is the simplest technique, whereas Doppler and colour US techniques are more challenging and therefore more likely to result in greater reporting variability. The association between proportion of proximal DVT and sensitivity is unsurprising as US has better sensitivity for proximal DVT. The association between DVT prevalence in the study cohort and sensitivity may be explained by a similar mechanism. Selection of a cohort with a higher prevalence of DVT is likely to involve selection of cases with more easily detectable (i.e. larger and more proximal) DVT. Prevalence has been shown to be associated with variation in the performance of other diagnostic tests for DVT. Heim et al [118] showed that D-dimer has poorer accuracy in cohorts with a higher prevalence of DVT, probably due to lower specificity.

We identified no studies that reliably estimated the diagnostic accuracy of repeat scanning in comparison to contrast venography. Our best estimate of the diagnostic value of repeat scanning is that, in unselected patients with suspected DVT, it will have a positive yield of 1.3%, of whom 89% will be true positive and 11% false positive. A higher yield may be achieved by limiting repeat scanning to patients with a high clinical risk score and/or positive D-dimer. Whether these yields of positive scanning justify use of repeat scanning depends upon our estimates of the costs, benefits and risks of treating, or not treating, cases of DVT.

This study has some limitations that need to be considered. We did not search for unpublished data or studies published in languages other than English, French, Spanish, Italian or German. Studies of diagnostic tests are relatively easy to undertake, are often unfunded, and are not usually recorded on research registries. It is therefore unsurprising that systematic reviews of diagnostic test data rarely search for unpublished data [4] and that the potential effect of publication bias is unknown. Funnel plots for sensitivity and specificity were both asymmetrical. One possible explanation for this is that small studies reporting poor sensitivity or specificity may be less likely to be submitted or accepted for publication. If this is the case then the values for pooled sensitivity and specificity may represent over-estimates.

Despite undertaking meta-regression and stratifying results by US technique our findings were subject to significant unexplained heterogeneity. This heterogeneity is probably due to factors that were inadequately reported in the primary studies and therefore could not be explored in meta-regression. These factors include the characteristics of patients recruited (such as the prevalence of previous thromboembolism, obesity and co-morbidities), the training and experience of US operators, specific features of the US technique (such as the US frequency used), and any time delay between scanning and venography. These factors may have had a substantial influence upon sensitivity and specificity that will not have been identified in our analysis. Poor reporting also limited our ability to explore the effect of study design upon results. Use of blinding was often not described, studies rarely reported how uncertain or equivocal test results were handled, and the median prevalence of DVT in the cohorts (48%) suggests selective sampling of patients. These methodological weaknesses in the primary studies constitute a weakness in our meta-analysis.

The findings relating to repeat US scanning are subject to even greater limitations. Only a relatively small number of studies were identified and none compared repeat US to venography in all cases. The potential benefit of repeat US is therefore very uncertain.

A potential clue to the influence of patient characteristics upon sensitivity and specificity is provided in a study by Wells et al [119], who reported their results stratified by the patient's clinical risk score into high, intermediate or low risk. Among patients with a high Wells score, sensitivity (95% CI) was 91% (81 to 96) and specificity was 100% (77 to 100). Among patients with an intermediate Wells score, sensitivity was 61% (46 to 74) and specificity was 99% (94 to 100). Among patients with a low Wells score, sensitivity was 67% (42 to 85) and specificity was 98% (95 to 99). This suggests that US sensitivity may be dependent upon clinical probability of DVT and concurs with our finding that sensitivity was higher in cohorts with higher prevalence.

The widespread current use of US to diagnose DVT is not based upon diagnostic cohort studies alone, but also upon management studies, in which cohorts of patients with negative US results are not treated, but followed up to identify evidence of missed thromboembolism. Studies of serial US [109–112, 120], a single full-leg US [121–124], or US as part of a diagnostic algorithm [114, 116, 117, 125–129] have shown low rates of thromboembolism during three to six month follow up. This suggests that, although our meta-analysis has shown that US does not have perfect sensitivity for DVT (especially distal thrombus), this does not translate into high rates of adverse outcome. This may be because application of a reasonably sensitive test to a population with low disease prevalence will result in a high negative predictive value, or it may be because DVT that are missed by ultrasound have a relatively benign natural history.
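The dependence of negative predictive value on prevalence can be illustrated with a short calculation. The sketch uses the pooled sensitivity (89.7%) and specificity (93.8%) reported above; the 10% prevalence is a hypothetical figure chosen for illustration, while 48% is the median prevalence in the included cohorts. It assumes sensitivity and specificity stay constant across prevalence, which the meta-regression findings suggest may not hold.

```python
def negative_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a patient with a negative scan has no DVT."""
    true_negatives = specificity * (1 - prevalence)
    false_negatives = (1 - sensitivity) * prevalence
    return true_negatives / (true_negatives + false_negatives)


# 0.10 is a hypothetical low prevalence; 0.48 is the median cohort prevalence.
for prevalence in (0.10, 0.48):
    npv = negative_predictive_value(0.897, 0.938, prevalence)
    print(f"prevalence {prevalence:.0%}: NPV {npv:.1%}")
```

With the same test characteristics, the proportion of negative scans that are false negatives falls as prevalence falls, which is consistent with the first explanation offered above.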

Conclusion

US has high sensitivity for proximal DVT, modest sensitivity for distal DVT and high specificity. Optimal sensitivity, particularly for distal DVT, is achieved by using duplex or triplex US, while optimal specificity is achieved by using compression US alone. US sensitivity appears to be higher in cohorts with higher DVT prevalence. However, these findings are subject to substantial unexplained heterogeneity and should be interpreted with caution. Evaluation of repeat US has been very limited and its potential benefit is very uncertain.