Background

In the context of the human immunodeficiency virus (HIV) epidemic, clinicians frequently encounter extrapulmonary and disseminated forms of tuberculosis (TB) [1–3]. In the USA, nearly 20% of the TB cases are extrapulmonary [2]. In England and Wales, 38% of the TB cases are extrapulmonary [3]. Tuberculous pleuritis is a common manifestation of extrapulmonary TB [4]. TB is the most common cause of pleural effusion in many countries [4]. For example, studies from Spain [5], Malaysia [6], and Saudi Arabia [7] showed that TB accounted for 25%, 44%, and 37% of all effusions respectively. In the USA, the annual incidence of tuberculous pleuritis has been estimated to be about 1000 cases, and approximately one in 300 patients with TB will have tuberculous pleuritis [4, 8]. The incidence of tuberculous effusions may be higher in patients with HIV infection [9].

Conventional diagnostic tests include microscopy of the pleural fluid, culture of pleural fluid, sputum, or pleural tissue, and pleural biopsy [4]. These tests have limitations. Microscopy of the pleural fluid is rarely positive (<5%) [10–12]. Culture of pleural fluid has low sensitivity (24%–58%), and results are not available for weeks [11–13]. Biopsy of pleural tissue and culture of biopsy material are widely held to be the best methods of confirming the diagnosis [4, 10, 12]. This combination may lead to the diagnosis 86% of the time [10]. Although not perfect, culture and/or biopsy are therefore widely considered the standard of diagnosis [4, 10]. However, pleural biopsy is invasive, operator-dependent, and technically difficult (particularly in children) [14].

Because of the limitations of conventional tests, newer and rapid tests such as nucleic acid amplification (NAA) tests – including polymerase chain reaction (PCR) – have been evaluated. Because of their high sensitivity and specificity in smear-positive respiratory specimens, these tests are now used – mainly in developed countries – for the direct detection of M. tuberculosis complex in respiratory specimens [15, 16]. NAA tests are categorized as commercial or in-house ("home-brew") tests. Commercial tests include the Amplified Mycobacterium tuberculosis Direct Test® (MTD) (Gen-Probe Inc, San Diego, CA), the Amplicor® MTB tests (Roche Molecular Systems, Branchburg, NJ), and the recently discontinued LCx® test (Abbott Laboratories, Abbott Park, IL). In the USA, the Amplicor test is licensed for use in smear-positive respiratory specimens; the MTD test is approved for smear-positive as well as smear-negative respiratory specimens [16]. No commercial test is licensed for use in non-respiratory specimens. We conducted this systematic review and meta-analysis to determine the overall accuracy of NAA tests in the diagnosis of tuberculous pleuritis, and to identify factors associated with heterogeneity of results between studies.

Methods

Search strategy

We searched the following databases: PubMed (1985 – January 2003), EMBASE (1988–2002), Web of Science (1990–2002), BIOSIS (1993–2002), Cochrane Library (2002; Issue 2), and LILACS (1990–2002). The Journal of Clinical Microbiology, a high-yield journal for TB diagnostic studies, was also hand-searched separately. All searches were up to date as of August 2002. The PubMed search was updated in January 2003. The search terms used were: tuberculosis, Mycobacterium tuberculosis, nucleic acid amplification techniques, polymerase chain reaction, sensitivity and specificity, and accuracy. Experts in the field were contacted. Bibliographies from the included studies and relevant review articles were screened. We obtained lists of citations from companies that manufacture commercial tests. Although no language restrictions were imposed initially, for the full-text review and final analysis our resources only permitted review of English and Spanish articles. Conference abstracts were excluded because of the limited data presented in them.

Study selection

Our search strategy was designed to include all published studies on NAA tests for the direct detection of M. tuberculosis in pleural fluid specimens. For inclusion, a study had to:

1. report a comparison of an NAA test against a reference standard, and provide data necessary for the computation of both sensitivity and specificity;

2. include at least 10 pleural fluid specimens (since studies with very few specimens are vulnerable to selection bias [17]).

Studies on use of NAA tests on pleural biopsy and/or cytology specimens were excluded.

Two reviewers (MP and LLF) independently judged study eligibility while screening the citations. Disagreements were resolved by consensus. A list of excluded studies and a log of reasons for exclusion are available from the authors upon request.

Data extraction and quality assessment

Data extraction was performed by two reviewers. One reviewer (MP) extracted the data from all English studies. Another reviewer (LLF) extracted data from all Spanish articles. The second reviewer (LLF) also independently extracted data from a subset (36%) of the English articles, in order to determine the inter-rater agreement. The abstracted data included methodological quality, participant characteristics, test methods, and outcome data.

Quality assessment was performed using methods adapted from two guidelines on systematic reviews of diagnostic studies [17, 18]. For each study, the following quality criteria were scored as fulfilled or not: 1) Independent comparison of NAA test against reference standard; 2) Cross-sectional design (versus case-control design); 3) Blinded (single or double) interpretation of test and reference standard results; 4) Consecutive or random sampling of patients; 5) Prospective data collection; 6) Inclusion of at least 10 specimens/patients with confirmed tuberculous pleuritis. If no data on the above criteria were reported in the primary studies, we requested the information from the authors. For the purposes of analysis, responses coded as "not reported" were grouped together with "not met." A high quality study was arbitrarily defined as one that met at least 5 of the 6 criteria; a medium quality study met 3 or 4 of the 6 criteria; and a low quality study met fewer than 3 of the 6 criteria. Since discrepant analysis (where discordant results between NAA test and reference test results are resolved, post-hoc, using clinical data) may be a potential source of bias, we preferentially included unresolved data.
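The grading rule described above can be sketched in a few lines of Python. This is only an illustration: the criterion names and the function `quality_grade` are hypothetical labels introduced here, not identifiers used in the review, and "not reported" is represented by `None`.

```python
# Hypothetical labels for the six checklist criteria listed in the text.
CRITERIA = (
    "independent_comparison",
    "cross_sectional_design",
    "blinded_interpretation",
    "consecutive_or_random_sampling",
    "prospective_data_collection",
    "at_least_10_confirmed_cases",
)


def quality_grade(study):
    """Grade a study from a dict mapping criterion -> True, False, or None.

    None stands for "not reported", which is grouped with "not met",
    as described in the text. Missing keys are treated the same way.
    """
    met = sum(1 for criterion in CRITERIA if study.get(criterion) is True)
    if met >= 5:
        return "high"
    if met >= 3:
        return "medium"
    return "low"
```

For example, a study that fulfils four criteria and does not report the other two would be graded as medium quality under this rule.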

Statistical analysis and data synthesis

We used standard methods recommended for meta-analyses of diagnostic test evaluations [17–19]. Analyses were performed using Meta-Test [20], and Stata version 8 (Stata Corporation, Texas). We computed the following measures of test accuracy for each study: sensitivity [true positive rate (TPR)], specificity [1 – false positive rate (FPR)], positive likelihood ratio (LR+), negative likelihood ratio (LR-), and diagnostic odds ratio (DOR). These measures were pooled using the random effects model [17–19].
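As an illustration of how these per-study measures follow from a 2×2 table of NAA test results against the reference standard, the following Python sketch may be helpful. The function name, argument names, and the 0.5 continuity correction for zero cells are assumptions made for this example; the text does not state how zero cells were handled, and the sketch does not reproduce the Meta-Test or Stata implementations.

```python
def accuracy_measures(tp, fp, fn, tn, correction=0.5):
    """Return sensitivity, specificity, LR+, LR- and DOR from a 2x2 table.

    tp, fp, fn, tn are counts of true positives, false positives,
    false negatives and true negatives. If any cell is zero, `correction`
    is added to every cell so the ratios stay finite (an assumed,
    commonly used convention).
    """
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + correction for x in (tp, fp, fn, tn))
    sensitivity = tp / (tp + fn)          # true positive rate (TPR)
    specificity = tn / (tn + fp)          # 1 - false positive rate (FPR)
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    dor = lr_pos / lr_neg                 # equals (tp * tn) / (fp * fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
        "dor": dor,
    }
```

With hypothetical counts of 30 true positives, 5 false positives, 10 false negatives and 55 true negatives, the sketch gives a sensitivity of 0.75, an LR+ of about 9 and a DOR of about 33.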

Each study in the meta-analysis contributed a pair of numbers: TPR and FPR. Since TPR and FPR are not independent, we summarized their joint distribution by constructing a summary receiver operating characteristic (SROC) curve [21]. Unlike a traditional ROC plot used to explore the effect of varying thresholds (cut-points) on TPR and FPR in a single study, each data point in the SROC plot represents a separate study in the meta-analysis. The SROC curve (and area under the curve) represents the overall performance of the test, and depicts the trade-off between sensitivity and specificity. A symmetric curve suggests that the variability in accuracy between studies is explained, in part, by differences in thresholds employed by the studies.
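One widely used way of constructing an SROC curve is a linear regression on logit-transformed rates, often called the Moses-Littenberg approach. The text does not spell out the fitting procedure, so the Python sketch below should be read as an assumption about the general technique rather than as the authors' exact method. It fits an unweighted regression only, and the function names are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a proportion p, which must lie strictly between 0 and 1."""
    return math.log(p / (1 - p))


def fit_sroc(points):
    """Unweighted least-squares fit of D = a + b * S over (TPR, FPR) pairs.

    D = logit(TPR) - logit(FPR) is the log diagnostic odds ratio and
    S = logit(TPR) + logit(FPR) is a proxy for the diagnostic threshold.
    Returns the intercept a and slope b.
    """
    d = [logit(tpr) - logit(fpr) for tpr, fpr in points]
    s = [logit(tpr) + logit(fpr) for tpr, fpr in points]
    n = len(points)
    s_mean = sum(s) / n
    d_mean = sum(d) / n
    sxx = sum((x - s_mean) ** 2 for x in s)
    sxy = sum((x - s_mean) * (y - d_mean) for x, y in zip(s, d))
    b = sxy / sxx
    a = d_mean - b * s_mean
    return a, b


def sroc_tpr(fpr, a, b):
    """Expected TPR at a given FPR on the fitted curve.

    Rearranging D = a + b * S gives
    logit(TPR) = (a + (1 + b) * logit(FPR)) / (1 - b).
    """
    x = (a + (1 + b) * logit(fpr)) / (1 - b)
    return 1 / (1 + math.exp(-x))
```

In this formulation a slope b close to zero corresponds to a symmetric curve, that is, a diagnostic odds ratio that is roughly constant across thresholds.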

Heterogeneity in meta-analyses refers to the degree of variability in results across studies. We used the Chi-square and Fisher's exact tests to detect statistically significant heterogeneity. Stratified (subgroup) analyses were used to identify study design and test-related factors responsible for heterogeneity in test accuracy. Studies using commercial tests were analyzed separately from those using in-house tests. Studies with commercial tests were further stratified by type of test (brand). Finally, since publication bias is of concern for meta-analyses of diagnostic studies [22], we tested for the potential presence of this bias using funnel plots and the Egger test [23].
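The Egger test is usually described as a regression of each study's standardized effect (here, log DOR divided by its standard error) on its precision (the reciprocal of the standard error); an intercept that differs from zero suggests funnel plot asymmetry. The following Python sketch illustrates that idea under these assumptions. The function name and inputs are hypothetical, and the sketch stops short of computing a p-value, which would require comparing intercept / se_intercept with a t distribution on n − 2 degrees of freedom.

```python
import math


def egger_regression(log_dors, std_errs):
    """Regress standardized effects on precision (at least 3 studies).

    log_dors: log diagnostic odds ratio of each study.
    std_errs: standard error of each log DOR.
    Returns the intercept, slope, and standard error of the intercept.
    """
    y = [effect / se for effect, se in zip(log_dors, std_errs)]
    x = [1 / se for se in std_errs]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual variance with n - 2 degrees of freedom.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    resid_var = rss / (n - 2)
    se_intercept = math.sqrt(resid_var * (1 / n + x_mean ** 2 / sxx))
    return {"intercept": intercept, "slope": slope, "se_intercept": se_intercept}
```

If every study estimated the same log DOR regardless of its size, the intercept would be zero and the slope would equal that common log DOR.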

Results

Description of included studies

Figure 1 outlines our study selection process. Thirty-eight articles [24–61] were included in the meta-analysis. Four articles were in Spanish [26, 37, 39, 44]. Two articles reported evaluations of more than one NAA test against a common reference standard [30, 57]. Each such test comparison was counted as a separate study. Thus, the total number of test comparisons (hereafter referred to as studies) was 40. Of these, 14 (35%) were studies of commercial tests, and 26 (65%) were of in-house tests. The average (median) sample size of each study in the meta-analysis was 60 pleural specimens or subjects, with a range of 15 to 375.

Figure 1

Study selection process

Study characteristics and quality

The mean inter-rater agreement between the two reviewers for items in the quality checklist was 0.86. Our initial data were affected by incomplete reporting in the primary studies. We contacted the authors via email, and obtained additional data for 25/40 (63%) included studies. Tables 1 and 2 present the descriptive data from studies with commercial and in-house tests, respectively. The tables present data on study quality, along with sample size, sensitivity and specificity estimates. In the commercial tests subgroup (Table 1), 36%, 50%, and 14% of the studies were of high, medium and low quality, respectively. In the in-house tests subgroup (Table 2), 39%, 42%, and 19% of the studies were of high, medium and low quality, respectively. Tables 1 and 2 show the variability in study quality and variability in NAA protocols employed. Although a variety of reference tests (or combinations of reference tests) were used, culture alone, or a combination of culture plus biopsy/microscopy/clinical data were used in 39 of the 40 studies.

Table 1 Description of included studies – commercial tests
Table 2 Description of included studies – in-house PCR tests

Diagnostic accuracy of commercial tests

Figure 2(A) displays the sensitivity and specificity estimates from each of the 14 studies using commercial tests, stratified by type (brand) of test. Almost all studies had specificity estimates close to 1.0. In contrast, sensitivity estimates were lower and heterogeneous (range 0.20–1.0). Figure 3(A) shows the SROC curve for the commercial tests. The regression line does not trace out a typical ROC curve – the curve shows no trade-off between sensitivity and specificity. Table 3 presents the results of the meta-analysis. The summary measures of specificity and LR+ were very high and homogeneous. All other measures were highly heterogeneous.

Figure 2

Forest plots of estimates of sensitivity and specificity in studies with commercial and in-house tests. The point estimates of sensitivity and specificity from each study are shown as solid circles. Error bars are 95% confidence intervals. Numbers indicate the studies cited in the bibliography. Pooled estimates are summary random effects estimates with 95% confidence intervals.

Figure 3

Summary Receiver Operating Characteristic curves for commercial and in-house tests. Each solid circle represents one study in the meta-analysis. The size of each study is indicated by the size of the solid circle. The weighted (dark line) and unweighted (thin line) regression SROC curves summarize the overall diagnostic accuracy.

Table 3 Summary accuracy measures for commercial and in-house tests

Commercial tests use different amplification methods and target nucleic acid sequences: the Amplicor test employs PCR technology to amplify the 16S rRNA target; the LCx test employs ligase chain reaction to amplify the 38 kDa target; and the MTD test utilizes transcription-mediated amplification to amplify rRNA. To account for these differences, we further stratified commercial tests by type (brand) of test. Four of the studies evaluated the Amplicor test [28, 42, 54, 55], four the LCx test [37, 44, 46, 52], and six evaluated the MTD test [26, 33–35, 49, 61]. Stratification reduced the overall heterogeneity to some extent (Table 3 and Figure 2A). Although specificity did not vary by type of kit, sensitivity did – the Amplicor test had a lower sensitivity (0.37) than the LCx (0.72) and MTD (0.77) tests. Since 5/6 studies on the MTD test evaluated the first generation MTD-1 kit, there were insufficient numbers of MTD-2 studies to determine if the enhanced MTD-2 test (licensed for use in smear-negative respiratory specimens) performed better than the MTD-1 test.

Diagnostic accuracy of in-house tests

Figure 2(B) displays the sensitivity and specificity estimates from each of the 26 studies using in-house tests. Both sensitivity (range 0.20–1.0) and specificity (range 0.53–1.0) estimates were highly variable. All summary measures were grossly heterogeneous (Table 3) and therefore could not be appropriately summarized. The SROC curve [Figure 3(B)] displays a ROC-type trade-off between sensitivity and specificity.

We performed stratified analyses to identify sources of heterogeneity among in-house tests. Table 4 presents two factors that appeared most strongly associated with the observed heterogeneity. Studies that employed a case-control design produced diagnostic odds ratio estimates nearly 2.4 times higher than studies that employed a cross-sectional design. Studies with PCR tests that used the IS6110 target sequence produced diagnostic odds ratio estimates 2.5 times higher than studies that used PCR tests with other targets. However, even after stratifying on study design and target sequence, considerable unexplained heterogeneity persisted in all the summary measures. The shape of the SROC curve suggested that variability in diagnostic thresholds (cut-points) across studies could partly explain the heterogeneity.

Table 4 Stratified analyses for the evaluation of heterogeneity in studies with in-house tests

Publication bias

In the subgroup with commercial tests, the Egger test was not statistically significant (p = 0.55). However, in the in-house tests subgroup the Egger test was significant (p = 0.002), with an asymmetric funnel plot (Figure 4) – evidence in favour of potential publication bias.

Figure 4

Funnel plot for evaluation of publication bias in studies with in-house tests. The funnel graph plots the log of the diagnostic odds ratio (DOR) against the standard error of the log of the DOR (an indicator of sample size). Each open circle represents one study in the meta-analysis. The line in the center indicates the summary DOR. In the absence of publication bias, the DOR estimates from smaller studies are expected to be scattered above and below the summary estimate, producing a triangular or funnel shape. The funnel plot appears asymmetric – smaller studies with low DOR estimates are missing – indicating a potential for publication bias. The Egger test for publication bias was statistically significant (p = 0.002) in the in-house test subgroup.

Discussion

Since conventional tests are not always helpful in establishing a diagnosis of tuberculous pleuritis, several rapid tests and biomarkers have been evaluated: adenosine deaminase (ADA) [12, 14, 45, 51, 59, 62], interferon-γ (IFN-γ) [59, 60, 62, 63], lysozyme [62], soluble interleukin 2 receptors [63], and NAA tests [24–61]. There has been an explosion of studies evaluating these rapid tests, and systematic reviews and meta-analyses are necessary to synthesize this growing body of literature. A recent meta-analysis summarized the evidence on ADA and IFN-γ for the diagnosis of tuberculous pleuritis [64]. Both ADA and IFN-γ tests were found to be reasonably accurate at detecting tuberculous pleuritis. Our meta-analysis summarizes the evidence on accuracy of NAA tests in the diagnosis of tuberculous pleuritis.

Principal findings

The role of NAA tests has been reasonably well defined in pulmonary tuberculosis [15, 16, 65], and guidelines exist for testing of respiratory specimens [16]. In contrast, their role in the evaluation of specimens such as pleural fluid is not clear. Our results indicate that commercial NAA tests have high specificity and positive likelihood ratios. These test properties suggest a potential role for commercial tests in confirming (ruling in) the diagnosis of tuberculous pleuritis. These tests, however, have low and widely varying sensitivities – test properties that make them unhelpful in ruling out TB. Potential explanations for the low sensitivity include a low bacillary load in pleural fluid, or the presence of substances in the pleural fluid that inhibit amplification [65]. Some authors have suggested that pleural fluids should be tested with NAA methods after the specimens are adequately pre-treated to remove inhibitors [65]. All commercial kits appear to be designed to maximize only specificity. The MTD and LCx kits appear to have higher sensitivity than the Amplicor test. This comparison should be interpreted cautiously because it is based on few studies. Studies that directly compare these commercial tests (head-to-head) within the same study population are required to confirm these observations. The most important finding regarding in-house PCR is the significant heterogeneity across studies.

Clinical implications

To interpret the summary measures in a clinical context, consider a patient from a high incidence setting (e.g. countries such as Spain or Malaysia) who is estimated to have a 50% probability of pleural TB after clinical evaluation, and is evaluated with either the MTD test (LR+ of 17.4 and LR- of 0.31) or the Amplicor test (LR+ of 52.8 and LR- of 0.59). An LR+ of 17.4 for the MTD test suggests that patients with tuberculous pleuritis have a 17-fold higher chance of being MTD test positive as compared to patients without TB. If the MTD test were positive, the likelihood that this patient has TB increases from 50% to 95%, a probability that is sufficiently high to justify initiation of anti-tuberculosis treatment. A positive Amplicor test will raise the probability of TB from 50% to 97%. In contrast, if the MTD test result were negative, there is still a 24% chance that this patient has TB, probably not sufficiently low to rule out TB with confidence. In the case of the Amplicor test, a negative test will reduce the probability from 50% to 40%, again not low enough to rule out TB.

Consider another patient from a low incidence setting (e.g. countries such as the USA), where the baseline probability of TB is low (e.g. 5%). If the MTD test were positive, the likelihood that this patient has TB increases from 5% to 48%, a probability that justifies further investigation. A positive Amplicor test will raise the probability of TB from 5% to 75%. If the MTD test result were negative, the baseline probability changes from 5% to 2%, a negligible shift that is unlikely to be helpful in clinical decision-making. In the case of the Amplicor test, the probability changes from 5% to 4%. These examples illustrate the impact of the baseline prevalence (pre-test probability) on predictive values of the tests.
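The arithmetic behind these two scenarios is the odds form of Bayes' theorem: pre-test odds multiplied by the likelihood ratio give post-test odds. A minimal Python sketch is shown below. The function name is hypothetical, and results computed this way for other likelihood ratios may differ by a few percentage points from figures quoted in the text, for example because of rounding of the likelihood ratios.

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability.

    pre_test_prob must lie strictly between 0 and 1; likelihood_ratio is
    LR+ for a positive test result or LR- for a negative one.
    """
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# MTD test in the high incidence scenario (pre-test probability 50%).
mtd_high_positive = post_test_probability(0.50, 17.4)
mtd_high_negative = post_test_probability(0.50, 0.31)

# MTD test in the low incidence scenario (pre-test probability 5%).
mtd_low_positive = post_test_probability(0.05, 17.4)
mtd_low_negative = post_test_probability(0.05, 0.31)
```

Rounded to whole percentages, the four MTD values are 95%, 24%, 48% and 2%, matching the figures given for the MTD test in the two scenarios.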

The accuracy of in-house PCR was heterogeneous across studies, and thus meaningful summary measures of accuracy could not be determined. The clinical implications, therefore, will depend on the setting. Institutions that use in-house PCR will have to rely on local data to decide on its accuracy and clinical applicability. In general, PCR for tuberculosis is known to have poor inter-laboratory reproducibility [66].

In addition to the effect of diagnostic thresholds seen in the SROC plot, we identified two factors that were associated with heterogeneity among in-house tests: use of a case-control design, and use of the IS6110 target sequence. Case-control studies sample patients from the extreme ends of the clinical spectrum (an ideal, "extreme contrast" setting). If the sensitivity of a test is evaluated in seriously diseased subjects, and specificity in healthy individuals, both measures will overestimate the true diagnostic accuracy [67]. Empiric research suggests that case-control studies overestimate the diagnostic odds ratio by a factor of 3, when compared to cross-sectional studies [68]. Future studies of NAA tests could reduce this bias by avoiding the case-control design and recruiting consecutive series of patients in whom the test is clinically indicated (a realistic, "clinical practice" setting). The IS6110 target sequence is widely used in M. tuberculosis fingerprinting [69]. Because this target is specific to the M. tuberculosis complex, and because it is usually present as multiple copies in the genome, PCR tests using this target might be more sensitive. Further research is underway to confirm this finding, in a larger meta-analysis of in-house PCR in the diagnosis of pulmonary tuberculosis.

Previous meta-analyses of NAA test accuracy

Our data are consistent with the results of two previous meta-analyses on the accuracy of NAA tests. Sarmiento and colleagues summarized the accuracy of PCR in the diagnosis of smear-negative pulmonary TB [70]. Their meta-analysis of 50 studies showed that both sensitivity and specificity estimates were heterogeneous. They concluded that PCR is not consistently accurate enough to be routinely recommended for the diagnosis of smear-negative TB. Our previous meta-analysis of 49 studies summarized the accuracy of NAA tests in the diagnosis of tuberculous meningitis [71]. Commercial tests were found to have high overall specificity (0.98) and low sensitivity (0.56). The accuracy of in-house PCR was not determined because of heterogeneity in study results.

Limitations of the review

Our review has limitations. Our analysis lacks data on the incremental gain of NAA tests over and above the diagnostic performance achieved by using only conventional methods or other rapid tests like ADA and IFN-γ. The primary studies in our review did not report such data. Also, few studies in our review directly compared the NAA test against tests such as ADA and IFN-γ [45, 51, 59]. Only one study [59] directly compared the three tests in the same population, and showed that ADA, IFN-γ and PCR were 88%, 86%, and 74% sensitive respectively, and 86%, 97%, and 90% specific respectively, for culture or biopsy-confirmed pleural TB. Since we did not include tests such as ADA and IFN-γ in our literature searches, our review cannot identify the most accurate test. Also, publication bias was a concern with the in-house tests. Exclusion of studies published in languages other than English and Spanish could have contributed to this potential bias.

Conclusions

In summary, our data suggest a potentially useful role for commercial NAA tests in confirming a diagnosis of tuberculous pleuritis. However, commercial kits have low and varying sensitivities, and therefore should not be used for excluding a diagnosis of tuberculous pleuritis. NAA test results, therefore, cannot replace conventional tests; they need to be interpreted in parallel with clinical findings and results of conventional tests. The accuracy of in-house PCR tests is poorly defined because of heterogeneity in study results. Clinically useful summary measures cannot be estimated for in-house PCR tests; their clinical applicability remains unclear.