Background

Testing for the presence of micro-organisms in the urinary tract, in order to diagnose asymptomatic bacteriuria or symptomatic urinary tract infections (UTI), is very common at all levels of health care. UTI are a common cause of fever in young children, often accompanied by subtle and non-specific clinical findings [1]. In a small percentage of children this may lead to kidney scarring, and at a later age to hypertension, and even renal failure [2]. In general practice, 2–3% of all consultations, and even 6% in the case of women, are due to symptoms suggesting UTI [3]. The prevalence of asymptomatic bacteriuria is 4–7% in pregnancy, when it can progress to symptomatic UTI, postpartum UTI or pyelonephritis [4, 5]. Untreated bacteriuria during pregnancy has been shown to be associated with low birth-weight and premature delivery [6]. Bacteriuria is more common with increasing age. Elderly non-institutionalised women and men show a prevalence rate of 6 – 30% and 11–13%, respectively, while in institutionalised elderly people the prevalence ranges from 25 to 50% [7].

Many tests are available for the diagnosis of bacteriuria or UTI. A (semi-) quantitative culture of a urine specimen is the only method that can provide detailed documentation of a bacterial urine infection. However, making a culture is costly, and takes at least 24 hours. An ideal test requires only limited technical expertise, is cheap and has a high accuracy, enabling a quick diagnosis in high-risk patients [2, 8]. One example is the dipstick test, where only nitrites and leukocyte esterase – and not proteins and blood – show fair accuracy, compared with a quantitative culture [9].

In the past 25 years, many studies have evaluated the accuracy of dipstick tests as rapid detectors of bacteriuria and UTI in different populations and age groups. Several narrative reviews have been written [6, 9–12], and two meta-analyses [1, 13] have been performed. The meta-analysis by Hurlbut and Littenberg [13] did not report on sources of heterogeneity. The most recent meta-analysis [1], of 26 studies in children, showed major heterogeneity of diagnostic accuracy across studies, which could not be fully explained by differences in age or by differences in the definition of the criterion standard. The lack of an adequate explanation for the heterogeneity of the dipstick accuracy stimulates an ongoing debate. Many elements and differences in the process of urine-collection and analysis, and in the selection of patients, may influence the presence of micro-organisms which can be detected by the dipstick, as well as the presence of substances that may give false results [10, 14–16]. The methodological quality of the studies might also be an important determinant of the reported accuracy [17].

The objective of the present meta-analysis was to summarise the available evidence on the diagnostic accuracy of the urine dipstick test, taking into account various pre-defined potential sources of heterogeneity.

Methods

Literature search

Standardised searches were conducted in 1998 and 1999 in computerised databases (Medline and Embase), by reference tracking [18] and through personal contacts with experts in the field of research. In January 2000, the search was extended and updated by conducting an on-line Medline search at the PubMed website http://www.ncbi.nlm.nih.gov/pubmed (Table 5 [see Additional file 1]).

Two reviewers (WLJMD, JCY) selected the studies. The following inclusion criteria were applied: publications should concern the diagnosis of bacteriuria or urinary tract infections, investigate the use of dipstick tests for nitrites and/or leukocyte esterase, and present empirical data. Excluded were studies which focused only on sexually transmitted diseases, urethritis or schistosomiasis, studies with no accepted criterion standard (at least semi-quantitative or quantitative urine culture), studies which did not provide sufficient data for the reconstruction of a diagnostic two-by-two table, and studies which based test positivity on the combination of various other tests jointly with tests for nitrites and/or leukocyte esterase. Studies carried out before 1990 and studies in animals were also excluded. There were no language restrictions. When consensus was not reached, a third reviewer (NPvD) was consulted to resolve disagreements.

Quality and applicability of studies

The checklist of the Cochrane Methods Working Group on Meta-analysis of Diagnostic and Screening Tests was used to assess the methodological quality of the selected studies [19] (available on request). Three reviewers (JCY, NPvD, WLJMD) independently assessed all selected publications. Disagreements were resolved in consensus meetings.

Internal validity criteria (IV) were scored as 'positive' (adequate methods), 'negative' (inadequate methods, potential bias) or 'no information'. External validity criteria (EV) were scored positive if sufficient information was provided to assess the generalisability of the findings. Sub-totals were calculated separately for internal validity (maximum 8) and external validity (maximum 15 for nitrites or 16 for leukocyte-esterase), and percentages of the maximum possible score were calculated. Estimates are presented with 95% confidence intervals (95% CI).

Potential sources of heterogeneity

For each publication detailed information was abstracted on: the colony count used to define UTI (cut-off used for the criterion standard), exclusion criteria, setting, level of care, symptomatic or asymptomatic bacteriuria, population sampled (children, general population, pregnant women, etc), age of the study population, urine-collection procedures, whether only first voided urine was collected, micro-organisms, procedures followed when urine was contaminated, duration of transport of the urine sample to the laboratory for culture, visual or automatic reading, and person who was reading the dipstick. In addition, information was collected on the year of study, disease prevalence at the setting, sample-size, country in which the study was performed, brand of dipstick and language of publication.

Meta-analysis

Data on sensitivity and specificity were derived from the original publications. If absolute data were not presented, published sensitivity and specificity data were used to reconstruct two-by-two tables. Sensitivity and specificity were pooled after natural logarithmic transformation. The average predictive values were calculated on the basis of geometric means of sensitivity and specificity using the weighted mean prevalence in the sub-group of studies at issue. The diagnostic odds ratio (DOR) of each individual study was calculated according to the following formula [20, 21]:

DOR = [sensitivity / (1 – sensitivity)] / [(1 – specificity) / specificity]

The DOR represents the ratio of the odds of a positive test result in the diseased group to the odds of a positive test result in the non-diseased group. A DOR of 1 means that the test has no discriminative power. When the DOR is more than one, the odds of a positive test result are higher in the diseased population. Pooling of the DOR was also performed after natural logarithmic transformation [ln(DOR)].
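As a minimal sketch of this calculation, the following Python snippet derives the DOR from a two-by-two table and pools several studies as the unweighted mean of ln(DOR). The function names, the example counts and the 0.5 continuity correction for empty cells are illustrative assumptions; they are not taken from the publications analysed or from the software used in this study.

```python
import math

def diagnostic_odds_ratio(tp, fp, fn, tn):
    """Odds of a positive test in the diseased group (TP/FN) divided by
    the odds of a positive test in the non-diseased group (FP/TN)."""
    # Assumption: add 0.5 to every cell when any cell is zero (a common
    # continuity correction); the source does not say how zeros were handled.
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    return (tp / fn) / (fp / tn)

def pooled_dor_unweighted(tables):
    """Unweighted pooling: average ln(DOR) across studies, then back-transform."""
    logs = [math.log(diagnostic_odds_ratio(*table)) for table in tables]
    return math.exp(sum(logs) / len(logs))

# Hypothetical two-by-two tables as (TP, FP, FN, TN); not real study data.
example_tables = [(45, 10, 55, 190), (30, 5, 30, 95)]
example_pooled = pooled_dor_unweighted(example_tables)
```

Back-transforming the mean of ln(DOR) yields the geometric mean of the individual DORs, so a single extreme study has less influence than it would on an arithmetic mean.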

The statistical heterogeneity of sensitivity, specificity and the ln(DOR) across studies was tested by a χ2 test of independency with k-1 degrees of freedom (k = number of studies) [22]. As the validity of weighting by the inverse of the variance of the DOR is still under debate for meta-analysis of diagnostic studies [23], only the results of fixed unweighted pooling are presented. Outliers were detected by means of the Galbraith plot [24]. If a factor was significantly associated with outlying results (according to logistic regression), all studies with that factor were excluded from further analysis.
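The heterogeneity test and the outlier screening can be sketched as follows. This is an illustration under stated assumptions, not the procedure implemented in the software used here: Q is computed as Cochran's Q with inverse-variance weights, the standard error of ln(DOR) is taken as the square root of the sum of the reciprocal cell counts, and outliers are studies lying more than 2 standardised units from the pooled line of the Galbraith plot. All function names are hypothetical.

```python
import math

def ln_dor_and_se(tp, fp, fn, tn):
    """ln(DOR) and its approximate standard error from a two-by-two table."""
    ln_dor = math.log((tp * tn) / (fp * fn))
    se = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    return ln_dor, se

def cochran_q(estimates, std_errors):
    """Return (Q, degrees of freedom, inverse-variance pooled estimate).
    Under homogeneity Q follows a chi-square distribution with k-1 df."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    return q, len(estimates) - 1, pooled

def galbraith_outliers(estimates, std_errors, pooled):
    """Indices of studies outside the +/-2 band of a Galbraith plot.
    The plot shows estimate/se against 1/se; the vertical distance of a
    study from the line with slope `pooled` is (estimate - pooled) / se."""
    return [i for i, (e, se) in enumerate(zip(estimates, std_errors))
            if abs((e - pooled) / se) > 2]
```

Q is then compared with the chi-square distribution with k-1 degrees of freedom; a value far above k-1 points to heterogeneity between studies.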

In case of negatively associated pairs of sensitivity and specificity, and a homogeneous ln(DOR), a regression line was fitted as a Summary ROC curve (SROC) [20, 25] in a scatter plot of the various studies included, with their sensitivity on the y-axis and (1 – specificity) on the x-axis. If sensitivity and specificity are negatively associated, it may be assumed that they represent a single DOR and that any variation between the pairs is caused by the use of different cut-off points for the test across studies. Dependency of the ln(DOR) on the cut-off point (S) can be tested using meta-regression analysis:

ln(DOR) = α + βS
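A minimal sketch of this regression is given below. It assumes the commonly used Moses-Littenberg formulation, in which ln(DOR) = logit(sensitivity) – logit(1 – specificity) and S = logit(sensitivity) + logit(1 – specificity) serves as a proxy for the cut-off point; the text above describes S only as the cut-off point, so this definition is an assumption. The fit is by unweighted ordinary least squares, and the function names and input pairs are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sroc_regression(pairs):
    """Fit ln(DOR) = alpha + beta * S by unweighted least squares.
    `pairs` is a list of (sensitivity, specificity) tuples."""
    d = [logit(se) - logit(1 - sp) for se, sp in pairs]  # ln(DOR) per study
    s = [logit(se) + logit(1 - sp) for se, sp in pairs]  # cut-off proxy (assumed)
    n = len(pairs)
    mean_s, mean_d = sum(s) / n, sum(d) / n
    sxx = sum((x - mean_s) ** 2 for x in s)
    sxy = sum((x - mean_s) * (y - mean_d) for x, y in zip(s, d))
    beta = sxy / sxx
    alpha = mean_d - beta * mean_s
    return alpha, beta

# Hypothetical studies sharing one DOR of 9 but using different cut-off points.
alpha, beta = sroc_regression([(0.90, 0.50), (0.75, 0.75), (0.50, 0.90)])
```

In this constructed example β is approximately 0 and α approximately ln(9): the studies differ only in cut-off point and share a single DOR, which is the situation in which a summary ROC curve is appropriate.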

If pairs of sensitivity and specificity still showed weak negative or no association, and if sensitivity or specificity was heterogeneous, sub-group analyses of the ln(DOR) were performed by means of ANOVA. All individual validity criteria, and all pre-defined potential sources of heterogeneity mentioned above, were used for sub-group analyses. Association with continuous variables was tested in univariate meta-regression analysis of the ln(DOR).

After sub-group analyses, all sources of heterogeneity associated with ln(DOR) up to p = 0.25 were selected for a multiple meta-regression analysis, to study the presence of independent factors associated with the ln(DOR).

Analyses were performed with SPSS 7.5 for Windows 95 and with Meta-test [26]. For a more detailed description of the model used in this analysis, reference is made to Midgette et al. [27] and Devillé et al. [28].

The accuracy of the dipstick test for nitrite and leukocyte-esterase was studied both separately and in combination: positive results for either nitrites or leukocyte-esterase or for both.

Results

Literature search

The search strategy identified 220 publications, of which 70 [29–98] met the inclusion and exclusion criteria. Five selected publications [94–98] were only detected by the search in EMBASE (n = 1), by reference tracking (n = 1) or personal contacts (n = 3). See Table 6 [see Additional file 2] for the main characteristics of the publications included. A total of 150 publications were excluded from the meta-analysis for the following reasons: they did not report on the accuracy of the urine dipstick test (nitrites and/or leukocyte esterase) for the diagnosis of UTI or bacteriuria (n = 99), they were reviews (n = 22), they did not use culture as a criterion standard (n = 6), they did not base test positivity only on nitrites and/or leukocyte esterase (n = 6), or they did not include sufficient data to calculate the diagnostic two-by-two table (n = 17). The 70 selected publications represent studies from 18 countries in five continents, published in seven languages. Two selected publications present the results of two different studies. Therefore, 72 different studies were included, 17 of which studied nitrites only, and 2 studied leukocyte-esterase only. The other studies evaluated different combinations of both.

Quality and applicability of studies

The mean score for internal validity was 72% (95% CI 69 to 75). Nine publications used the culture on dipslide as a criterion standard. Nine publications (13%) concerned double-blind studies; only two were clearly hampered by verification bias. In 65% of the studies the dipstick test was evaluated with the help of clinical information.

The mean score for external validity was 69% (95% CI 65 to 73). Some outpatient departments provided care at primary level, resulting in 15 primary care studies (21%). 17 studies (24%) did not provide details about the general population studied. Sixty percent of the studies did not mention any exclusion criteria; 20% gave no information on the way in which urine was collected, and 86% did not state whether first-voided urine was collected. Information on mixed or contaminated cultures was not available in over 50% of the studies (details are available on request).

Meta-analysis

Nitrites (n = 58)

Sensitivity and specificity were poorly correlated (Spearman ρ = -0.377) and highly heterogeneous (Q = 776 and 9609, respectively, df 57). So was the ln(DOR) (Q = 145, df 57). On the Galbraith plot, 22 studies were outside the 95% bounds (+/-2Z) from the standardised mean ln(DOR). Univariate logistic regression revealed an association of outliers with lower categories of internal and external validity (internal ≤ 50%: OR = 15.9, 95% CI 1.1 to 233.2, external ≤ 75%: OR = 4.2, 95% CI 1.2 to 15.1). Therefore, studies in the lowest categories of internal or external validity (≤50%) were excluded from further sub-group analysis and meta-regression (n = 12, references: [36, 37, 44, 62, 71, 72, 77, 78, 82, 83, 88, 89]).

The ln(DOR) remained heterogeneous (Q = 125, df 45). Univariate sub-group analyses revealed statistically significant differences in the ln(DOR) between several sub-categories of internal validity (blinding and prospective versus retrospective data collection) and external validity (types of patient population and care setting) (Table 1).

Table 1 Results of subgroup analyses and accuracy of nitrites in urine dipsticks for the diagnosis of urinary tract infections or bacteriuria according to several predefined study characteristics (subgroups of studies without information about the study characteristic at issue are not shown) (no. of studies: 46)

The ln(DOR) was also univariately associated with the cut-off point of the dipstick used in the evaluations (β = -0.439, 95% CI -0.606 to -0.272), pre-test probability (β = -4.54, 95% CI -6.499 to -2.082) and year of publication (β = -0.197, 95% CI -0.197 to -0.013).

Further analysis within sub-groups showed the following results:

  • blinding: only in double-blind studies were sensitivity and specificity found to be highly negatively correlated (ρ = -0.647) with a homogeneous ln(DOR). In unblinded studies the ln(DOR) was associated with the cut-off point for a positive result of the dipstick, and in single blind studies it was associated with both the cut-off point and the general population;

  • patient populations: sensitivity and specificity were highly negatively correlated in studies involving general populations (ρ = -0.539), pregnant women (ρ = -0.559) and surgery patients (ρ = -1.00), resulting in a homogeneous ln(DOR). In multiple meta-regression, the ln(DOR) for studies in general populations was associated with the cut-off point of the dipstick, supra-pubic urine-collection and automatic or visual reading. For studies in pregnant women it was associated with the presence of clinical information, and for studies in children it was associated with the cut-off point of the dipstick only;

  • care setting: strong negative correlations existed between sensitivity and specificity in family practices (ρ = -0.714) and emergency departments (ρ = -0.400) with a homogeneous ln(DOR) in both sub-groups. In multivariate meta-regression analysis, the ln(DOR) was associated in family practices with the pre-test probability; in outpatient departments it was associated with the cut-off point of the dipstick, and in inpatient departments with the cut-off point, pre-test probability, automatic or visual reading, and the presence of clinical information.

Multiple meta-regression analysis of all studies revealed an independent association of the ln(DOR) with the cut-off point of the dipstick (β = -0.348, 95% CI -0.505 to -0.192), studies executed in pregnant women (β = 1.082, 95% CI 0.178 to 1.985), in general populations (β = -0.772, 95% CI -1.601 to 0.057) or in elderly people (β = 1.457, 95% CI 0.022 to 2.882) (adjusted R2 regression model: 0.55).

For details on sensitivity, specificity, odds ratios and predictive values, see Table 1. Post-test probabilities at different pre-test probabilities for different patient populations and care settings are shown in Table 4 and Figures 1 and 2.

Table 4 Post-test probabilities (predictive values) of dipstick nitrites, leucocyte-esterase and combinations of both tests in population sub-groups and different settings, based on pooled sensitivities, pooled specificities and pooled pre-test probabilities (prevalences).

Figure 1 Predictive value (post-test probability) of positive and negative test results of, respectively, nitrites, leucocyte-esterase (only non-urological patients) and the combination of both tests with at least one positive, for the diagnosis of bacteriuria or UTI in different settings (for sensitivity and specificity values see Tables 1 to 3).

Figure 2 Predictive value (post-test probability) of positive and negative test results of, respectively, nitrites, the combination of both tests with at least one positive and the combination of both tests with both positive, for the diagnosis of bacteriuria or UTI in different populations (for sensitivity and specificity values see Tables 1 to 3).

Leukocyte-esterase (n = 42)

On the Galbraith plot, 10 studies were outside the 95% bounds (+/-2Z) from the standardised mean ln(DOR). Univariate logistic regression revealed an association of outlier studies with lower categories of external validity (external ≤ 50%: OR = 32, 95% CI 2.3 to 447). Studies in the lowest category (≤50%) of internal validity (n = 1, reference: [81]) and external validity (n = 6, references: [37, 62, 71, 80, 83, 89]) were excluded from the analysis. Sensitivity and specificity were correlated after exclusion (Spearman ρ = -0.635), but remained heterogeneous (Q = 368 and 1799, df 34), as did the ln(DOR) (Q = 64, df 34).

Univariate sub-group analyses showed statistically significant differences in the ln(DOR) between sub-categories of external validity (Table 2): disease (UTI versus bacteriuria), type of patient population, care setting, method of urine-collection, reported exclusion criteria, and brand of dipstick. The ln(DOR) was not associated with the cut-off point for a positive leucocyte-esterase test. Further analysis of sub-groups showed that sensitivity and specificity were strongly negatively correlated in the non-urology studies (ρ = -0.798), as well as in the two urology studies (ρ = -1.00) resulting in a homogeneous ln(DOR) in the non-urology sub-group. Multiple meta-regression analysis in the non-urology studies showed an association of the ln(DOR) with the cut-off point of the dipstick, the disease and the family physician reading the test, but not with setting of care. At this level an interaction existed between disease and family physician reading the test (adjusted R2 regression model: 0.42). All other associations disappeared.

Table 2 Results of subgroup analyses and accuracy of leucocyte-esterase in urine dipsticks in the diagnosis of urinary tract infections or bacteriuria. (subgroups of studies without information about the study characteristic at issue are not shown) (no. of studies: 35)

For details on sensitivity, specificity, odds ratios and predictive values, see Table 2. Post-test probabilities for different care settings are shown in Table 4 and Figure 1.

Nitrite and leucocyte-esterase: one or both positive (n = 39)

Eleven studies were outliers; low internal validity (n = 3, references: [29, 36, 82]) and supra-pubic urine-collection (n = 1, reference: [49]) were associated with outlying results: these studies were excluded. Sensitivity and specificity were weakly correlated (Spearman ρ = -0.227), and both remained heterogeneous. The ln(DOR) was homogeneous (Q = 41, df 34).

The ln(DOR) was univariately associated with the cut-off point of the criterion standard, the availability of clinical information, population groups and brand of dipstick (Table 3). Sensitivity and specificity were negatively correlated in the sub-group of the general population (ρ = -0.406), in children (ρ = -0.417), surgery patients (ρ = -1.0) and urology patients (ρ = -0.50). Sensitivity was homogeneous in pregnant women, surgery and urology patients; specificity was homogeneous in the latter two groups. The ln(DOR) was homogeneous in all population groups.

Table 3 Results of subgroup analyses and accuracy of combinations of both tests of nitrites and leucocyte-esterase in urine dipsticks in the diagnosis of urinary tract infections or bacteriuria (subgroups of studies without information about the study characteristic at issue are not shown).

Multivariate regression analysis retained the following independent factors: a cut-off point for the criterion standard of 1000 mcu/ml (1 study only, β = -1.823, 95% CI -3.629 to -0.017), studies in children (β = 1.176, 95% CI 0.477 to 1.875), studies in urology patients (β = 1.184, 95% CI 0.103 to 2.264) and the presence of clinical information (β = 0.893, 95% CI 0.259 to 1.527) (adjusted R2 regression model: 0.39). The model did not change when excluding the one study with the low criterion standard cut-off point (1000 mcu/ml).

For details on sensitivity, specificity and odds ratios, see Table 3. Post-test probabilities for different patient populations and care settings are shown in Table 4 and Figures 1 and 2.

Nitrite and leucocyte-esterase positive (n = 14)

Four studies were outliers, of which two had low external validity. As no factor was associated with the outliers, no studies were excluded. Sensitivity and specificity were negatively correlated (Spearman ρ = -0.275), and were both heterogeneous, as was the ln(DOR) (Q = 43, df 13).

The diagnostic odds ratio was associated with the cut-off point of the criterion standard and with population groups (Table 3). It was also associated with the cut-off point of the dipstick (β = -0.421, 95% CI -0.071 to -2.308), because of one study [46] that used a cut-off point of 1000 mcu/ml for the criterion standard. Sensitivity and specificity were negatively associated after exclusion of this study (Spearman ρ = -0.36), but remained heterogeneous, as did the ln(DOR). The ln(DOR) was only homogeneous in studies on children (Q = 9, df 5, Spearman ρ = -0.49).

In multivariate meta-regression, the independent factors were: studies in general populations, studies in surgery patients and one study with a criterion cut-off of 1000 mcu/ml. When excluding this last-mentioned study, studies in general populations (β = -2.312, 95% CI -3.950 to -0.675) and studies in surgery patients (β = -2.846, 95% CI -5.435 to -0.257) remained in the regression model (adjusted R2 regression model: 0.50).

For details on sensitivity, specificity and odds ratios see Table 3. Post-test probabilities for different patient populations are in Table 4 and Figure 2.

Discussion

Quality of the evidence

Before discussing the accuracy of the dipstick itself, one must take into account the amount and quality of the available evidence. The search was extensive, and identified a large number of studies published during the nineties. The quality of the research, as could be derived from the publications, was reasonable: 70% of the selected studies had an internal validity score which was approximately 70% of the maximum score. Only one in three publications had an external validity score of 75% or more. The importance of internal and external validity becomes clear from the fact that low scores were predominantly found among the outliers in this meta-analysis. A good description of the study population, using explicit selection criteria, is important: a major part of the existing heterogeneity in this meta-analysis could be explained by differences between study populations. The majority of the publications gave no information on important factors (such as the handling of contaminated samples or mixed cultures, or the micro-organisms cultured), which did not facilitate evaluation.

Nitrites

Overall, the sensitivity of the urine dipstick test for nitrites was low (45 – 60% in most situations) with higher levels of specificity (85 – 98%). The typically low pre-test probabilities resulted in high predictive values of negative test results. The test for nitrites had its highest accuracy in specific populations such as pregnant women, urology patients and elderly people. Only in the elderly did the test for nitrites reach a high sensitivity, while in pregnant women sensitivity was the lowest, confirming the results reported by Patterson [5]. Although statistically not significant, the test for nitrites might perform better in asymptomatic patients and in patients who are not on antibiotics, confirming the results reported by Beer [14].
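The link between a low pre-test probability and a high predictive value of a negative result follows from Bayes' theorem, and can be illustrated with a short sketch. The function name is hypothetical, and the input values are illustrative figures chosen from within the ranges quoted above, not pooled estimates from this meta-analysis.

```python
def post_test_probabilities(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given pre-test probability (prevalence)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)  # probability of disease after a positive test
    npv = true_neg / (true_neg + false_neg)  # probability of no disease after a negative test
    return ppv, npv

# Illustrative values only: sensitivity 50%, specificity 90%,
# at a low (10%) and a high (50%) pre-test probability.
ppv_low, npv_low = post_test_probabilities(0.50, 0.90, 0.10)
ppv_high, npv_high = post_test_probabilities(0.50, 0.90, 0.50)
```

With these illustrative inputs the negative predictive value falls as the pre-test probability rises, while the positive predictive value increases, which is why the same test can serve to exclude infection in one setting and not in another.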

In multivariable analysis the accuracy of the dipstick for nitrites was affected only by the cut-off point for the nitrites and the population tested. The differences between the studies with regard to implicit cut-off points may be caused by human, instrumental or environmental factors.

Patient populations and care setting were highly correlated. Pre-test probabilities differed between some levels of care. While it is often expected that pre-test probability increases with each level of the health care system, in this study it was found to be higher in family physician or primary care studies, compared to hospital studies. Family physicians apparently use the dipstick test to diagnose an infection based on clinical signs and symptoms, while hospital-based physicians order a dipstick test to screen patients to exclude the presence of an infection.

Leukocyte-esterase

Sensitivity of the urine dipstick test for leukocyte-esterase was, in general, slightly higher than for the dipstick test for nitrites (48 – 86%), while the specificity was slightly lower (17 – 93%). Generally, this resulted in a lower accuracy, compared to the test for nitrites, lower predictive values of positive test results and similar predictive values of negative test results.

The heterogeneity of the results of the urine dipstick test for leukocyte-esterase was only caused by factors related to external validity. Accuracy was higher for the detection of symptomatic UTI, compared to asymptomatic bacteriuria, as opposed to the test for nitrites. The leucocyte-esterase test had a much higher accuracy in urology patients, and consequently also in tertiary care, and when using a catheter for urine-collection. Sensitivity is highest in primary care, but requires further diagnostic work-up because of the high rates of false positives. In primary care negative results do not exclude the presence of infection.

Combination of nitrites and leukocyte-esterase

Combining the results of both parts of the dipstick tests with one or both showing a positive result increased sensitivity (68 to 88%), but had different effects on specificity. The considerable false positive rates weigh upon the predictive values of positive test results, as reported earlier [10]. This resulted in different effects on accuracy, but increased the predictive values of a negative test result in all study populations, except studies in general populations. A negative dipstick test result excluded the presence of infection in most studies, contrary to the findings of Hurlbut et al. [13]. Accuracy was highest in urology patients, surgery patients and in children. No differences were found between symptomatic UTI and asymptomatic bacteriuria, as was reported by Pelgrom [12]. When both tests were positive specificity increased, also raising the predictive value of a positive result to an acceptable level in general populations.

Recommendations for practice

Care setting and patient population are the major sources of heterogeneity. Consequently, these factors should be taken into account for optimal test use in different clinical circumstances. In the general population a negative test result for one of both tests has a sufficient predictive value to exclude disease, and when both test results are positive there is sufficient evidence to rule in infection. In children, pregnant women, and surgery or urology populations, a negative result for both tests also rules out infection, while a positive nitrite test still needs further work-up, although the probability of infection increases considerably. In the elderly a negative test result for both tests rules out infection, while a positive nitrite test rules in infection. Post-test probabilities of positive leucocyte-esterase are low in all population subgroups.

A family physician should take these considerations in specific population groups into account, but in non-specific patients in a general practice a positive nitrite test rules in infection. On the other hand, if both tests are available and one of them is negative, confirmation remains necessary, because of the amount of false positive results. In other settings clinicians may exclude infection on the basis of a single negative test result.

Criterion standard

For nitrites and leukocyte-esterase, both separately and combined, the use of a more stringent definition of infection by increasing the cut-off point of the culture raised accuracy significantly. The lower cut-off point, at less than 1,000 mcu/ml, used mainly in supra-pubic urine-collection, resulted for nitrites in a higher accuracy through higher sensitivities. The present findings do not demonstrate systematically higher false positive rates with more stringent definitions of infection, as was observed by Gorelick [1]. The lowest cut-off point had higher false positive rates, but not the cut-off point at 10⁵ mcu/ml.

Conclusions

Research in this field can still be improved by implementing clear inclusion and exclusion criteria, and by double-blind study designs. Reporting on the distribution of micro-organisms, the way in which urine is collected, the time delay between collection and analysis, whether only first-voided urine was collected, the handling of mixed cultures and contaminated urine samples, and who was reading the test, may improve future systematic reviews of test accuracy. If sample-sizes are adequate, the publication of results for relevant sub-groups may also increase the quality of future diagnostic studies in this field. Although this meta-analysis covers the evidence published over the last decade, the validity of its results is also limited by the limited specifications given in the publications. As specific patient populations – a proxy-indicator for spectrum of disease – seem to be the major source of heterogeneity of accuracy, more details about patients in different clinical settings might increase the validity of a future meta-analysis.

Overall, this review demonstrates that the urine dipstick test alone seems to be useful in all populations to exclude the presence of infection if the results for nitrites or leukocyte-esterase are negative. Sensitivities of the combination whereby one or both test results are positive vary between 68 and 88% in different patient groups, but positive test results have to be confirmed or pre-test probabilities have to be high on the basis of the clinical history and/or a combination of other tests. In family practice, the combination of both tests with at least one positive result is very sensitive, but because of its low specificity the usefulness of the dipstick test alone remains doubtful, even with high pre-test probabilities.