Background

Lyme borreliosis is one of the most prevalent vector-borne diseases in Europe. Its incidence varies between countries, with approximately 65,500 patients annually in Europe (estimated in 2009) [1]. It is caused by spirochetes of the Borrelia burgdorferi sensu lato species complex, which are transmitted by several species of ixodid ticks [2]. In Europe, at least five genospecies of the Borrelia burgdorferi sensu lato complex can cause disease, leading to a variety of clinical manifestations including erythema migrans (EM), neuroborreliosis, arthritis and acrodermatitis chronica atrophicans (ACA). Each of these clinical presentations can be seen as a distinct target condition, i.e. the disorder that a test aims to detect, as they affect different body parts and different organ systems, and because patients suffering from these conditions may enter and travel through the health care system in different ways, hence following different clinical pathways.

The diagnosis of Lyme borreliosis is based on the presence of specific symptoms, combined with laboratory evidence of infection. Laboratory confirmation is essential in case of non-specific disease manifestations. Serology is the cornerstone of Lyme laboratory diagnosis, both in primary care and in more specialized settings. The serological tests most often used are enzyme-linked immunosorbent assays (ELISAs) and immunoblots. ELISAs are used first; immunoblots are typically applied only when the ELISA is positive. If signs and symptoms are inconclusive, the decision may be driven by the serology results. In such a situation, patients may be treated with antibiotics after a positive serology result – a positive ELISA, possibly followed by a positive immunoblot. After negative serology – a negative ELISA, or a positive ELISA followed by a negative immunoblot – patients will not be treated for Lyme borreliosis, but they will be followed up or referred for further diagnosis. This implies that falsely positive patients (who do not have Lyme borreliosis, but have positive serology) will be treated for Lyme borreliosis while they have another condition. It also implies that falsely negative patients (who have the disease, but test negative) will not be treated for Lyme borreliosis. A test with a high specificity – the percentage of true negative results among patients without the target condition – will result in a low percentage of false positives. A test with a high sensitivity – the percentage of true positives among patients with the target condition – will result in a low percentage of false negatives.
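The definitions and the serial testing algorithm described above can be sketched in a few lines of code. This is an illustration only; the function names and the simple boolean logic are ours, not a published algorithm.

```python
def sensitivity(tp, fn):
    """Proportion of true-positive results among patients WITH the target condition."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of true-negative results among patients WITHOUT the target condition."""
    return tn / (tn + fp)

def two_tiered_positive(elisa_positive, immunoblot_positive):
    """Serial (two-tiered) algorithm: the immunoblot is only consulted when the
    ELISA is positive; the overall result is positive only if both tests are."""
    return elisa_positive and immunoblot_positive
```

Because the immunoblot can only confirm or overturn a positive ELISA, the serial combination can never be more sensitive than the ELISA alone, but it can be more specific.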

The interpretation of serology results is complicated. The link between antibody status and actual infection may not be obvious: non-infected people may have immunity and test positive, while infected people may have a delayed antibody response and test negative. Furthermore, there is an overwhelming number of different available assays, all evaluated in different patient populations and settings, which may perform differently for the various disease manifestations [3]. We therefore systematically reviewed all available literature to assess the accuracy (expressed as sensitivity and specificity) of serological tests for the diagnosis of the different manifestations of Lyme borreliosis in Europe. Our secondary aim was to investigate potential sources of heterogeneity, for example test type, whether the test was a commercial or an in-house test, publication year and the antigens used.

Methods

We searched EMBASE and Medline (Appendix 1) and contacted experts for studies evaluating serological tests against a reference standard. The reference standard is the test or testing algorithm used to define whether someone has Lyme borreliosis or not. We included studies using any reference standard, but most studies used clinical criteria, sometimes in combination with serology. Studies performed in Europe and published in English, French, German, Norwegian, Spanish and Dutch were included.

The ideal study type to answer our question would be a cross-sectional study, including a series of representative, equally suspected patients who undergo both the index test and the reference standard [4]. Such studies would provide valid estimates of sensitivity and specificity and would also directly provide estimates of prevalence and predictive values. However, as we anticipated that these cross-sectional studies would be very sparse, we decided to include case-control studies or so-called two-gate designs as well [5]. These studies estimate the sensitivity of a test in a group of cases, i.e. patients for whom one is relatively sure that they have Lyme borreliosis. They estimate the specificity in a group of controls, i.e. patients of whom one is relatively sure that they do not have Lyme borreliosis. These are healthy volunteers, or patients with other diseases than Lyme.

We included studies on ELISAs, immunoblots, two-tiered testing algorithms of an ELISA followed by an immunoblot, and specific antibody index measurement (calculated using the antibody titers in both serum and cerebrospinal fluid). We excluded indirect fluorescent antibody assays, as these are rarely used in practice. Studies based on make-up samples were excluded. We also excluded studies for which 2 × 2 tables could not be inferred from the study results.

For each article, two authors independently collected study data and assessed quality. We assessed the quality using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) checklist. This checklist consists of four domains: patient selection, index test, reference standard and flow and timing [6]. Each of these domains has a sub-domain for risk of bias and the first three have a sub-domain for concerns regarding the applicability of study results. The sub-domains about risk of bias include a number of signalling questions to guide the overall judgement about whether a study is highly likely to be biased or not (Appendix 2).

We analysed test accuracy for each of the manifestations of Lyme borreliosis separately, and separately for case-control designs and cross-sectional designs. If a study did not distinguish between the different manifestations, we used its data in the analysis for the target condition “unspecified Lyme”. Serology assays measure the level of immunoglobulins (Ig) in the patient’s serum. IgM is the antibody most prevalent in the early stages of disease, while IgG increases later in the disease. Some tests measure only IgM, some only IgG and some any type of Ig. In some studies, the accuracy was reported for IgM only, for IgG only and for detection of both IgG and IgM. In those cases, we included the data for simultaneous detection of both IgG and IgM (IgT).

We meta-analyzed the data using the Hierarchical Summary ROC (HSROC) model, a hierarchical meta-regression method incorporating both sensitivity and specificity while taking into account the correlation between the two [7]. The model assumes an underlying summary ROC curve through the study results and estimates the parameters of this curve: the accuracy, the threshold at which the tests are assumed to operate, and the shape of the curve. Accuracy is a combination of sensitivity and specificity; the shape of the curve provides information about how accuracy varies when the threshold varies. From these parameters we derived the reported sensitivity and specificity estimates. We used SAS 9.3 for the analyses and Review Manager 5.3 for the ROC plots.
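As an illustration of how a summary curve follows from these parameters: in the Rutter–Gatsonis HSROC parameterization, the summary curve can be written in closed form as logit(sensitivity) = Λ·exp(−β/2) + exp(−β)·logit(1 − specificity), where Λ is the accuracy parameter and β the shape parameter (β = 0 gives a symmetric curve with a constant diagnostic odds ratio of exp(Λ)). A minimal sketch follows; the parameter values used in any example call are arbitrary illustrations, not estimates from this review.

```python
import math

def hsroc_sensitivity(fpr, accuracy, shape):
    """Sensitivity on the Rutter-Gatsonis summary ROC curve at a given
    false-positive rate (fpr = 1 - specificity):
    logit(sens) = accuracy * exp(-shape/2) + exp(-shape) * logit(fpr)."""
    logit_fpr = math.log(fpr / (1 - fpr))
    logit_sens = accuracy * math.exp(-shape / 2) + math.exp(-shape) * logit_fpr
    return 1 / (1 + math.exp(-logit_sens))
```

With shape = 0, the odds ratio between sensitivity and the false-positive rate stays fixed along the curve, so e.g. an accuracy of log(16) pairs a specificity of 80 % with a sensitivity of 80 %.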

There is no recommended measure to estimate the amount of heterogeneity in diagnostic accuracy reviews, but researchers are encouraged to investigate potential sources of heterogeneity [7]. The most prominent source of heterogeneity is variation in threshold, which is taken into account by using the HSROC model. Other potential sources of heterogeneity are: test type (ELISA or immunoblot); a test being commercial or not; immunoglobulin type; antigen used; publication year; late versus early disease; and study quality. These were added as covariates to the model to explain variation in accuracy, threshold or shape of the curve.

Some studies reported results for patients with “possible Lyme” (i.e. no clear cases, neither clear controls). We included these as cases. As this may lead to underestimation of sensitivity, we investigated the effect of this approach. Borderline test results were included in the test-positive group.

Results

Selection and quality assessment

Our initial search in January 2013 retrieved 8026 unique titles and a search update in February 2014 revealed another 418 titles. After careful selection by two authors independently (ML, HS), we read the full text of 489 studies, performed data extraction on 122 studies and finally included 75 unique published articles (Fig. 1). Fifty-seven of these had a case-control design, comparing a group of well-defined cases with a group of healthy controls or controls with diseases that could lead to cross-reactivity of the tests [8–64]. Eighteen had a cross-sectional design in which a more homogeneous sample of patients underwent both the serological assay(s) and the reference standard [65–82]. Three studies were not used in the meta-analyses, either because they used an immunoblot as a reference standard [76, 79] or because they included asymptomatic cross-country runners with high IgG titers as controls [47].

Fig. 1

Results of the search and selection process

None of the studies had a low risk of bias in all four QUADAS-2 domains (Fig. 2 and Tables 1 and 2). Forty-six of the 57 case-control studies and six of the 18 cross-sectional studies scored unclear or high risk of bias in all four domains. All case-control studies had a high risk of bias in the patient sampling domain, because these designs exclude all “difficult to diagnose” patients [83]. Only three studies reported that the assessment of the index test was blinded to the disease status of the participants [45, 66, 75]. The cut-off value used to decide whether a test is positive or negative was often decided after the study was done, which may also lead to bias in the index test domain [84]. The most common problem was inclusion of the serology results in the reference standard. The flow and timing domain was problematic in all case-control studies, as cases and controls are usually verified in different ways. Three studies reported a potential conflict of interest [31, 39, 62]. Most studies raised high concerns regarding applicability, which means that either the included patients or the tests used were not representative of clinical practice. Only three studies were representative in all domains [65, 73, 81].

Fig. 2

Methodological quality graph. Review authors’ judgements about each methodological quality item, presented as percentages across all included studies. On the left-hand side are the judgements for the included case-control studies; on the right-hand side those for the included cross-sectional studies. RoB: Risk of Bias; CrA: Concerns regarding applicability; P: patient sampling; I: Index test; RS: Reference Standard; TaF: timing and flow

Table 1 Quality assessment of included case-control studies
Table 2 Quality assessment of included cross-sectional studies

Meta-analyses

Erythema migrans

Nineteen case-control studies including healthy controls evaluated the accuracy of serological tests for EM. The summary sensitivity of ELISA or immunoblot for detecting EM patients was 50 % (95 % CI 40 % to 61 %) and the specificity 95 % (95 % CI 92 % to 97 %). ELISA tests had a higher accuracy than immunoblots (P = 0.008), mainly due to a higher sensitivity (Table 3). Commercial tests did not perform significantly differently from in-house tests. The 23 case-control studies on EM including cross-reacting controls had similar results (data not shown). One cross-sectional study in EM-suspected patients evaluated four different immunoblots in patients with a positive or unclear ELISA result; their sensitivity varied between 33 and 92 % and their specificity between 27 and 70 % [66].

Table 3 Summary estimates of sensitivity and specificity for all case-definitions, derived from a hierarchical summary ROC model. The results may be different from those in the main text, as here they are specified for immunoblots and ELISAs and for commercial and in-house tests separately, while in the main text the overall estimates are provided

Neuroborreliosis

Twenty case-control studies on neuroborreliosis included healthy controls. Their overall sensitivity was 77 % (95 % CI 67 % to 85 %) and their specificity 93 % (95 % CI 88 % to 96 %) (Fig. 3a). On average, ELISA assays had a lower accuracy than immunoblot assays (P = 0.042). The in-house ELISAs had the lowest specificity of all tests (Table 3). Twenty-six case-control studies with cross-reacting controls showed similar results, but with a lower specificity (data not shown). The ten cross-sectional studies on neuroborreliosis had a median prevalence of 50 % (IQR 37 % to 70 %). The summary sensitivity for any serological test done in serum was 78 % (95 % CI 53 % to 92 %) and the specificity was 78 % (95 % CI 40 % to 95 %) (Fig. 3b). Whether a test was an ELISA or an immunoblot, commercial or in-house, did not affect the model parameters.

Fig. 3

Raw ROC plots and fitted summary ROC curves. Every symbol reflects a 2 × 2 table, one for each test. Blue triangle = commercial EIA; Red diamond = in house EIA; Green rectangle = commercial IB; Black circle = in house IB. One study may have contributed more than one 2 × 2 table. The dots on the summary ROC curves reflect the summary estimate of sensitivity and specificity. a neuroborreliosis case-control studies including healthy controls. b: neuroborreliosis cross-sectional studies. c unspecified Lyme borreliosis case-control studies including healthy controls. d unspecified Lyme borreliosis cross-sectional studies. The size of the symbol reflects the sample size. For the cross-sectional studies, only the overall summary ROC curve is shown, while for the case-control designs the curves are shown for the different test-types

Lyme Arthritis

Meta-analysis was not possible for the eight case-control studies on Lyme arthritis with healthy controls. We therefore only report the median estimates and their interquartile range (IQR). Median sensitivity was 96 % (IQR 93 % to 100 %); median specificity was 94 % (IQR 91 % to 97 %) (Table 3). Three cross-sectional studies were done in patients suspected of Lyme arthritis; this was insufficient to do a meta-analysis [66, 71, 85].

Acrodermatitis chronica atrophicans

The nine case-control studies on ACA including a healthy control group had a high summary sensitivity for any serological assay: 98 % (95 % CI 84 % to 100 %). Specificity was 94 % (95 % CI 90 % to 97 %). One study had an extremely low sensitivity for the in-house assay evaluated, most likely because one of the antigens used (OspC) is no longer expressed by the spirochetes in long-standing disease [45]. Test type was not added to the analyses because of insufficient data. Case-control studies for ACA including cross-reacting controls had a lower sensitivity and specificity than the healthy-control designs (both 91 %).

Unspecified Lyme borreliosis

Thirteen case-control studies included unspecified Lyme borreliosis cases and healthy controls. Their summary sensitivity for any test was 73 % (95 % CI 53 % to 87 %) and the specificity was 96 % (95 % CI 91 % to 99 %) (Fig. 3c). Commercial tests had a lower accuracy (P = 0.008), mainly due to a lower sensitivity (Table 3). Twelve studies including cross-reacting controls had a summary sensitivity of 81 % (95 % CI 64 % to 91 %) and a specificity of 90 % (95 % CI 79 % to 96 %). Five cross-sectional studies aimed to diagnose an unspecified form of Lyme borreliosis (Fig. 3d). The prevalence varied between 10 and 79 %, indicating a varying patient spectrum. Sensitivity was 77 % (95 % CI 48 % to 93 %) and specificity 77 % (95 % CI 46 % to 93 %). There were insufficient data points to analyze test type.

Two-tiered tests

One case-control study investigated the diagnostic accuracy of two-tiered approaches for all manifestations and healthy controls [11]. The sensitivity of the European algorithms varied between 55 % for EM and 100 % for ACA. The specificity for all assays was ≥ 99 %. Another case-control study investigated 12 different algorithms for ‘late Lyme borreliosis’ and ‘early Lyme borreliosis’ [21]. Their sensitivity varied between 4 and 50 % and the specificity varied between 88 and 100 %. One case-control study including EM cases and healthy controls and evaluating two algorithms reported a sensitivity of 11 % or 43 % and a specificity of 100 % [14]. Two cross-sectional studies on two-tiered tests aimed at diagnosing neuroborreliosis [80, 81] and two at diagnosing unspecified Lyme borreliosis [67, 70]. Their prevalence varied between 19 and 77 %; their sensitivity between 46 and 97 %; and their specificity between 56 and 100 %.

Specific antibody index

Seven studies containing cross-reacting controls evaluated a specific antibody index for the diagnosis of neuroborreliosis. The summary sensitivity was 86 % (95 % CI 63 % to 95 %) and specificity 94 % (95 % CI 85 % to 97 %). The four cross-sectional studies had a summary sensitivity of 79 % (95 % CI 34 % to 97 %) and a summary specificity of 96 % (95 % CI 64 % to 100 %).

Heterogeneity

The IgG tests had a sensitivity comparable to that of the IgM tests, except for EM (IgM slightly higher sensitivity) and for Lyme arthritis and ACA (IgM much lower sensitivity in both). Tests assessing both IgM and IgG had the highest sensitivity and the lowest specificity, although specificity was above 80 % in most cases (Table 4).

Table 4 Summary estimates of sensitivity and specificity for IgM versus IgG versus IgM or IgG (IgT)

We evaluated the effect of three antigen types: whole-cell, purified proteins or recombinant antigens. In neuroborreliosis, recombinant antigens had both the highest sensitivity and specificity, while in unspecified Lyme they had the lowest sensitivity and specificity (Table 5). Year of publication showed an effect only for erythema migrans and neuroborreliosis: in both cases, publications from before the year 2000 showed a lower sensitivity than those from after 2000 (Table 6). Antigen type and year of publication were not associated with each other.

Table 5 Generation of antigens
Table 6 Year of publication

For unspecified Lyme we were able to directly compare the accuracy in early stages of disease with the accuracy in later stages. The tests showed a lower sensitivity and a slightly higher specificity in the early stages of the disease (Table 7).

Table 7 Early versus late Lyme borreliosis

We were able to meta-analyze manufacturer-specific results for only two manufacturers, but the results showed much variability and the confidence intervals were broad.

We investigated the effect of the reference standard domain of QUADAS-2: an acceptable case definition versus none or unclear, and serology in the case definition versus none or unclear. Neither had a significant effect on accuracy. The study by Ang contained at least eight different 2 × 2 tables for each case definition and may therefore have weighed heavily on the results [8]. However, sensitivity analysis showed that its effect was only marginal. The same was true for treating possible cases as controls and indeterminate test results as negatives.

Discussion

Overall, the diagnostic accuracy of ELISAs and immunoblots for Lyme borreliosis in Europe varies widely, with an average sensitivity of ~80 % and a specificity of ~95 %. For Lyme arthritis and ACA the sensitivity was around 95 %. For EM the sensitivity was ~50 %. In cross-sectional studies of neuroborreliosis and unspecified Lyme borreliosis, the sensitivity was comparable to the case-control designs, but the specificity decreased to 78 and 77 % respectively. Two-tiered tests did not outperform single tests. Specific antibody index tests did not outperform the other tests for neuroborreliosis, although the specificity remained high even in the cross-sectional designs. All results should be interpreted with caution, as the results showed much variation and the included studies were at high risk of bias.

Although predictive values could not be meta-analyzed, the sensitivity and specificity estimates from this review may be used to provide an idea of the consequences of testing when the test is being used in practice. Imagine that a clinician sees about 1000 people a year who are suspected of one of the manifestations of Lyme borreliosis, in a setting where the expected prevalence of that manifestation is 10 %. A prevalence of 10 % would mean that 100 out of 1000 tested patients really have a form of Lyme borreliosis. If these people are tested with an ELISA with a sensitivity of 80 %, then 0.80*100 = 80 patients with Lyme borreliosis will test positive and 20 will test negative. If we assume a specificity of 80 % as well (following the estimates from the cross-sectional designs), then out of the 900 patients without Lyme borreliosis, 0.80*900 = 720 will test negative and 180 will test positive. These numbers mean that in this hypothetical cohort of 1000 tested patients, 80 + 180 = 260 patients will have a positive test result. Only 80 of these will be true positives and indeed have Lyme borreliosis (positive predictive value 80/260 = 0.31 = 31 %). The other 180 positively tested patients are false positives and will be treated for Lyme while they have another cause of disease, thus delaying their final diagnosis and subsequent treatment. In a two-tiered approach, all positives will be tested with an immunoblot after the ELISA. These numbers also mean that we will have 720 + 20 = 740 negative test results, of which 20 (3 %) belong to patients who have Lyme borreliosis despite a negative test result (negative predictive value 720/740 = 0.97 = 97 %). These are the false negatives; their diagnosis will be missed or delayed. Although calculations like these may provide insight into the consequences of testing, they should be interpreted with caution. The results were overall very heterogeneous and may depend on patient characteristics. Also, the prevalence of 10 % may not be realistic.
In our review, we found prevalences ranging from 1 to 79 % for unspecified Lyme borreliosis and ranging from 12 to 62 % for neuroborreliosis. Appendix 3 shows some more of these inferences, for different prevalence situations and different sensitivity and specificity of the tests.
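The arithmetic of this hypothetical cohort can be reproduced with a short function. This is a sketch only; the prevalence of 10 % and the sensitivity and specificity of 80 % are the illustrative values used above, not summary estimates for any particular test.

```python
def predictive_values(n, prevalence, sensitivity, specificity):
    """Expected 2 x 2 counts and predictive values in a tested cohort."""
    diseased = n * prevalence          # patients who really have the condition
    healthy = n - diseased
    tp = sensitivity * diseased        # true positives
    fn = diseased - tp                 # false negatives (missed diagnoses)
    tn = specificity * healthy         # true negatives
    fp = healthy - tn                  # false positives (treated for the wrong disease)
    ppv = tp / (tp + fp)               # positive predictive value
    npv = tn / (tn + fn)               # negative predictive value
    return tp, fp, tn, fn, ppv, npv
```

Calling it with the values of the example (n = 1000, prevalence 0.10, sensitivity 0.80, specificity 0.80) reproduces the 80 true positives, 180 false positives, 720 true negatives and 20 false negatives described above, and makes it easy to repeat the exercise for other prevalences.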

Limitations of this review are the representativeness of the results, the poor reporting of study characteristics and the lack of a true gold standard. Most included studies were case-control studies. These may be easier to perform in a laboratory setting than cross-sectional designs, but their results are less representative of clinical practice. Also, the immunoblot was not analysed in a way that is representative of practice: most immunoblots were analysed on the same samples as the ELISAs, while in practice immunoblots will only be used on ELISA-positive samples. EM patients formed the second largest group of patients in our review. The low sensitivity in this group of patients supports the guidelines stating that serological testing in EM patients is not recommended [86]. On the other hand, patients with atypical manifestations were not included in the reviewed studies, while this group of patients does pose a diagnostic problem [87, 88]. A more detailed analysis of the included patients’ characteristics and test characteristics would have been valuable, but these characteristics were poorly reported. This is also reflected in the quality assessment tables, with many ‘unclear’ scores, even for more recent studies. Authors may not have been aware of existing reporting guidelines and we therefore suggest that authors of future studies use the STAndards for Reporting Diagnostic accuracy studies (STARD) to guide their manuscript [89].

There is no gold standard for Lyme borreliosis, so we used the reference standard as presented by the authors of the included studies. This may have added to the amount of variation. Furthermore, many of the investigated studies included the results of antibody testing in their definition of Lyme borreliosis, which may have led to overestimation of sensitivity and specificity. However, this effect was not confirmed by our heterogeneity analyses.

The performance of diagnostic tests very much depends on the population in which the test is being used. Future studies should therefore be prospective cross-sectional studies including a consecutive sample of presenting patients, preferably stratified by the situation in which the patient presents (e.g. a tertiary Lyme referral center versus general practice). The lack of a gold standard may be solved by using a reference standard with multiple levels of certainty [90, 91]. Although this will diminish contrasts and will thus be more difficult to interpret, it does reflect practice in a better way. Other solutions may be more statistically derived approaches like latent class analysis, use of expert-opinion and/or response to treatment [92].

However, more and better designed diagnostic accuracy studies will not improve the accuracy of these tests themselves. They will provide more valid estimates of the tests’ accuracy, including predictive values, but the actual added value of testing for Lyme disease requires information about subsequent actions and consequences of testing. If the final diagnosis or referral pattern is solely based upon the clinical picture, then testing patients for Lyme may have no added value. In that case, a perfect test may still be useless if it does not change clinical management decisions [93]. On the other hand, imperfect laboratory tests may still be valuable for clinical decision making if subsequent actions improve the patient’s outcomes. The challenge for clinicians is to deal with the uncertainties of imperfect laboratory tests.

Conclusions

We found no evidence that ELISAs have a higher or lower accuracy than immunoblots; neither did we find evidence that two-tiered approaches have a better performance than single tests. However, the data in this review do not provide sufficient evidence to make inferences about the value of the tests for clinical practice. Valid estimates of sensitivity and specificity for the tests as used in practice require well-designed cross-sectional studies, done in the relevant clinical patient populations. Furthermore, information is needed about the prevalence of Lyme borreliosis among those tested for it and the clinical consequences of a negative or positive test result. The latter depend on the place of the test in the clinical pathway and the clinical decisions that are driven by the test results or not. Future research should primarily focus on more targeted clinical validations of these tests and research into appropriate use of these tests.

Availability of data and materials

The raw data (data extraction results, reference lists, statistical code) will be provided by ECDC upon request (info@ecdc.europa.eu).