Introduction

Low back pain is a global health and socio-economic problem [1], as it is the leading cause of disability and work absenteeism in the Western world [2]. When conservative measures fail, operative intervention can be considered. Spinal fusion is a surgical procedure in which rigid fixation of vertebral segments is achieved by means of osteosynthesis and bone grafting to create definite bony fusion of the vertebrae involved. Failed spinal fusion may occur in 30–40% of spinal fusion patients [3, 4]. Pseudarthrosis is defined as the absence of solid bony fusion at a minimum follow-up of 6 months after surgery [5, 6]. Pseudarthrosis can be associated with persistent or recurrent back and/or leg pain [7], but can also be asymptomatic [7,8,9]. Whether symptomatic or asymptomatic, pseudarthrosis increases the risk of material failure, late deformity, and neurological symptoms [10, 11].

Revision surgery is the preferred treatment in spinal fusion patients suffering from symptoms due to pseudarthrosis. Because revision surgery is invasive, expensive, and may have a worse outcome than primary surgery [12, 13], it should only be performed when the pseudarthrosis diagnosis is irrefutable. Since symptoms of pseudarthrosis may be nonspecific and multiple individual sources of pain may contribute to the complex of symptoms [14], diagnostic tools are required to establish the diagnosis. The gold standard for the diagnosis of pseudarthrosis is surgical exploration [5, 7, 15, 16], an invasive, costly, and nowadays rarely used test that is neither desirable nor ethical in patients without symptoms. The aim of this study was to determine the diagnostic accuracy of imaging modalities to detect pseudarthrosis after thoracolumbar spinal fusion, with surgical exploration as the reference standard.

Materials and methods

Identification of studies

This review was performed according to the PRISMA statement guidelines [17, 18]. A systematic literature search was conducted in the PubMed, EMBASE, and CINAHL databases from inception until February 2017 to identify relevant studies. A list of keywords and text words was formulated to describe the detection of pseudarthrosis by imaging as index test compared to surgical exploration as reference standard in patients after spinal fusion surgery. Terms for imaging: tomography, radiography, plain radiographs, MRI, CT, scintigraphy, SPECT, SPECT/CT, PET, PET/CT, DEXA. Terms for study design: diagnostic accuracy, precision, predictive value, sensitivity, specificity, false positive, false negative. Terms for patient population: spine, vertebrae, vertebral column, spinal fusion, spinal arthrodesis, spondylodesis, bone graft, pseudarthrosis, non-union, delayed union, clinical failure, surgical exploration, re-operation, second-look operation. The search was limited to the English language.

Once the search was completed, the resulting articles were checked for duplicates. Subsequently, two independent reviewers (PW, orthopedic surgeon with over 10 years of experience in spinal surgery, and MP, junior researcher specialized in imaging) screened the identified citations to determine whether they met predefined inclusion and exclusion criteria. If disagreements could not be resolved by consensus, a third reviewer (CB, clinical epidemiologist with over 15 years of experience in conducting systematic reviews) was consulted. Only original studies that provided data to construct contingency tables were included. Exclusion criteria were spinal fusion for the indications of bone fracture, tumor, or infection; a time interval between surgery and index test of less than 6 months; a patient population smaller than ten; cervical fusion; animal studies; and in vitro studies.

Data extraction

Standard reference data, population characteristics, details on spinal fusion, index test, reference test, and time intervals were extracted by the reviewers (PW, MP). Disagreements were resolved by consensus. Besides study characteristics, diagnostic accuracy data was extracted. Since the outcome was dichotomous (diagnosis was either pseudarthrosis or fusion), contingency tables were constructed. We also recorded whether the results originated from per-patient-, per-level-, or per-side-based analysis.

Methodological quality

The methodological quality of each selected study was assessed independently by the reviewers according to the Quality Assessment for Diagnostic Accuracy Studies 2 (QUADAS-2) tool [19]. The QUADAS-2 tool consists of four key domains: patient selection, index test, reference standard, and flow and timing (the flow of patients through the study and the timing of the index test and reference standard). Each domain was scored for risk of bias, and the first three domains were also scored for concerns regarding applicability to the research question. Disagreements were resolved by consensus.

Data synthesis and statistical analysis

Pseudarthrosis was defined as a positive test result and fusion as a negative test result. Diagnostic accuracy values were calculated from the extracted contingency tables. Continuity correction was applied to studies with zero-cell counts by adding 0.5 to all cells of the study [20]. Per index test, the studies describing that test were considered for inclusion into subgroup meta-analysis.
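The calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the review: the names `Table2x2`, `continuity_correct`, and `accuracy_measures` are hypothetical, and the counts in the usage note are invented for demonstration. It shows how the accuracy values follow from a single contingency table, with 0.5 added to every cell only when a zero cell is present.

```python
from dataclasses import dataclass


@dataclass
class Table2x2:
    tp: float  # imaging: pseudarthrosis, exploration: pseudarthrosis
    fp: float  # imaging: pseudarthrosis, exploration: fusion
    fn: float  # imaging: fusion, exploration: pseudarthrosis
    tn: float  # imaging: fusion, exploration: fusion


def continuity_correct(t: Table2x2, k: float = 0.5) -> Table2x2:
    """Add k to every cell, but only if at least one cell is zero."""
    if 0 in (t.tp, t.fp, t.fn, t.tn):
        return Table2x2(t.tp + k, t.fp + k, t.fn + k, t.tn + k)
    return t


def accuracy_measures(t: Table2x2) -> dict:
    """Diagnostic accuracy values from one (corrected) contingency table."""
    t = continuity_correct(t)
    sens = t.tp / (t.tp + t.fn)
    spec = t.tn / (t.tn + t.fp)
    total = t.tp + t.fp + t.fn + t.tn
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": t.tp / (t.tp + t.fp),
        "npv": t.tn / (t.tn + t.fn),
        "lr_pos": sens / (1 - spec),
        "lr_neg": (1 - sens) / spec,
        "prevalence": (t.tp + t.fn) / total,
        "accuracy": (t.tp + t.tn) / total,
        "odds_ratio": (t.tp * t.tn) / (t.fp * t.fn),
    }
```

For a hypothetical table with 10 true positives, 0 false positives, 5 false negatives, and 5 true negatives, the correction yields cells of 10.5, 0.5, 5.5, and 5.5, so the OR becomes (10.5 × 5.5) / (0.5 × 5.5) = 21 instead of being undefined.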

Inclusion in meta-analysis

Meta-analysis was only performed when studies evaluating the same modality were not significantly hampered by clinical heterogeneity. Studies were considered clinically heterogeneous when patient groups, outcome measures, and/or the execution of index tests were considerably different.

A random-effects model was employed in the meta-analyses to account for unobserved sources of variation [21]. The odds ratio (OR) was used as the principal summary measure in meta-analysis. The higher the OR, the better the discriminatory performance. An OR of 1 indicates a test that does not discriminate between patients with pseudarthrosis and patients with fusion [22]. An OR below 1 suggests a negative association between index test and surgical exploration. Analyses were performed using the Stata statistical software package, version 14.1 (StataCorp, College Station, TX, USA).
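To make the pooling step concrete, the sketch below combines study-level ORs under a random-effects model. The text does not state which estimator was used, so the DerSimonian-Laird method applied to log ORs is an assumption here, as are the function names; the review's actual analysis was run in Stata. Input tables are assumed to be already continuity-corrected, so no cell is zero.

```python
import math


def log_or_and_variance(tp, fp, fn, tn):
    """Log odds ratio and its approximate variance (Woolf's method)."""
    log_or = math.log((tp * tn) / (fp * fn))
    var = 1 / tp + 1 / fp + 1 / fn + 1 / tn
    return log_or, var


def pool_random_effects(tables):
    """Pool 2x2 tables given as (tp, fp, fn, tn) tuples.

    Assumption: DerSimonian-Laird estimate of between-study variance.
    """
    ys, vs = zip(*(log_or_and_variance(*t) for t in tables))
    w = [1 / v for v in vs]  # fixed-effect (inverse-variance) weights
    y_fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    # Cochran's Q measures observed heterogeneity between studies
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, ys))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(ys) - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add the between-study variance tau2
    w_star = [1 / (v + tau2) for v in vs]
    y_pooled = sum(wi * yi for wi, yi in zip(w_star, ys)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return {
        "or": math.exp(y_pooled),
        "ci_low": math.exp(y_pooled - 1.96 * se),
        "ci_high": math.exp(y_pooled + 1.96 * se),
        "tau2": tau2,
    }
```

When the studies agree closely, the between-study variance estimate is zero and the result coincides with a fixed-effect pooled OR; with heterogeneous studies, the weights become more equal and the confidence interval widens.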

Results

Identification of studies

One hundred sixty-five potentially relevant references were identified through database search. After screening, 15 studies were included in this review, reporting on eight modalities: plain radiography, flexion-extension radiography (FE radiography), computed tomography (CT), single-photon emission computed tomography (SPECT), planar scintigraphy, polytomography, ultrasound/sonography (US), and 18F-fluoride positron emission tomography/computed tomography (PET/CT). The study selection flowchart is detailed in Fig. 1. The level of evidence of the included studies ranged from I to III.

Fig. 1

Flowchart showing the selection of studies from electronic search (identification) until inclusion in the subgroup meta-analyses. Initially, 165 potentially relevant references were identified through database search. One hundred thirty-two were obtained for further screening after removal of 33 duplicates. After removal based on title and abstract screening, the full text of 35 articles was screened and their reference sections were scanned for additional eligible studies. Thereafter, 15 studies were included in this review, reporting on eight modalities. The meta-analysis part at the bottom of the figure is discussed under ‘Inclusion in meta-analysis’ in the Results section. * 3 of the 15 studies described 2 to 4 modalities, leading to 22 included items

Data extraction

Study characteristics of the 15 included studies are listed in Table 1. The number of levels fused in a single patient during initial surgery ranged from 1 to 13 levels. Eight articles monitored pseudarthrosis per patient, five monitored each level separately, and two made a distinction between the left and right side of each operated level. All articles reported that persistent low back pain and/or suspicion of pseudarthrosis was the reason for surgical exploration. The time interval between initial surgery and surgical exploration ranged from 6 to 120 months.

Table 1 Study characteristics

Methodological quality assessment

Table 2 displays the quality assessment according to QUADAS-2. An overview of the distribution of QUADAS-2 scores is presented in Fig. 2. Risk of bias on ‘flow and timing’, ‘patient selection’, ‘index test’, and ‘reference standard’ was classified as high or unclear in 58% of cases. Common weaknesses related to poor documentation of patient selection and description of the reference standard. Two studies were considered to have low risk of bias in all four domains. Concerns regarding applicability on ‘patient selection’, ‘index test’, and ‘reference standard’ were classified as high or unclear in 42% of cases. Three studies were considered to have low applicability concerns in all three domains.

Table 2 QUADAS-2 results for the 15 studies included in this review
Fig. 2

Stacked bar charts of QUADAS-2 scores presenting a quick overview of the methodological quality of the 15 included studies, expressed as a percentage of studies that met each criterion. For each quality domain, the proportion of included studies that suggest low, high, or unclear risk of bias and/or concerns regarding applicability are displayed in green, orange, and blue, respectively

Data synthesis and statistical analysis

Table 3 shows the diagnostic accuracy values of the included studies, grouped per index test.

Table 3 Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive and negative likelihood ratios (LR+, LR-), prevalence of pseudarthrosis, accuracy ((true positive + true negative) / (total)) and OR values with corresponding 95% confidence intervals for the seven index tests

Inclusion in meta-analysis

The studies discussing the index tests SPECT [24, 28] and planar scintigraphy [14, 23, 30] were considered for inclusion in a single subgroup meta-analysis, hereafter referred to as scintigraphy. McMaster et al. was not included because the time interval between fusion surgery and surgical exploration deviated too much from the other studies. The remaining four studies were pooled.

Six studies were considered for inclusion in meta-analysis for plain radiography [14, 15, 26, 27, 31, 32]. Fogel et al. was excluded since the low prevalence of pseudarthrosis made the study population incomparable to the other studies (see Table 3). The remaining five studies were considered comparable enough to be pooled. Two articles diagnosed pseudarthrosis per patient [14, 26], two per level [27, 31], and one per side [15]. We chose to pool these studies despite differences in analysis region since we were mainly interested in the correlation between findings on imaging and surgical exploration. Using the same rationale, no distinction was made between studies on posterolateral and interbody fusion.

Two articles were considered for FE radiography meta-analysis [14, 15]. Apart from differences in analysis regions, the study characteristics were considered comparable and the studies were therefore pooled.

Six articles were considered for inclusion in CT meta-analysis [14,15,16, 25, 32, 33]. The study of Brodsky et al. was excluded for lack of sagittal and coronal reconstructions, which are essential in the assessment of interbody bony fusion [14, 16, 33, 35]. Laasonen et al. and Larsen et al. were excluded because of slice thickness: thicknesses of 5 and 6 mm were used, respectively, while bony bridging should be assessed using thin-slice CT to be reliable [16, 32, 33, 35]. Fogel et al. was excluded for low prevalence of pseudarthrosis compared to the other studies. The posterolateral fusion patient group of Carreon et al. [16] and the interbody fusion patient group of Carreon et al. [33] were pooled for CT.

Figure 3 shows a forest plot of the studies selected for subgroup meta-analysis, with their respective weights and resulting pooled ORs. Index tests for which only one study was identified, i.e., US, polytomography, and 18F-fluoride PET/CT [15, 29, 34], could not undergo subgroup meta-analysis. These single studies were, however, evaluated on the same grounds and, if considered reliable, included in Table 4 to complement the meta-analysis results. This was only the case for the study on polytomography [15]. For the study on US [29], the authors considered that with the evaluation of ten patients only, US was not investigated thoroughly enough for pseudarthrosis detection. In the 18F-fluoride PET/CT study [34], the reference standard was either surgical exploration or clinical follow-up, based on the index test outcome. This introduced a bias in the patient population that underwent surgical exploration; only patients with a suspicion of pseudarthrosis on 18F-fluoride PET/CT were surgically explored and used to calculate diagnostic accuracy.

Fig. 3

Forest plot of the included studies in the meta-analysis per modality. The size of each square is proportional to the study’s weight

Table 4 Overview of ORs as determined from included studies

Discussion

This systematic review summarizes studies in literature that investigated the diagnostic accuracy of imaging modalities to detect pseudarthrosis after thoracolumbar spinal fusion with surgical exploration as the reference standard. Diagnostic accuracy values of individual studies were determined, and for studies of the same modality that were clinically comparable, a pooled OR was calculated.

Patients after spinal fusion can be monitored by several modalities. Plain radiographs attempt to reveal deficient morphology of the fusion mass as a sign of pseudarthrosis. However, plain radiographs are projections only [35, 36] whereas pseudarthrosis is a three-dimensional problem. The pooled OR of radiography was 7.07. In FE radiography, radiographs are made during flexion and extension of the spinal column to detect motion in the operated segment as a sign of pseudarthrosis. Cases exist where no signs of pseudarthrosis were found on plain radiography, CT, and MRI, but FE radiography detected the pseudarthrosis by unveiling motion between the segments [37]. On the other hand, absence of motion does not necessarily correspond with solid fusion and the presence of motion is not directly related to pseudarthrosis [12, 38,39,40]. Furthermore, no consensus exists on the threshold of allowable motion in a fused segment [40,41,42]. With a pooled OR of 4.00, FE radiography does not seem to outperform plain radiography. In polytomography, several radiographs along different sectional planes are taken. Going from a single projection in radiography to several planes in polytomography, the OR increased to 10.15. However, polytomography seems to have been superseded by CT and is currently not frequently used. CT offers three-dimensional osseous detail [33, 35]. After meta-analysis, CT was the modality with the highest OR in this review. Besides detection of bridging trabecular bone, CT is able to detect subsidence and lucency around fusion material as possible signs of pseudarthrosis [35]. On the downside, assessment can be complicated by artefacts when metallic cages and/or instrumentation are used [14, 32, 33, 35]. Technological improvements such as iterative reconstruction and dual-energy scanning are likely to improve accuracy [43]. Whether CT alone is sufficient for clinical decision-making is under debate. Choudhri et al. stated that multiple modalities should be considered for the noninvasive evaluation of symptomatic patients with suspected failure of spinal fusion [38]. US can demonstrate callus formation and bone healing [44, 45]. Although the first study assessing the role of US for pseudarthrosis detection in ten patients seemed promising in 1997 [29], it has been the only study since.

Pseudarthrosis diagnosis can also be based on abnormalities in bone metabolism. Studies on SPECT and planar scintigraphy were grouped together in meta-analysis since both modalities use 99mTc-labeled phosphonates as tracer. 99mTc-labeled phosphonates are adsorbed onto or into the crystalline structure of hydroxyapatite to mark bone remodeling. With a pooled OR of 2.91, scintigraphy amounted to the lowest OR value after subgroup meta-analyses. An analog to 99mTc-labeled phosphonates is 18F-fluoride. Both tracers have similar uptake mechanisms [46] but 18F-fluoride decays via positron emission and can therefore be imaged by PET. Compared to 99mTc SPECT, 18F-fluoride PET provides higher resolution, higher sensitivity, and better quantification capabilities [47]. PET combined with CT allows localization of abnormal uptake, which might enhance discriminative power [6]. Quon et al. evaluated PET/CT as index test for pseudarthrosis diagnosis [34]. The results seem promising but studies of higher methodological quality should be conducted to draw firmer conclusions on its value in pseudarthrosis diagnosis.

In the database search, one paper evaluating MRI [48] and one paper evaluating RSA as index test [49] were identified but not included. In MRI, bridging bone between endplates can be visualized [50] and changes in the vertebral body marrow signal as a sign of functional instability can be detected [48, 51]. On the downside, metal instrumentation complicates pseudarthrosis assessment in MRI. Length of follow-up was too short for the study of Lang et al. to be included. RSA is able to accurately quantify micromovements of vertebrae relative to each other, to evaluate lumbosacral stability [38, 42]. The study of Pape et al. could not be used to calculate the diagnostic accuracy of RSA for pseudarthrosis detection since all patients attained fusion.

A strength of the present review was that the patient populations of the included studies resemble patient populations that would undergo these tests in clinical practice to either confirm or exclude pseudarthrosis, since all suffered from persisting or recurrent pain after spinal fusion. The methodological choice to only include studies that compared an index modality to the gold standard of surgical exploration was a strength on the one hand, since it is the most valid way to assess the diagnostic accuracy of a modality [14]. On the other hand, it was a weakness, since it meant the exclusion of newer studies that evaluate state-of-the-art modalities. The study design of using surgical exploration as gold standard is no longer ethical or practical in clinical practice. As a result, the value of state-of-the-art modalities could not be discussed in this review and remains to be evaluated. Another weakness of the study was that studies in meta-analysis, although relatively comparable, did show differences in spinal fusion technique, types of cages and instrumentation, imaging characteristics, pseudarthrosis definition, experience of the observers, and patient characteristics. In particular, the time interval between spinal fusion and index test was highly variable between studies. Furthermore, the interpretation of index test results was incomplete in some studies. Imaging findings were reported but not always classified as either pseudarthrosis or fused. In these cases, the cut-off point was determined by the authors of this review, which is arbitrary, although not necessarily far from clinical practice. Studies also reported poorly on patient population inclusion criteria. Lack of information may have led to incorrect inclusion of studies in meta-analyses and weakens the findings of this review.

To conclude, with a pooled OR of 17.02, CT can be considered the most accurate non-invasive imaging modality for the detection of pseudarthrosis after spinal fusion among those evaluated in this review.