Introduction

Lumbar degenerative spondylolisthesis (DS) refers to the translation of one vertebra over the adjacent one due to degenerative processes, possibly leading to spinal canal stenosis. DS often triggers severe low back and leg pain [1] and affects 19 to 43% of the older adult population, particularly women [2]. After exhaustion of conservative approaches, two surgical interventions (decompression with or without additional fusion) are generally recommended for DS [3]. Responding to geographical variation in the use of these two interventions [4, 5], some cost-effectiveness analyses have emerged [6, 7], although with uncertain long-term results. Clinical guidelines and systematic reviews have also varied in their recommendations on the optimal intervention for DS [8,9,10]. Furthermore, the three highest-quality randomized controlled trials (RCTs) comparing fusion versus decompression alone for DS have presented methodological differences and inconclusive findings, limiting their direct comparability and application in real-world decision-making [11,12,13,14].

In the Swiss context, uncertainty remains on whether decompression with fusion (removal of portions of the dorsal bony and ligamentous structures plus instrumented fusion with pedicle screws and intervertebral cage implantation) results in superior outcomes compared to decompression alone. The relationship between surgical intervention and key healthcare resource utilization also remains unexplored. Physical therapy utilization and oral analgesic use contribute to the direct costs associated with decompression with and without fusion and have not yet been explored as standalone healthcare resource utilization outcomes.

Some prospective cohort studies have evaluated the comparative effectiveness of decompression with or without fusion for DS [15, 16]. Yet, no observational study has included an exploratory healthcare resource utilization analysis, nor used causal inference methods to assess the trustworthiness of observational comparative effectiveness estimates.

Using data from the Lumbar Stenosis Outcome Study (LSOS), our primary objective was to describe and emulate a target trial evaluating the comparative effectiveness of decompression with and without fusion surgery in DS patients at 3-year follow-up. We aimed to benchmark our primary observational comparative effectiveness estimates against those of an index trial at 2 years of follow-up, evaluate success of emulation, and extend the emulation follow-up to 3 years. Our secondary objective was to assess physical therapy utilization and oral analgesic use 3 years after index surgery.

Material and methods

Study design, target trial emulation, and benchmarking against index trial

We described and emulated a hypothetical, pragmatic target trial mimicking a state-of-the-art index RCT comparing decompression with or without fusion for DS: the Norwegian Degenerative Spondylolisthesis and Spinal Stenosis trial (NORDSTEN-DS) [13]. We used data from the Lumbar Stenosis Outcome Study (LSOS), a multicenter cohort study in 4 hospitals and 8 specialized clinical units in Zurich, Switzerland. Details on the LSOS are available elsewhere [17]. The target trial was specified in terms of eligibility criteria, treatment strategies, treatment assignment, outcomes, follow-up, causal estimand, and statistical analysis (sTable 1 in Supplemental File 1). Time zero was the time of meeting eligibility and being assigned to treatment. Primary outcome comparative effectiveness estimates from the emulation were benchmarked against those of the index RCT at 2 years. If comparative effectiveness estimates led to similar clinical decisions as the index RCT, the emulation would be extended to 3 years [18]. The study was preregistered in the Open Science Framework (OSF) [19]. Our study adhered to the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) recommendations [20], and was informed by applicable work-in-progress guidelines [21].

Eligibility criteria

A total of 1716 patients were screened for eligibility to participate in the LSOS from December 2010 to December 2015. To be eligible for LSOS, patients had to be 50 years or older, have unilateral or bilateral neurogenic claudication, have a life expectancy of more than one year, be able to provide informed consent, and be fluent in German [17]. Exclusion criteria for LSOS were evidence of tumor, fracture, infection, lumbar scoliosis of more than 15°, and peripheral artery occlusive disease. For this analysis, we included patients with lumbar spinal stenosis and additional spondylolisthesis, verified by magnetic resonance imaging (MRI) as a slippage of one vertebra over the adjacent one, who underwent decompression with or without fusion within 6 months after enrollment in the study. We restricted our study population to patients with complete primary outcome follow-up data for the primary analysis. LSOS followed the ethical principles of the Helsinki Declaration and all applicable laws and regulations. Institutional review board approval was received from the Cantonal Ethic Committee of Zurich (KEK-ZH-Nr: 2010-0395/0).

Treatment groups

Patients undergoing decompression alone received open bilateral decompression or unilateral laminotomy with bilateral decompression of the affected disc level. Those in the decompression and fusion group received additional implantation of pedicle screws with rods plus interbody fusion cages (sFigure 1). Only experienced orthopedic surgeons or neurosurgeons, each with more than 10 years of experience, delivered the procedures.

Follow-up and outcomes

The primary outcome was change in health-related quality of life at 3-year follow-up, measured by the EuroQol Health-Related Quality of Life 5-Dimension 3-Level questionnaire (EQ-5D-3L)—a well-established instrument comprising mobility, self-care, usual activities, pain/discomfort, and anxiety/depression dimensions. Health states from the EQ-5D-3L were transformed into a summary index score using country-specific value sets. German and French weights were used to calculate summary indices.

Secondary outcomes at 3-year follow-up included:

  • Change in Numeric Rating Scale (NRS) for back or leg pain intensity. NRS scores range from 0 to 10, with 10 indicating worst pain imaginable.

  • Change in Spinal Stenosis Measure (SSM) satisfaction subscale scores. SSM satisfaction subscale scores range from 1 to 4, with 4 indicating very dissatisfied.

  • Physical therapy utilization—a binary (yes/no) outcome of physical therapy utilization.

  • Oral analgesic use—a binary (yes/no) outcome of oral analgesic intake.

Exploratory adverse event outcomes included:

  • Complications: either intraoperative (e.g., bleeding, dural injury) or post-operative (e.g., revision due to wound infection, hematoma evacuation, wound revision, dural leakage revision).

  • Revision: due to restenosis, infection, or epidural hemorrhage; with additional fusion.

Statistical analysis

The primary analysis compared standardized mean differences in German and French EQ-5D-3L change scores among patients undergoing decompression with or without fusion at 3-year follow-up. Standardized mean differences in change scores and 95% confidence intervals (CIs) were estimated using weighted ordinary least squares models. The secondary analysis compared standardized mean differences in NRS and SSM satisfaction subscale change scores at 3 years. We used inverse probability weighting for confounding control and obtained balanced groups at baseline [22]. Individual observations were weighted by the inverse of their probability of receiving decompression plus fusion surgery given a set of confounders (i.e., the higher the probability, the less the weighting applied to an individual observation). The probability of receiving decompression plus fusion was obtained from a logistic regression model with fusion as dependent variable and the following baseline covariates: age, sex, body mass index (BMI), months since onset of current complaints, compromise of the foraminal zone (left and right), smoking status, educational level, back pain, gluteal pain, thigh pain, lower leg pain, presence of spinal instability (operationalized as facet degeneration with effusion of > 2 mm), spondylolisthesis severity (Meyerding grading system), number of vertebral levels operated, number of vertebral levels with spondylolisthesis (Meyerding I or more), diabetes, civil risk (operationalized as either living alone, or living in a nursing or residential home and being single, divorced, or widowed), anxiety (Hospital Anxiety and Depression Scale [HADS] anxiety subscale), depression (HADS Depression subscale), and EQ-5D-3L scores (German and French). For added modelling flexibility, age, BMI, anxiety, and depression were introduced in the model as beta-splines with three degrees of freedom. Extreme weights were trimmed at the 99th percentile according to best practice [22]. 
Confounder selection was guided by published literature, clinical expertise, and causal thinking. Two-sided p values < 0.05 indicated statistical significance.
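
The weighting steps described above can be illustrated with a minimal Python sketch (the study itself used R). It takes propensity scores as given rather than refitting the logistic model, and it assumes the conventional inverse probability of treatment weights (1/p for fusion patients, 1/(1 − p) for decompression-alone patients) and that trimming means capping weights at the 99th percentile; neither detail is confirmed by the text. All function names and toy values are hypothetical.

```python
def iptw_weights(treated, propensity):
    """Assumed convention: 1/p for treated, 1/(1 - p) for untreated patients."""
    return [1.0 / p if t else 1.0 / (1.0 - p) for t, p in zip(treated, propensity)]


def percentile(values, q):
    """Percentile with linear interpolation between order statistics (q in 0..100)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def cap_weights(weights, q=99.0):
    """Assumed reading of 'trimming': cap every weight at the q-th percentile."""
    cap = percentile(weights, q)
    return [min(w, cap) for w in weights]


def weighted_mean_difference(outcome, treated, weights):
    """Difference in weighted mean outcome, treated minus untreated."""
    def weighted_mean(flag):
        num = sum(w * y for y, t, w in zip(outcome, treated, weights) if t == flag)
        den = sum(w for t, w in zip(treated, weights) if t == flag)
        return num / den
    return weighted_mean(1) - weighted_mean(0)


# Toy data (hypothetical): 1 = decompression plus fusion, 0 = decompression alone.
treated = [1, 1, 0, 0]
propensity = [0.5, 0.25, 0.5, 0.2]      # assumed output of a propensity model
change_score = [0.2, 0.4, 0.1, 0.3]     # hypothetical EQ-5D-3L change scores

weights = cap_weights(iptw_weights(treated, propensity))
print(weighted_mean_difference(change_score, treated, weights))
```

A weighted difference in means corresponds to a weighted least squares model with treatment as the only covariate; standardization of the change scores and robust variance estimation are left out of this sketch.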

The healthcare resource utilization analysis compared odds of physical therapy utilization and oral analgesic use at 1-, 2-, and 3-year follow-up. Odds ratios and 95% CIs were estimated using weighted generalized linear models with a logit link.
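
As a hedged illustration of the odds ratio estimate, the Python sketch below computes an odds ratio from a weighted 2×2 table. This equals the exponentiated treatment coefficient of a weighted logit model only when treatment is the sole covariate, which is an assumption here; the text does not state the model specification. The function name and toy data are hypothetical.

```python
def weighted_odds_ratio(event, treated, weights):
    """Odds ratio (treated vs. untreated) from a weighted 2x2 table.

    Matches exp(beta) of a weighted logit model whose only covariate is
    treatment. Raises ZeroDivisionError if a required cell has zero weight.
    """
    cells = {(1, 1): 0.0, (1, 0): 0.0, (0, 1): 0.0, (0, 0): 0.0}
    for e, t, w in zip(event, treated, weights):
        cells[(t, e)] += w            # key is (treatment group, event status)
    odds_treated = cells[(1, 1)] / cells[(1, 0)]
    odds_untreated = cells[(0, 1)] / cells[(0, 0)]
    return odds_treated / odds_untreated


# Toy data (hypothetical): event = 1 if the patient used physical therapy.
event = [1, 1, 0, 1, 0, 0]
treated = [1, 1, 1, 0, 0, 0]
weights = [1.0, 1.0, 2.0, 1.0, 1.0, 1.0]
print(weighted_odds_ratio(event, treated, weights))
```

Confidence intervals under weighting would normally need a robust (sandwich) variance estimator, which this sketch omits.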

A standardized mean difference superiority threshold favoring decompression plus fusion of 0.20 in the German EQ-5D-3L summary index was prespecified for the primary outcome—informed by suggested minimal clinically important differences in LSOS [23].

Sensitivity analysis

Since we restricted our study population to patients with complete 3-year primary outcome follow-up data for the primary analysis, we assessed the robustness of our comparative effectiveness estimates by repeating the primary analysis including all patients who underwent surgery within 6 months after baseline, irrespective of their follow-up data completeness. Under the partial assumption of missingness at random, we performed multiple imputation by chained equations (MICE) with predictive mean matching, generating 20 imputed datasets for the 258 patients [24]. We analyzed each of the imputed datasets and combined comparative effectiveness estimates using Rubin's rules. All analyses were performed with R version 4.2.2 [25].
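
The pooling step can be sketched in Python as follows. The function applies Rubin's rules to per-imputation estimates and standard errors; the function name and input values are hypothetical and do not come from the study.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Combine per-imputation estimates with Rubin's rules.

    Returns the pooled estimate and its standard error, where
    total variance = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])   # average sampling variance
    between = variance(estimates)                   # sample variance across imputations
    total = within + (1.0 + 1.0 / m) * between
    return pooled, sqrt(total)


# Hypothetical estimates from three imputed datasets (the study used 20).
estimate, se = pool_rubin([0.1, 0.2, 0.3], [0.1, 0.1, 0.1])
print(estimate, se)
```

Confidence intervals based on the pooled standard error normally use a t reference distribution with adjusted degrees of freedom, which is not shown here.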

Power consideration

With a study population of 215 patients with complete follow-up data, and assuming a mean difference of 0.2 in EQ-5D-3L scores with a standard deviation of 0.3 [12], we expected to reach 99% power to reject the null hypothesis of no between-group difference at a 2-sided α level of 0.05. Under similar sample size, standard deviation, and α level considerations and aiming for 80% power, we anticipated being able to detect a between-group difference of 0.13 in EQ-5D-3L scores.
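
These figures can be approximately checked with a normal-approximation calculation for a two-sample comparison of means, sketched below in Python. The group sizes of 153 and 62 are taken from the Results and are an assumption about how the 215 patients were split in the power consideration; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(delta, sd, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test (far tail ignored)."""
    se = sd * sqrt(1.0 / n1 + 1.0 / n2)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return NormalDist().cdf(abs(delta) / se - z_crit)


def detectable_difference(sd, n1, n2, power=0.80, alpha=0.05):
    """Smallest mean difference detectable with the requested power."""
    se = sd * sqrt(1.0 / n1 + 1.0 / n2)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return (z_crit + NormalDist().inv_cdf(power)) * se


# Assumed allocation: 153 decompression alone, 62 decompression plus fusion.
print(power_two_sample(0.2, 0.3, 153, 62))
print(detectable_difference(0.3, 153, 62))
```

Under these assumptions the sketch gives power above 0.99 and a detectable difference that rounds to 0.13, in line with the values stated above.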

Results

Of the 258 patients with lumbar degenerative spondylolisthesis and spinal canal stenosis who met the eligibility criteria, 43 did not have 3-year primary outcome data (sTable 2). The remaining 215 had complete follow-up data (Fig. 1) and were included in the primary analysis. Of these, 153 (71%) underwent decompression alone (91 female [60%]; mean [standard deviation (SD)] age, 74.6 [7.8] years) and 62 (29%) underwent decompression plus fusion (31 female [50%]; mean [SD] age, 69.4 [7.2] years). After inverse probability weighting and trimming extreme weights, 137 patients remained in the decompression alone group (77 female [56%]; mean [SD] age, 73.9 [7.5] years) and 36 in the decompression plus fusion group (18 female [50%]; mean [SD] age, 70.1 [6.7] years). Table 1 presents patient characteristics before and after inverse probability weighting. Although weights created comparable groups for most baseline characteristics (sFigure 2), there were residual differences in age, severity of spondylolisthesis, and number of vertebral levels operated (see Table 1).

Fig. 1 Study flowchart

Table 1 Baseline patient characteristics before and after inverse probability of treatment weighting

Primary outcomes

Health-related quality of life EQ-5D-3L index scores were comparable at 3 years [β for standardized mean differences in change scores, German: 0.07, 95% CI (− 0.25 to 0.39), P = 0.66; β, French: 0.18, 95% CI (− 0.14 to 0.50), P = 0.26]. Table 2 shows the standardized mean difference in EQ-5D-3L change scores at 1-, 2-, and 3-year follow-up (see also sTable 3 and Fig. 2).

Table 2 Primary, secondary, and healthcare utilization outcomes

Fig. 2 Standardized mean differences in change (95% CI) for health-related quality of life scores at 1-, 2-, and 3-year follow-up

Secondary outcomes and selected health services assessment

There were no differences in back/leg pain intensity (NRS) [β: − 0.19, 95% CI (− 0.52 to 0.13); P = 0.25] or patient satisfaction change (SSM satisfaction subscale) [β: − 0.21, 95% CI (− 0.53 to 0.11); P = 0.21] 3 years after index operation. Patients undergoing decompression plus fusion had less pain on average [β: − 0.39, 95% CI (− 0.71 to − 0.06); P = 0.02] at the 1-year follow-up, but this difference did not persist at 3 years (Table 2, sTable 3).

Patients undergoing decompression plus fusion surgery had higher odds of seeking physical therapy at 3 years [Odds ratio (OR): 2.25, 95% CI (1.00 to 5.00); P < 0.05]. There was no difference in oral analgesic use at the 3-year follow-up [OR: 1.14, 95% CI (0.55 to 2.39); P = 0.72] (Table 2).
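
As a rough consistency check on such estimates, a P value can be back-calculated from an odds ratio and its 95% CI, assuming a Wald interval that is symmetric on the log scale. The Python sketch below applies this to the rounded values reported above; the Wald assumption and the function name are ours, not the authors'.

```python
from math import log
from statistics import NormalDist


def p_from_ratio_ci(ratio, lower, upper, level=0.95):
    """Back-calculate SE, z, and two-sided P from a ratio and its Wald CI."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    se = (log(upper) - log(lower)) / (2.0 * z_crit)   # SE on the log scale
    z = log(ratio) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return se, z, p


# Rounded values reported in the text for physical therapy and analgesic use.
print(p_from_ratio_ci(2.25, 1.00, 5.00))
print(p_from_ratio_ci(1.14, 0.55, 2.39))
```

Because the inputs are rounded to two decimals, the recovered P values only approximate the reported ones; the physical therapy result sits close to the 0.05 threshold.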

Exploratory adverse event outcomes were comparable between groups at the end of follow-up (sTable 4).

Target trial emulation, benchmarking, and extended follow-up

We benchmarked our observational comparative effectiveness estimates for the primary outcome from the target trial emulation against the modified-intention-to-treat estimates of the index RCT (NORDSTEN-DS) at 2 years (Fig. 3). The mean differences in change in EQ-5D-3L summary scores were comparable in direction and magnitude, leading to similar clinical decisions [NORDSTEN EQ-5D-3L, 2 years: 0.08, 95% CI (0.00 to 0.15); emulated target trial EQ-5D-3L, German, 2 years: 0.02, 95% CI (− 0.03 to 0.07); emulated target trial EQ-5D-3L, French, 2 years: 0.03, 95% CI (− 0.05 to 0.11)].

Fig. 3 Mean differences in change (95% CI) for health-related quality of life scores from the NORDSTEN-DS trial and a target trial emulation using Lumbar Stenosis Outcome Study (LSOS) observational data

We deemed our target trial emulation acceptable and extended our primary outcome comparative effectiveness estimates to 3 years. The mean difference in EQ-5D-3L summary index change score was 0.01, 95% CI (− 0.04 to 0.07) using German, and 0.05, 95% CI (− 0.04 to 0.13) using French weights at 3 years (Fig. 3).

Sensitivity analysis

After MICE for missing primary outcome follow-up data, the sensitivity analysis included all patients who underwent surgery within 6 months after baseline (n = 258). The EQ-5D-3L comparative effectiveness estimates were consistent in direction with those of the primary analysis (see sTable 5). Although the estimates increased in magnitude and were more precise, the results of the sensitivity analysis were consistent with those of the primary (complete case) analysis at 3-year follow-up.

Discussion

Statement of major findings

In our target trial emulation and index RCT benchmarking of DS patients, decompression plus fusion was not found to be superior to decompression alone in terms of health-related quality of life 3 years after surgery. Patients undergoing decompression plus fusion did not achieve better pain or satisfaction outcomes at 3 years. Patients undergoing decompression plus fusion were more likely to seek physical therapy 3 years after surgery, although the clinical relevance of this finding is uncertain. There were no between-group differences in oral analgesic use during the study period.

Similar observational studies and importance of findings

These results are in line with other observational studies [26,27,28,29], including a recent LSOS study in patients with degenerative lumbar stenosis, with and without DS [30]. Yet, our observational comparative effectiveness analysis applied inverse probability of treatment weighting to control for confounding and explicitly emulated and benchmarked against a pragmatic index target trial—two causal inference methods that increase the trustworthiness of the findings and prevent self-inflicted biases [31,32,33]. We reported all necessary information related to the emulation of the target trial [34], used health-related quality of life as the primary outcome, and complemented our comparative effectiveness analysis with an exploration of healthcare resource utilization.

Findings in the context of high-quality RCTs

Three high-quality RCTs have been published comparing decompression alone and decompression plus fusion for DS [11,12,13], two of which present similar findings to ours [12, 13]. Interestingly, the Spinal Laminectomy versus Instrumented Pedicle Screw (SLIP, United States) and the Swedish Spinal Stenosis Study (SSSS, Sweden)—both superiority RCTs—differed in their conclusions [11, 12]. These differences may be explained by trial design decisions. SLIP and SSSS were methodologically distinct, with different eligibility criteria and primary outcomes. The more recent NORDSTEN-DS—opting for a noninferiority design—concluded that decompression alone was noninferior to decompression plus fusion [13]. All 3 trials selected 2-year follow-up primary endpoints and enrolled younger populations compared to our study.

Clinical relevance

Although RCTs are the preferred study design for evaluating the effectiveness of interventions, their external validity can be limited [35]. Our study attempted to use causal inference methods for observational data to inform real-world DS decision-making. Our findings provide generalizable comparative effectiveness estimates, complementing and extending the index RCT inferences in a study population of older DS patients and to a 3-year post-intervention time point. We also provide exploratory evidence of the long-term utilization of physical therapy services and oral analgesic use after decompression with or without additional fusion. The exploratory healthcare resource assessment findings should be interpreted with caution. Yet, physical therapy and oral analgesic use could be reasonable proxies of other services.

Implications for surgical practice and suggestions for future research

Our comparative effectiveness findings do not support the more invasive and expensive decompression plus fusion surgery as a first-line surgical intervention for DS patients with concomitant lumbar spinal stenosis. Since the decision to perform decompression plus fusion may be influenced by clinical, radiographic, and surgeon characteristics [36,37,38], future classification systems of instability in DS patients and validated operational definitions of spinal instability are needed. Future research should explore treatment effect modifiers and account for revision rate differences between the two interventions.

Novel surgical approaches such as vertebropexy—a ligamentous semi-rigid alternative to instrumented fusion—are also underway with interesting preclinical results that raise potentially testable hypotheses [39, 40]. These novel interventions require high-quality observational and interventional studies to assess their benefits and harms.

Limitations

Our study has limitations. First, our findings are vulnerable to unmeasured and residual confounding. Despite using inverse probability of treatment weighting, we were unable to achieve balance in three baseline characteristics: age, severity of spondylolisthesis, and number of vertebral levels operated. Lack of treatment variation among patients with similar characteristics and our small sample size contributed to these residual differences. Second, we accounted for the influence of extreme weights and trimmed at the 99th percentile. As a result, our study population was considerably smaller after inverse probability weighting. Third, we used a radiographic operational definition of spinal instability (facet joint degeneration with more than two millimeters of effusion [41]) since dynamic radiographs were not implemented for inclusion in the LSOS cohort. Fourth, our primary analysis was restricted to cases with complete 3-year follow-up primary outcome data. To assess selection bias from the complete case analysis, we performed a sensitivity analysis including all eligible patients undergoing surgery within 6 months after baseline, with and without 3-year primary outcome follow-up data.

Conclusions

In our target trial emulation of patients with lumbar DS and spinal canal stenosis, decompression with additional fusion did not result in superior health-related quality of life, pain intensity, or satisfaction outcomes compared to decompression alone at 3 years. Patients undergoing decompression plus fusion for DS had a higher likelihood of seeking physical therapy 3 years after surgery, although the clinical relevance of this finding is uncertain. Although results may be susceptible to unmeasured and residual confounding, observational data provided acceptable comparative effectiveness estimates beyond those addressed by an index RCT.