Introduction

The efficacy of a new surgical procedure should be tested in a randomized controlled trial (RCT) if possible, or at least against a control group. Depending on the research question, the follow-up period may vary from months to decades. In effectiveness studies, data from a high-quality registry offer a unique opportunity to study, on a large scale and in clinical practice, the outcome of interventions whose efficacy has previously been documented. To draw conclusions about the outcome of surgery for degenerative spine conditions, most scientific journals and authors consider a follow-up period of at least two years to be necessary. Recent studies indicate, however, that a shorter evaluation time may be sufficient, since the improvement curve for patient-reported outcome measures (PROMs) has been reported to level out and patients reach a threshold of change somewhere between 3 and 12 months after surgery [1, 2].

The Swedish Spine Register, Swespine, monitors the quality of lumbar spine surgery by sending follow-up questionnaires at 1, 2, 5 and 10 years post-operatively. It offers a unique possibility to evaluate any differences in outcome between one- and two-year follow-up in a large real-life database [3]. The response rates are approximately 75% at the one-year follow-up (FU1) and 65% at the two-year follow-up (FU2).

Our aim was to study the outcome of lumbar surgery, measured with PROMs, at one and two years after the procedure, with the specific question: are there clinically important differences in outcome between the one- and two-year follow-up?

Patients and methods

Characteristics of the population

Analyses were performed on prospectively collected data from 98,732 degenerative lumbar spine surgical procedures registered in the Swespine register [4] between 1998 and 2017. To facilitate comparisons with previous trials and to find potential sub-population differences, the population was divided into three diagnostic groups: Lumbar Disc Herniation (LDH), Lumbar Spinal Stenosis (LSS) and Degenerative Disc Disease (DDD).

The LDH group consisted of patients with disc herniation, most often treated with discectomy alone but in some cases supplemented with fusion. The LSS group consisted of an older and more heterogeneous population with greater degenerative changes causing spinal stenosis. The number of levels, the severity of the degeneration and the location of the stenosis (in the central canal, in the lateral recess or in the neural foramen) give rise to a variety of treatment options, from minimally invasive nerve root decompression in which midline structures are preserved to multilevel decompression and fusion. The DDD group comprised patients with severe chronic low back pain as the main symptom and fusion or total disc replacement as the single treatment.

Inclusion and exclusion criteria are shown in the study flowchart in Fig. 1. Details of the registry have been previously described [5].

Fig. 1

Study flowchart. LDH Lumbar Disc Herniation, LSS Lumbar Spinal Stenosis, DDD Degenerative Disc Disorders, FU1/2 follow-up at 1 and 2 years post-operatively

Outcome variables (patient-reported outcome measures—PROMs)

Back pain (VASBACK) and leg pain (VASLEG) were measured on a Visual Analogue Scale from 0 (no pain) to 100 (worst imaginable pain) [6]. The disease-specific Oswestry Disability Index (ODI) [7] aims to capture important aspects of low back pain that influence function in everyday activities, with an index score ranging from 0 (best) to 100 (worst). The quality-of-life measure EuroQoL-5-Dimensions (EQ-5D) [8] is translated into a preference-based single index score on a scale from 0.000 (death) to 1.000 (perfect health). The single-item transition questions, used at follow-up, were Satisfaction (‘How would you describe your satisfaction with the surgical outcome?’: 1 satisfied, 2 uncertain, 3 dissatisfied) and Global Assessment for back pain (GABACK) and leg pain (GALEG) (‘How is your back/leg pain today as compared to before the surgery?’: 0 no back/leg pain before the surgery, 1 completely pain free, 2 much better, 3 somewhat better, 4 unchanged, 5 worse) [9].

Study design

Our choice of an equivalence study design was based on the expectation that the result after a follow-up period of one year is not different from the result after a follow-up of two years by more than an acceptable amount—given that the endpoint variable is a PROM. Therefore, confirmation of similarity in this trial involved the pre-specification of the clinically relevant minimal important change (MIC) of improvement for each PROM.

Statistical methods

Descriptive data

Descriptive data are presented as means ± SD or numbers (%).

Minimal important change (MIC)

The MIC represents the smallest amount of change in an outcome that is considered important to the patient. It is expressed in the units used in the measurement. MIC values were calculated for each PROM to set appropriate margins of clinical relevance in the analyses of possible differences between outcomes at FU1 and FU2. In a ROC curve analysis, the MIC corresponds to the optimal ROC cut-off point, which is the value for which the sum of the percentages of false positive and false negative classifications is the smallest [10]. As reference criteria (anchors), the transition questions Global Assessment for back and leg pain (GABACK/LEG) were used. GABACK/LEG was dichotomized into ‘success’, corresponding to a considerable improvement (i.e. those who responded ‘completely pain free’ or ‘much better’), and ‘not success’ (i.e. those who responded ‘somewhat better’, ‘unchanged’ or ‘worse’). The choice of an anchor-based approach was based on the recommendation by the COSMIN expert panel [11].
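As an illustration of the ROC-based definition above, the following minimal Python sketch searches for the cut-off on a change score that minimizes the sum of the false-negative and false-positive proportions against a dichotomized anchor. The function name `roc_mic`, the sign convention (a larger change means a greater improvement, e.g. baseline minus follow-up for ODI and VAS) and the toy data are assumptions made for illustration only; the MIC values in this study were computed in SAS JMP, not with this code.

```python
from typing import Optional, Sequence, Tuple


def roc_mic(changes: Sequence[float], success: Sequence[bool]) -> Tuple[float, float]:
    """Return (cutoff, error_sum) for an anchor-based MIC estimate.

    changes: improvement in a PROM, oriented so that larger = better.
    success: dichotomized anchor (True = 'completely pain free' or 'much better').
    A patient is classified as improved when change >= cutoff.
    """
    pos = [c for c, s in zip(changes, success) if s]
    neg = [c for c, s in zip(changes, success) if not s]
    if not pos or not neg:
        raise ValueError("both anchor groups must be non-empty")

    best_cut: Optional[float] = None
    best_err = float("inf")
    # Every observed change value is a candidate cut-off.
    for cut in sorted(set(changes)):
        fn_rate = sum(c < cut for c in pos) / len(pos)   # 'success' patients missed
        fp_rate = sum(c >= cut for c in neg) / len(neg)  # 'not success' patients flagged
        err = fn_rate + fp_rate
        if err < best_err:  # on ties, the smallest cut-off is kept
            best_cut, best_err = cut, err
    return best_cut, best_err


# Hypothetical toy data: ODI improvement and anchor response for ten patients.
toy_changes = [30, 25, 22, 18, 15, 12, 10, 8, 5, 0]
toy_success = [True, True, True, True, False, True, False, False, False, False]
mic, err = roc_mic(toy_changes, toy_success)
print(mic, err)
```

Minimizing the sum of the two error proportions is equivalent to maximizing sensitivity plus specificity. When several cut-offs tie, this sketch keeps the smallest one, which is an arbitrary choice.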

Clinically relevant differences between FU1 and FU2

For ODI, VASBACK/LEG and EQ-5D, differences on a group level were explored by calculating mean differences between baseline and the one-year follow-up and between the first and the second year. p values were not presented, since even very small differences tend to reach statistical significance in large study populations. Instead, the differences were related to the MIC value as well as to a threshold of treatment success at one year post-operatively for each PROM. Thresholds of treatment success are an alternative to the MIC. The statistical method was the same as the one used for computing the MIC values described above, but final scores were used instead of score changes. The proportions of patients reaching these cut-off points at FU1 and at FU2 were calculated.
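The group-level comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's SPSS analysis: the function name `summarize_followups`, the MIC value of 15 and the five-patient data set are hypothetical, and the scores are assumed to be oriented so that lower is better (as for ODI and VAS).

```python
from statistics import mean
from typing import Dict, Sequence


def summarize_followups(
    baseline: Sequence[float],
    fu1: Sequence[float],
    fu2: Sequence[float],
    mic: float,
) -> Dict[str, float]:
    """Summarize paired PROM scores (lower = better) at FU1 and FU2 against a MIC."""
    n = len(baseline)
    if not (n == len(fu1) == len(fu2)) or n == 0:
        raise ValueError("paired, non-empty data required")

    change_fu1 = [b - f for b, f in zip(baseline, fu1)]  # improvement at one year
    change_fu2 = [b - f for b, f in zip(baseline, fu2)]  # improvement at two years
    drift = mean([s2 - s1 for s1, s2 in zip(fu1, fu2)])  # mean FU2 minus FU1

    return {
        "mean_change_fu1": mean(change_fu1),
        "mean_fu1_to_fu2": drift,
        "prop_mic_fu1": sum(c >= mic for c in change_fu1) / n,
        "prop_mic_fu2": sum(c >= mic for c in change_fu2) / n,
        # True when the FU1-to-FU2 drift is smaller than the MIC margin.
        "within_margin": abs(drift) < mic,
    }


# Hypothetical toy data for five patients and a hypothetical MIC of 15 points.
toy = summarize_followups(
    baseline=[50, 40, 60, 44, 36],
    fu1=[20, 30, 25, 40, 10],
    fu2=[22, 28, 30, 41, 12],
    mic=15,
)
print(toy)
```

The same function could be applied to thresholds of treatment success by counting final scores at or below a cut-off instead of change scores at or above the MIC.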

For the retrospective single-item questions GABACK/LEG and Satisfaction, the test for two paired proportions (McNemar’s test) [12] was used to explore to what extent patients moved from the successful outcome group to the unfavourable outcome group, or vice versa, between 1 and 2 years of follow-up. GABACK/LEG was dichotomized as described above. Satisfaction was dichotomized into ‘success’ (i.e. ‘satisfied’) and ‘not success’ (i.e. ‘uncertain’ or ‘dissatisfied’).
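For readers unfamiliar with the test for two paired proportions, a minimal sketch of McNemar's chi-square test is given below. Only the discordant pairs enter the statistic: `b` patients who moved from ‘success’ at FU1 to ‘not success’ at FU2, and `c` patients who moved in the opposite direction. The function name, the use of the continuity-corrected chi-square form and the counts are assumptions for illustration; the study's analyses were run in SPSS, and the exact variant used there is not stated in the text.

```python
import math
from typing import Tuple


def mcnemar(b: int, c: int, correction: bool = True) -> Tuple[float, float]:
    """McNemar's chi-square test on discordant pair counts.

    b: pairs that were 'success' at FU1 and 'not success' at FU2.
    c: pairs that were 'not success' at FU1 and 'success' at FU2.
    Returns (chi-square statistic with 1 degree of freedom, two-sided p value).
    """
    n = b + c
    if b < 0 or c < 0 or n == 0:
        raise ValueError("need non-negative counts and at least one discordant pair")
    diff = abs(b - c)
    if correction:
        diff = max(diff - 1, 0)  # continuity correction
    stat = diff ** 2 / n
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: the same 6% vs 5.5% discordance in two cohort sizes.
small = mcnemar(60, 55)      # as if n = 1,000 patients
large = mcnemar(6000, 5500)  # as if n = 100,000 patients
print(small, large)
```

With identical percentages, the small hypothetical cohort gives a non-significant result while the large one gives a very small p value, which illustrates why statistical significance alone says little about clinical relevance in registry-sized populations.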

SPSS version 24 was used for the statistical analyses, except for the computation of the MIC values, for which SAS JMP was used.

Ethical considerations

Informed consent was obtained from all participants in Swespine. This research project was approved by the regional ethical review board.

Results

ODI, VASBACK/LEG and EQ-5D

Baseline PROM data

Demographic and baseline PROM data are presented in Table 1 and follow-up PROM data in Table 2. The LSS group had the lowest mean score in ODI, 43 (± 16). The DDD group had the highest score in VASBACK, 67 (± 22). The LDH group had the highest VASLEG score, 67 (± 26), as well as the lowest quality of life as measured by the EQ-5D, 0.26 (± 0.34).

Table 1 Baseline data
Table 2 Mean values at the two follow-up occasions and their mean differences in patient-reported outcome measures stratified by three different diagnostic groups. Paired data

Mean differences between baseline and FU1 and between FU1 and FU2

The mean differences between baseline and FU1 and between FU1 and FU2 are visualized in Fig. 2. In the LDH group, the mean differences between baseline and FU1 were −29 (± 21) for ODI, −24 (± 32) for VASBACK, −46 (± 35) for VASLEG and 0.47 (± 0.40) for EQ-5D. In the LSS group, the corresponding numbers were −17 (± 18) for ODI, −25 (± 32) for VASBACK, −31 (± 35) for VASLEG and 0.29 (± 0.35) for EQ-5D, and in the DDD group, −20 (± 18) for ODI, −36 (± 31) for VASBACK, −23 (± 30) for VASLEG and 0.32 (± 0.37) for EQ-5D. The mean differences between FU1 and FU2 (Table 2) were negligible in comparison with the changes seen between baseline and FU1.

Fig. 2

Line charts of pre-operative and post-operative mean values of the VAS back and leg pain, ODI and EQ-5D index highlighting a large change during the first post-operative year and only small variations thereafter. The LDH group corresponds to the continuous lines, the LSS group to the dotted lines, and the DDD group to the broken lines. The error bars are 95% CI. LDH Lumbar Disc Herniation, LSS Lumbar Spinal Stenosis, DDD Degenerative Disc Disorders, VASBACK/LEG Visual Analogue Scale for back and leg pain, ODI Oswestry Disability Index, EQ-5D EuroQol-5-Dimensions index score

In Fig. 3a–d, score changes on the individual level are illustrated by histograms. These figures highlight the large spread in score change occurring up to one year post-op in contrast to the harmonious and narrow bell-shaped histograms depicting the changes between FU1 and FU2, where the median equals ‘0’.

Fig. 3

Histograms visualizing the distribution of changes in scores between baseline and FU1 and between FU1 and FU2 for ODI, VAS back and leg pain and EQ-5D in the three diagnostic groups: LDH, LSS and DDD. The number of patients is presented in Tables 3 and 4. LDH Lumbar Disc Herniation, LSS Lumbar Spinal Stenosis, DDD Degenerative Disc Disorders, VASBACK/LEG Visual Analogue Scale for back and leg pain, ODI Oswestry Disability Index, EQ-5D EuroQol-5-Dimensions index score

Proportions reaching MIC and thresholds of treatment success at FU1 and FU2

To detect clinically relevant differences in outcome between FU1 and FU2, the MIC, corresponding to a considerable improvement, was computed for each PROM and diagnostic group (Table 3). In the LDH and DDD groups, the proportion reaching MIC at FU2 was at most 2% smaller than at FU1 for any of the PROMs. In the LSS group, 2–4% fewer patients reached MIC at FU2 than at FU1. The percentages reaching thresholds of treatment success at the two follow-up occasions (Table 4) displayed a similar pattern. The largest deterioration (5%) was seen in VASBACK in the LSS group.

Table 3 MIC for treatment success based on PROM score changes between baseline and FU1 and FU2, respectively. The analyses are based on paired data
Table 4 Thresholds of treatment success based on absolute PROM scores at one and two years of follow-up. Paired data

GABACK/LEG and Satisfaction

Test of two paired proportions of Satisfaction and GABACK/LEG

As shown in Table 5, 85–89% of the patients, depending on the diagnosis, made the same self-assessment at FU1 as at FU2. In the LDH group, 6% switched from being satisfied (‘success’) at FU1 to being uncertain or dissatisfied (‘not success’) at FU2, and 5.5% shifted from being uncertain or dissatisfied to being satisfied. The corresponding numbers in the LSS and DDD groups were somewhat higher but showed a similar pattern. Despite the relatively small differences, statistical significance was reached in the latter two groups, implying a deterioration in outcome between FU1 and FU2.

Regarding patient self-assessment of changes in back pain and leg pain (Table 6), 82–85% of patients displayed the same outcome at the two follow-up occasions. Apart from responses on GABACK in the DDD group, there was a statistically significant deterioration in outcome between FU1 and FU2, even though the difference between patients shifting from a better outcome to a worse outcome was no more than 2%.

Table 5 Differences in patient responses on Satisfaction between FU1 and FU2
Table 6 Differences in patient responses on GABACK/LEG between FU1 and FU2

Discussion

This study confirms that potential improvements of clinical importance occur during the first year after lumbar surgery, irrespective of diagnosis and of which PROM is used. The mean differences between FU1 and FU2 in ODI, VASBACK/LEG and EQ-5D, as well as the proportions reaching MIC, indicated a minor deterioration in outcome between the first and second year of follow-up in all three diagnostic groups. However, a deterioration of this size could be expected in populations with degenerative conditions. Our data confirm the results of Adogwa et al. [2], who concluded that ODI and VASBACK/LEG obtained at 12 months adequately predict the outcomes at 24 months in patients treated with lumbar nerve root decompression and fusion. Glassman et al. [13] found no differences between FU1 and FU2 in the ODI and the NRSBACK/LEG in patients with adult spinal deformity. In a recent study, Kim et al. concluded that 1-year outcomes can reliably predict 2-year outcomes for discectomy, but that this was not clear for laminectomy or fusion procedures [14]. The discordant results may be caused by the use of methods different from those in our study. In the report by Kim et al., the conclusion was based on a model where a change in ODI score of just one point could turn a meaningful outcome into a non-meaningful outcome, which was also pointed out by the authors. Small differences in a change score between two timepoints are likely to be seen and are possibly caused by normal fluctuations.

The proportions shifting from ‘success’ on GALEG and GABACK at FU1 to ‘not success’ at FU2 were 8.5% and 8% in the LDH group, 11% and 10% in the LSS group and 10% and 8% in the DDD group. Some of these individuals may represent well-known causes of reoperation such as recurrent disc herniation, incomplete decompression, adjacent level stenosis and pseudarthrosis. The movements from ‘success’ to ‘not success’ and vice versa between the two follow-ups may also be manifestations of normal symptom fluctuations, recall bias, response shift or other measurement errors in PROMs.

It is also worth noting that the ‘not success’ category does not consist entirely of patients experiencing a deterioration. This category also includes individuals responding ‘somewhat better’ and ‘unchanged’. Indeed, when investigating the number of patients shifting from ‘pain free’ or ‘much better’ to ‘worse’, it turns out that no more than 2.5% of the whole study population shifted from ‘success’ at FU1 to ‘worse’ at FU2 on GALEG, the corresponding percentage for GABACK being 1.8%. Although there was a statistically significant deterioration in outcome between FU1 and FU2, its relevance in clinical practice can certainly be questioned.

These findings suggest that a follow-up at both one and two years post-surgery in effectiveness studies is unnecessary.

The difficulties in reaching a consensus regarding the definition of a minimal important change relevant to patients (MIC) and the most appropriate method to compute it [15, 16] have led to the search for new strategies to define a clinically relevant outcome. One such alternative is the definition of cut-off points of treatment success based on absolute scores as opposed to changes in scores [17]. Such a threshold is the value above (or below, depending on the scale of the PROM) which a patient considers the magnitude of his or her encumbrance to be acceptable following the surgery. A final score may be less affected by response shift and recall bias than a change score, although that possible advantage is yet to be proven. Thresholds of success based on absolute post-operative scores at FU1 and FU2, respectively, were estimated as previously described by Tubach et al. [18] and recently by van Hooff et al. in a lumbar pain population [17]. No relevant differences between FU1 and FU2 could be found.

The proportion of patients reaching the thresholds of success was rather low in the LSS group, approximately 47–60% depending on the PROM. The cause might be the stringent definition of a successful outcome, which leaves those who responded ‘somewhat better’ or ‘unchanged’ as unsuccessful. Other contributing factors may be that (i) the LSS population is relatively older, with a larger prevalence of comorbidity and probably of other painful degenerative conditions such as hip arthritis, which may confound the outcome of the surgery if measured by PROMs, (ii) the indication for surgery is too wide and (iii) the degeneration of the spine is often a multisegmental process that leaves room for future pain and disability. In this study, we chose not to make a distinction between LSS patients undergoing decompression surgery only and those treated with decompression and fusion surgery. The decision was based on the studies by Försth et al., who concluded that the two groups had similar clinical outcomes at 2 years [19, 20].

There appears to be no need for a 2-year follow-up of PROMs in this population, and resources could be better spent on increasing the one-year response rates in registries, thereby improving the quality of the data and reducing the risk of assessment bias. Efforts could also be made to define certain groups of patients or procedures for which a longer follow-up period, such as 2 years, or a shorter one may be needed. The value of PROM assessment based on Swespine data collected at 5 and 10 years post-surgery is unclear and is yet to be studied. The more time has passed since the operation, the higher the risk of other health-related events complicating the interpretation of the PROMs. More ‘objective’ endpoints, such as reoperation rates, cause of reoperation, time from index surgery to a new surgical event, or implant survival, might be of greater importance than PROMs in very long-term assessments.

Limitations

Although the strength of this study is the large real-life database, a selection bias may be present because of the proportion of non-respondents. In this study, the response rates were 75% at FU1 and 65% at FU2, and adjustments for risk factors associated with responding were not made. Solberg and colleagues concluded, however, that there were no differences in outcome between respondents and non-respondents in a population with degenerative lumbar disorders retrieved from the Norwegian spine registry and that the non-respondents could be treated as missing at random [21]. The same conclusion was later drawn in a similar study based on data from the Danish spine registry [22]. The high similarity between these countries should make the results applicable to the current study. The EQ-5D index is a measure designed for cost-effectiveness analyses and not for a similarity study such as this one. Therefore, the results involving EQ-5D should be interpreted with caution. As in all studies where means are calculated and compared, longitudinally or cross-sectionally, there is a risk that worsening and improving patients cancel each other out to some extent. As illustrated in Fig. 3, this risk appears to be low. This study investigated whether there are any clinically important differences in PROMs between 1 and 2 years of follow-up. Possible differences in outcome between 1 year and 5 or 10 years remain to be shown.

Conclusion

All PROMs used in Swespine in follow-up of degenerative lumbar spine surgery demonstrate minor differences between one and two years that can be expected in populations symptomatically treated for degenerative conditions. This suggests that when a PROM is the outcome variable, a one-year follow-up is sufficient in clinical effectiveness studies.