Follow-up of degenerative lumbar spine surgery—PROMs stabilize after 1 year: an equivalence study based on Swespine data

To evaluate the outcome of degenerative lumbar spine surgery in a credible way, patient-reported outcome measures (PROMs) should be assessed after an adequate follow-up period. Most authors and journals consider a follow-up period of less than two years to be too short. The purpose of this study was to explore the possibility of restricting follow-up to one year. Adult patients operated on between 1998 and 2017 were retrieved from Swespine (Lumbar Disc Herniation n = 31,314, Lumbar Spinal Stenosis n = 53,043 and Degenerative Disc Disease n = 14,375). The proportion reaching the minimal important change (MIC) in the Visual Analogue Scale for pain (VAS BACK/LEG), the Oswestry Disability Index (ODI) and the quality-of-life measure EQ-5D INDEX at 1 and 2 years, respectively, was calculated. The single-item questions Global Assessment (GA BACK/LEG) and Satisfaction were analysed by the McNemar test. Threshold values for a successful outcome based on the final scores of each PROM at 1 and 2 years post-surgery were also defined. For all three diagnostic groups, the differences in proportions reaching MIC of each PROM at 1 and 2 years were below 2%. Global Assessment and Satisfaction with outcome at 1 year were maintained at 2 years. There were no important differences in threshold values of treatment success based on final scores. No clinically important changes in PROMs appeared between 1 and 2 years after surgery for degenerative lumbar conditions, demonstrating that a follow-up period of 1 year as opposed to 2 years is sufficient in effectiveness studies if PROMs are to be used as outcome variables.


Introduction
The efficacy of a new surgical procedure should be tested in an RCT if possible, or at least against a control group. Depending on the research question, the follow-up period may vary from months to decades. In effectiveness studies, data from a high-quality registry offer a unique opportunity to study, on a large scale and in clinical practice, the outcome of interventions whose efficacy has previously been documented. To draw conclusions about the outcome of surgery for degenerative spine conditions, most scientific journals and authors consider a follow-up period of at least two years to be necessary. Recent studies indicate, however, that a shorter evaluation time may be sufficient, since the improvement curve for patient-reported outcome measures (PROMs) has been reported to level out and patients reach a threshold of change somewhere between 3 and 12 months after surgery [1,2].
The Swedish Spine Register, Swespine, monitors the quality of lumbar spine surgery by sending follow-up questionnaires at 1, 2, 5 and 10 years post-operatively. It offers a unique possibility to evaluate any differences in outcome between one- and two-year follow-up in a large real-life database [3]. The response rates are approximately 75% at the first follow-up, one year post-op (FU1), and 65% at the second follow-up, two years post-op (FU2).
Our aim was to study the outcome of lumbar surgery, measured with PROMs, at one and two years after the procedure, with the specific question: are there clinically important differences in outcome between one- and two-year follow-up?

Characteristics of the population
Analyses were performed on prospectively collected data from 98,732 degenerative lumbar spine surgical procedures registered in the Swespine register [4] between 1998 and 2017. To facilitate comparisons with previous trials and to find potential sub-population differences, the population was divided into three diagnostic groups: Lumbar Disc Herniation (LDH), Lumbar Spinal Stenosis (LSS) and Degenerative Disc Disease (DDD).
The LDH group consisted of patients with disc herniation, most often treated with discectomy alone but in some cases supplemented with a fusion. The LSS group consisted of an older and more heterogeneous population with greater degenerative changes causing spinal stenosis. The number of levels, the severity of the degeneration and the location of the stenosis (in the central canal, in the lateral recess or in the neural foramen) generate a variety of treatment options, from minimally invasive nerve root decompression where midline structures are preserved to multilevel decompression and fusion. The DDD group comprised patients with severe chronic low back pain as the main symptom and fusion or total disc replacement as the single treatment.
Inclusion and exclusion criteria are shown in the study flowchart in Fig. 1. Details of the registry have been previously described [5].

Outcome variables (patient-reported outcome measures, PROMs)
The Visual Analogue Scale for back (VAS BACK) and leg (VAS LEG) pain is measured on a scale from 0 (no pain) to 100 (worst imaginable pain) [6]. The disease-specific Oswestry Disability Index (ODI) [7] aims to capture important aspects of low back pain influencing function in everyday life activities, with the index score ranging from 0 (best) to 100 (worst). The quality-of-life measure EuroQoL-5-Dimensions (EQ-5D) [8] is translated into a preference-based single index score on a scale from 0.000 (death) to 1.000 (perfect health). The single-item transition questions Satisfaction ('How would you describe your satisfaction with the surgical outcome?' 1/Satisfied, 2/Uncertain, 3/Dissatisfied) and Global Assessment for back pain (GA BACK) and leg pain (GA LEG) ('How is your back/leg pain today as compared to before the surgery?' where 0 represents no back/leg pain before the surgery, 1 completely pain free, 2 much better, 3 somewhat better, 4 unchanged, 5 worse) [9] are used at follow-up.

Study design
Our choice of an equivalence study design was based on the expectation that the result after a follow-up period of one year does not differ from the result after a follow-up of two years by more than an acceptable amount, given that the endpoint variable is a PROM. Therefore, confirmation of similarity in this trial involved the pre-specification of the clinically relevant minimal important change (MIC) of improvement for each PROM.
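The equivalence reasoning can be illustrated with a minimal sketch, which is not the analysis performed in this study. It assumes paired scores at FU1 and FU2, a normal-approximation 95% confidence interval for the mean paired difference, and a margin playing the role of the MIC. The function name, the scores and the margin of 10 points are all hypothetical.

```python
import math
import statistics

def within_equivalence_margin(fu1, fu2, margin, z=1.96):
    """Return (mean_diff, ci_low, ci_high, equivalent) for paired scores.

    The two time points are called equivalent here when the whole
    confidence interval of the mean paired difference lies inside
    (-margin, +margin). z=1.96 gives an approximate 95% interval.
    """
    diffs = [b - a for a, b in zip(fu1, fu2)]
    mean_diff = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    ci_low, ci_high = mean_diff - z * se, mean_diff + z * se
    equivalent = -margin < ci_low and ci_high < margin
    return mean_diff, ci_low, ci_high, equivalent

# Hypothetical ODI scores for six patients (not Swespine data).
odi_fu1 = [20, 34, 12, 40, 26, 18]
odi_fu2 = [22, 32, 14, 41, 25, 20]

# margin=10 is an illustrative stand-in for a PROM-specific MIC.
result = within_equivalence_margin(odi_fu1, odi_fu2, margin=10)
print(result)
```

With a wide margin the interval falls inside it, whereas a very narrow margin (for example 1 point) would not be satisfied by the same data, which shows why the margin has to be pre-specified on clinical grounds.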

Descriptive data
Descriptive data are presented as means ± SD or numbers (%).

Minimal important change (MIC)
The MIC represents the smallest amount of change in an outcome that is considered important to the patient. It is expressed in the units used in the measurement. MIC values were calculated for each PROM to set appropriate margins of clinical relevance in the analyses of possible differences between outcomes at FU1 and FU2. In a ROC curve analysis, the MIC corresponds to the optimal ROC cut-off point, which is the value for which the sum of the percentages of false positive and false negative classifications is the smallest [10]. As reference criteria (anchors), the transition questions Global Assessment for back or leg pain (GA BACK/LEG) were used. GA BACK/LEG was dichotomized into 'success', corresponding to a considerable improvement (i.e. those who responded 'completely pain free' or 'much better'), and 'not success' (i.e. those who responded 'somewhat better', 'unchanged' or 'worse'). An anchor-based approach was chosen on the recommendation of the COSMIN expert panel [11].
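The cut-off rule described above can be sketched as follows. This is an illustration under stated assumptions, not the SAS JMP procedure used in the study: change scores are expressed so that larger values mean more improvement, both anchor categories are present, and the function name and the data are hypothetical.

```python
def mic_from_roc(changes, success):
    """Return (cut_off, error) minimising FP rate + FN rate.

    changes: improvement scores (larger = more improvement)
    success: anchor labels, True for 'completely pain free' or 'much better'
    A patient is classified as improved when change >= cut_off.
    Ties are resolved in favour of the smallest cut-off.
    """
    n_pos = sum(success)
    n_neg = len(success) - n_pos
    best_cut, best_error = None, float("inf")
    for cut in sorted(set(changes)):
        false_neg = sum(1 for c, s in zip(changes, success) if s and c < cut)
        false_pos = sum(1 for c, s in zip(changes, success) if not s and c >= cut)
        error = false_neg / n_pos + false_pos / n_neg
        if error < best_error:
            best_cut, best_error = cut, error
    return best_cut, best_error

# Hypothetical ODI improvements and anchor responses for ten patients.
changes = [30, 25, 18, 22, 5, 8, 12, 2, 20, 28]
success = [True, True, True, True, False, False, False, False, False, True]

cut_off, error = mic_from_roc(changes, success)
print(cut_off, error)
```

Minimising the sum of the false positive and false negative rates is equivalent to maximising sensitivity plus specificity. Applying the same function to final scores instead of change scores would give a threshold of the kind used for treatment success, provided the scores are oriented so that larger means better.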

Clinically relevant differences between FU1 and FU2
For ODI, VAS BACK/LEG and EQ-5D, differences on a group level were explored by calculating mean differences between baseline and follow-up at one year and between the first and second year. p values are not presented, since even very small differences tend to reach statistical significance in large study populations. Instead, the differences were related to the MIC value as well as to a threshold of treatment success at one year post-operatively for each PROM. Thresholds of treatment success are an alternative to the MIC. The statistical method was the same as the one for computing MIC values described above, but final scores were used instead of score changes. The proportions of patients reaching the MIC and the threshold of treatment success at FU1 and FU2 were calculated. For the retrospective single-item questions GA BACK/LEG and Satisfaction, the test for two paired proportions (McNemar's test) [12] was used to explore to what extent patients moved from the successful outcome group to the unfavourable outcome group or vice versa between 1 and 2 years of follow-up. GA BACK/LEG was dichotomized as described above. Satisfaction was dichotomized into 'success' (i.e. 'satisfied') and 'not success' (i.e. 'uncertain' or 'dissatisfied').
SPSS version 24 was used for the statistical analyses, except for the computation of the MIC values, where SAS JMP was used.
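For readers unfamiliar with the test for two paired proportions, a minimal sketch of McNemar's chi-square statistic is given here. It is not the SPSS routine used in the study. It assumes the version without continuity correction, and the counts are hypothetical. Only the discordant pairs (patients who change category between FU1 and FU2) enter the statistic.

```python
import math

def mcnemar(b, c):
    """McNemar's chi-square test without continuity correction.

    b: patients with 'success' at FU1 and 'not success' at FU2
    c: patients with 'not success' at FU1 and 'success' at FU2
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0
    stat = (b - c) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical discordant counts with the same ratio at two sample sizes.
small_stat, small_p = mcnemar(60, 55)
large_stat, large_p = mcnemar(6000, 5500)
print(small_p, large_p)
```

The same ratio of shifts in the two directions yields a much smaller p value at the larger sample size, which is the reason the differences are also judged against clinical relevance and not by statistical significance alone.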

Ethical considerations
Informed consent was obtained from all participants in Swespine. This research project was approved by the regional ethical review board.

Baseline PROM data
Demographic and baseline PROM data are presented in Table 1 and follow-up PROM data in Table 2. The LSS group had the lowest mean score in ODI, 43 (± 16). The DDD group had the highest score in VAS BACK, 67 (± 22). The LDH group had the highest VAS LEG score, 67 (± 26), as well as the lowest quality of life as measured by the EQ-5D, 0.26 (± 0.34).
In Fig. 3a-d, score changes on the individual level are illustrated by histograms. These figures highlight the large spread in score change occurring up to one year post-op in contrast to the harmonious and narrow bell-shaped histograms depicting the changes between FU1 and FU2, where the median equals '0'.

Proportions reaching MIC and thresholds of treatment success at FU1 and FU2
To detect clinically relevant differences in outcome between FU1 and FU2, the MIC, corresponding to a considerable improvement, was computed for each PROM and diagnostic group (Table 3). In the LDH and DDD groups, the proportion reaching MIC at FU2 was at most 2% smaller than at FU1 for any of the PROMs. In the LSS group, 2-4% fewer patients reached MIC at FU2 than at FU1. The percentage reaching thresholds of treatment success at the two follow-up occasions (Table 4) displayed a similar pattern. The largest deterioration (5%) was seen in VAS BACK in the LSS group.
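How a proportion reaching MIC is obtained at each follow-up can be sketched as below. The example assumes a PROM where lower scores are better (such as ODI or VAS), so that improvement is baseline minus follow-up. The scores and the MIC of 15 points are hypothetical and are not taken from Table 3.

```python
def proportion_reaching_mic(baseline, follow_up, mic):
    """Share of patients whose improvement (baseline - follow-up) is >= mic.

    Assumes a PROM where lower scores are better, such as ODI or VAS.
    """
    reached = [b - f >= mic for b, f in zip(baseline, follow_up)]
    return sum(reached) / len(reached)

# Hypothetical ODI scores for five patients (not Swespine data).
baseline = [50, 44, 60, 38, 52]
fu1 = [20, 40, 30, 30, 25]
fu2 = [22, 41, 28, 20, 40]

p_fu1 = proportion_reaching_mic(baseline, fu1, mic=15)
p_fu2 = proportion_reaching_mic(baseline, fu2, mic=15)
print(p_fu1, p_fu2, p_fu2 - p_fu1)
```

In this toy data set one patient reaches MIC only at FU2 and another only at FU1, so the group-level proportion is unchanged although individuals shift. This is why the paired analysis of individual shifts complements the comparison of proportions.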

Test of two paired proportions of Satisfaction and GA BACK/LEG
As shown in Table 5, 85-89% of the patients, depending on the diagnosis, made the same self-assessment at FU1 as at FU2. In the LDH group, 6% switched from being satisfied ('success') at FU1 to being uncertain or dissatisfied ('not success') at FU2, and 5.5% shifted from being uncertain or dissatisfied to being satisfied. The corresponding numbers in the LSS and DDD groups were somewhat higher but showed a similar pattern. Despite the relatively small differences, statistical significance was reached in the latter groups, implying a deterioration in outcome between FU1 and FU2. Regarding patient self-assessment of changes in back pain and leg pain (Table 6), 82-85% of patients displayed the same outcome at the two follow-up occasions. Apart from responses on GA BACK in the DDD group, there was a statistically significant deterioration in outcome between FU1 and FU2, even though the difference between the proportion of patients shifting to a worse outcome and the proportion shifting to a better outcome was no more than 2%.

Discussion
This study confirms that potential improvements of clinical importance occur during the first year after lumbar surgery, irrespective of diagnosis and of which PROM is used. The mean differences between FU1 and FU2 in ODI, VAS BACK/LEG and EQ-5D, as well as the proportions reaching MIC, indicated a minor deterioration in outcome between the first and second year of follow-up in all three diagnostic groups. However, a deterioration of this size could be expected in populations with degenerative conditions. Our data confirm the results of Adogwa et al. [2], who concluded that ODI and VAS BACK/LEG obtained at 12 months adequately predict the outcomes at 24 months in patients undergoing lumbar nerve root decompression and fusion. Glassman et al. [13] found no differences between FU1 and FU2 in the ODI and the NRS BACK/LEG in patients with adult spinal deformity. In a recent study, Kim et al. concluded that 1-year outcomes can reliably predict 2-year outcomes for discectomy, but that this was not clear for laminectomy or fusion procedures [14]. The discordant results may be caused by the use of methods different from those in our study. In the report by Kim et al., the conclusion was based on a model where a change in ODI score of just one point could turn a meaningful outcome into a non-meaningful outcome, which was also pointed out by the authors. Small differences in a change score between two time points are likely to be seen and are possibly caused by normal fluctuations.
The proportion shifting from 'success' on GA LEG and GA BACK at FU1 to 'not success' at FU2 was 8.5% and 8% in the LDH group, 11% and 10% in the LSS group and 10% and 8% in the DDD group. Some of these individuals may represent well-known causes for reoperation such as recurring disc herniation, incomplete decompression, adjacent level stenosis and pseudarthrosis. The movements from 'success' to 'not success' and vice versa between the two follow-ups may also be manifestations of normal symptom fluctuations, recall bias, response shift or other measurement errors in PROMs.
It is also worth noting that the 'not success' category does not entirely consist of patients experiencing a deterioration. This category also includes individuals responding 'somewhat better' and 'unchanged'. Indeed, when investigating the number of patients shifting from 'pain free' or 'much better' to 'worse', it turns out that no more than 2.5% of the whole study population shifted from 'success' at FU1 to 'worse' at FU2 on GA LEG, the corresponding percentage for GA BACK being 1.8%. Although there was a statistically significant deterioration in outcome between FU1 and FU2, its relevance in clinical practice can certainly be questioned. These findings suggest that a follow-up at both one and two years post-surgery in effectiveness studies is unnecessary.
The difficulties in reaching a consensus regarding the definition of a minimal important change relevant to patients (MIC) and the most appropriate method to compute it [15,16] have led to the search for new strategies to define a clinically relevant outcome. Such an alternative is the definition of cut points of treatment success based on absolute scores as opposed to change in scores [17]. Such a threshold is the value above (or below depending on the scale of the PROM) which a patient considers the magnitude of his or her encumbrance to be acceptable following the surgery. A final score may be less affected by response shift and recall bias than a change score would, although that possible advantage is yet to be proven. Thresholds of success based on absolute post-operative scores at FU1 and FU2, respectively, were estimated as previously described by Tubach et al. [18] and recently by van Hooff et al. in a lumbar pain population [17]. No relevant differences could be found.
The proportion of patients reaching the thresholds of success was rather low in the LSS group, approximately 47-60% depending on the PROM. The cause might be the stringent definition of a successful outcome, which leaves those who responded 'somewhat better' and 'unchanged' as unsuccessful. Other contributing factors to this finding may be that (i) the LSS population is relatively older, with a larger prevalence of comorbidity and probably other painful degenerative conditions such as hip arthritis, which may confound the outcome of the surgery if measured by PROMs, (ii) the indication for surgery is too wide and (iii) the degeneration of the spine is often a multisegmental process that may give rise to future pain and disability. In this study, we chose not to make a distinction between LSS patients undergoing decompression surgery only and those treated with decompression and fusion surgery. The decision was based on previous studies [19,20].
There appears to be no need for a 2-year follow-up of PROMs in this population, and resources could be better spent on increasing the response rates at one year for registries, thereby improving the quality of the data and reducing the risk of assessment bias. Efforts could also be made to define certain groups of patients or procedures where a longer follow-up period, such as 2 years, or a shorter one may be needed. The value of PROM assessment based on Swespine data collected at 5 and 10 years post-surgery is unclear and is yet to be studied. The more time that has passed since the operation, the higher the risk of other health-related events complicating the interpretation of the PROMs. More 'objective' endpoints, such as reoperation rates, cause of reoperation, time from index surgery to a new surgical event or implant survival, might be of greater importance than PROMs in the very long-term assessments.

Limitations
Although the strength of this study is the large real-life database, a selection bias may be present because of the proportion of non-respondents. In this study, the response rates were 75% at FU1 and 65% at FU2, and adjustments for risk factors associated with responding were not made. Solberg and colleagues concluded, however, that there were no differences in outcome between respondents and non-respondents in a population with degenerative lumbar disorders retrieved from the Norwegian spine registry and that the non-respondents could be treated as missing at random [21]. The same conclusion was later drawn in a similar study based on data from the Danish spine registry [22]. The high similarity between these countries should make the results applicable to the current study. The EQ-5D index is a measure designed for cost-effectiveness analyses and not for a similarity study such as this one. Therefore, the results involving EQ-5D should be interpreted with caution. As in all studies where means are calculated and compared, longitudinally or cross-sectionally, there is a risk that patients worsening and patients improving neutralize each other to some extent. As illustrated in Fig. 3, this risk appears to be low. This study investigated whether there are any clinically important differences in PROMs between 1 and 2 years of follow-up. Possible differences in outcome between 1 and 5 or 10 years are, however, yet to be shown.

Conclusion
All PROMs used in Swespine in follow-up of degenerative lumbar spine surgery demonstrate minor differences between one and two years that can be expected in populations symptomatically treated for degenerative conditions. This suggests that when a PROM is the outcome variable, a one-year follow-up is sufficient in clinical effectiveness studies.