Introduction

Lumbar spinal stenosis (LSS) is a common disease of the adult spine in which degenerative changes narrow the lumbar spinal canal and compress neurovascular structures. It can occur with or without degenerative spondylolisthesis. LSS is associated with pain and reduced physical function, including walking problems, and is the most frequent reason for lumbar surgery, with the number of operations increasing in the ageing population [1, 2]. The optimal non-surgical and/or surgical treatment of these patients is still debated [3,4,5,6].

Reliable, valid, and responsive outcome measurements are cornerstones in evaluating disease progress in patients regardless of treatment. There are two major aspects of responsiveness: “internal responsiveness”, which characterizes the ability of a measure to change over a prespecified time frame, and “external responsiveness”, which reflects the extent to which change in a measure relates to a corresponding change in a reference measure of clinical or health status [7]. Patient-reported outcome measures (PROMs) use various questions to reflect the severity of the condition or the level of disability. In patients with LSS, the most used PROM is the Oswestry Disability Index (ODI) [8, 9], designed to explain how back or leg pain affects the ability to function in everyday life. ODI has been widely used in studies of patients with lumbar spinal disorders treated both surgically and non-surgically. The Zurich Claudication Questionnaire (ZCQ), also named the Swiss Spinal Stenosis Scale [10], was developed specifically for patients with lumbar spinal stenosis. However, ZCQ has been less investigated, and few studies in patients with LSS have applied both PROMs [9, 11].

This study aimed to evaluate the responsiveness of the ODI and ZCQ, and investigate the cut-off values in ODI and ZCQ used for defining clinical “success” for surgically treated patients with LSS.

Materials and methods

Study design

This is a secondary analysis of prospective data derived from the NORwegian Degenerative spondylolisthesis and spinal STENosis (NORDSTEN) multicentre study (18 public hospitals located in all regions of Norway). The current study was based on baseline and two-year follow-up data from the surgically treated patients in the NORDSTEN-SST (patients were randomized to three different decompression techniques) and the NORDSTEN-DS (patients were randomized to surgical decompression alone or decompression with instrumented fusion). Detailed descriptions of the results of these trials after two-year follow-up have been published previously [4, 5].

Participants

Inclusion criteria were clinical symptoms of LSS defined as neurogenic claudication or radiating pain into the lower limbs, not responding to at least three months of non-surgical treatment; radiological findings corresponding to the clinical findings; and men and women aged > 18 and ≤ 80 years. Patients with a single-level spondylolisthesis ≥ 3 mm were enrolled in the NORDSTEN-DS. Exclusion criteria were former surgery at the level of stenosis; former fracture/fusion of the thoracolumbar spine; cauda equina syndrome or fixed complete motor deficit; ASA grade 4 or 5; lumbar scoliosis > 20 degrees; stenosis in more than three lumbar levels; and distinct symptoms in the lower limbs due to other diseases. Inclusion and exclusion criteria for the NORDSTEN trials have previously been described in detail [4, 5, 12]. Furthermore, to be included in the present study, patients must have responded to the ODI and ZCQ at inclusion and to the ODI, ZCQ and Global Perceived Effect (GPE) scale at two-year follow-up (Fig. 1).

Patient-reported outcome measures

The primary outcome measure for the NORDSTEN trials was the Norwegian-validated version of ODI version 2.0 [13]. The Norwegian-validated version of ZCQ was a secondary outcome measure [14].

The ODI was proposed in 1980 [15] and was originally developed to assess chronic low back pain. The ODI version 2.0 [8] also includes leg symptoms. There are ten questions addressing different aspects of function. Each question is scored on a 6-point Likert response scale ranging from 0 to 5. An index (0-100) is calculated, where a higher score represents higher levels of disability.
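The index calculation described above can be sketched as follows. This is a minimal illustration, not the scoring procedure used in the NORDSTEN trials: the function name `odi_index` is hypothetical, and the handling of unanswered items (dropping them from both the sum and the maximum possible score) is an assumption based on common ODI scoring practice.

```python
from typing import Optional, Sequence


def odi_index(item_scores: Sequence[Optional[int]]) -> float:
    """Return the ODI as a percentage (0-100); higher means more disability.

    item_scores holds the ten item scores (0-5); None marks an unanswered item.
    Assumption: unanswered items are dropped from numerator and denominator.
    """
    if len(item_scores) != 10:
        raise ValueError("the ODI has ten items")
    answered = [s for s in item_scores if s is not None]
    if not answered:
        raise ValueError("no items answered")
    if any(s < 0 or s > 5 for s in answered):
        raise ValueError("item scores must lie between 0 and 5")
    # Sum of answered items divided by the maximum possible sum, as a percentage.
    return 100.0 * sum(answered) / (5 * len(answered))
```

For example, a patient scoring 2 on every item would obtain an index of 40.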

The ZCQ was developed in 1996 [10] to measure symptoms associated with LSS specifically. The tool assigns a higher priority to leg symptoms than to low back pain. The ZCQ consists of three scales addressing symptom severity, physical function, and patient satisfaction, the latter only relevant at follow-up. On a Likert response scale, each question scores from 1 to 4 or 5. Each scale score is calculated as a mean of all answered questions and a higher score represents a higher level of disability.
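A ZCQ scale score, as described above, is simply the mean of the answered items. The sketch below assumes a hypothetical helper `zcq_scale_score`; the `max_item_score` parameter (4 or 5, depending on the scale) is introduced here only for input checking and is not part of the published instrument.

```python
from typing import Optional, Sequence


def zcq_scale_score(item_scores: Sequence[Optional[int]], max_item_score: int) -> float:
    """Return the mean of all answered items of one ZCQ scale.

    item_scores holds the item scores of a single scale (1..max_item_score);
    None marks an unanswered item. Higher means more disability.
    """
    answered = [s for s in item_scores if s is not None]
    if not answered:
        raise ValueError("no items answered")
    if any(s < 1 or s > max_item_score for s in answered):
        raise ValueError("item score outside the allowed range")
    return sum(answered) / len(answered)
```

With one unanswered item, `zcq_scale_score([3, 4, None, 2, 3], 5)` averages the four answered items only.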

Two years after surgery, the patients also responded to the Norwegian-validated version of the GPE scale [16] to estimate patient-reported overall judgement of the outcome of the surgery. There were seven response choices: ‘completely recovered’, ‘much improved’, ‘slightly improved’, ‘unchanged’, ‘slightly worse’, ‘much worse’, or ‘worse than ever’.

Statistical analysis

Standard descriptive statistics were presented using mean and standard deviation (SD) or median and interquartile range (IQR) for continuous variables and absolute frequencies and percentage distribution for categorical variables. In addition, histograms and box plots were used to illustrate the distribution of outcome measures, also in relation to the GPE scale. Internal responsiveness was evaluated using effect size and standardized response mean (SRM). The effect size was calculated as the mean difference between pre- and postoperative outcome measures divided by the standard deviation of the measure preoperatively. SRM was defined as the mean change divided by the standard deviation of the change and was calculated for both absolute and relative change in outcome measures. Effect sizes and SRMs with an absolute value greater than 0.80 were considered large. External responsiveness was evaluated by estimating Spearman’s rank correlation between outcome measures and the GPE scale. Correlation coefficients in the range of 0.5–0.67 were considered moderate, and larger correlation coefficients were considered strong. External responsiveness was also evaluated using receiver operating characteristics (ROC) curves and the corresponding area under the curve (AUC), using the GPE scale dichotomized into “success” vs “non-success” (‘completely recovered’/‘much improved’ vs. remaining categories) as an external anchor. The higher the AUC, the higher the ability of a continuous measurement to discriminate patients into a successful or non-successful outcome after treatment. AUCs were considered moderate (0.70–0.79), high (0.80–0.89) or excellent (≥ 0.90). We evaluated the sensitivity, specificity and correct classification rates of all possible cut-offs to identify clinically appropriate cut-off values. All analyses were done using Stata version 17.1.
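The responsiveness statistics defined above can be illustrated with a short sketch. It is not the Stata code used for the analyses. The function names are hypothetical, change is assumed to be follow-up minus baseline (so improvement is negative), the sample standard deviation is assumed, and the AUC is computed by pairwise comparison of “success” and “non-success” patients, which is equivalent to the area under the empirical ROC curve.

```python
from statistics import mean, stdev
from typing import List, Sequence

# GPE categories counted as "success" (the anchor described in the text).
SUCCESS_LABELS = {"completely recovered", "much improved"}


def is_success(gpe_answer: str) -> bool:
    """Dichotomize a GPE answer into success (True) vs non-success (False)."""
    return gpe_answer.strip().lower() in SUCCESS_LABELS


def change_scores(pre: Sequence[float], post: Sequence[float]) -> List[float]:
    """Absolute change, follow-up minus baseline (assumed sign convention)."""
    return [b - a for a, b in zip(pre, post)]


def effect_size(pre: Sequence[float], post: Sequence[float]) -> float:
    """Mean change divided by the SD of the preoperative scores."""
    return mean(change_scores(pre, post)) / stdev(pre)


def standardized_response_mean(pre: Sequence[float], post: Sequence[float]) -> float:
    """Mean change divided by the SD of the change scores."""
    changes = change_scores(pre, post)
    return mean(changes) / stdev(changes)


def roc_auc(scores: Sequence[float], success: Sequence[bool]) -> float:
    """Probability that a random success patient has a lower (better) score
    than a random non-success patient; ties count as one half."""
    pos = [s for s, ok in zip(scores, success) if ok]
    neg = [s for s, ok in zip(scores, success) if not ok]
    if not pos or not neg:
        raise ValueError("both outcome groups must be non-empty")
    wins = sum(1.0 if p < n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because lower scores mean less disability for both instruments, the same `roc_auc` function applies to follow-up scores and to change scores under the sign convention assumed here.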

Results

In the NORDSTEN study, 704 LSS patients were included and treated surgically between February 2014 and September 2018: 437 in the NORDSTEN-SST and 267 in the NORDSTEN-DS trial [4, 5, 12]. The present analysis excluded 103 patients due to missing or incomplete ODI, ZCQ, or GPE scale preoperatively and/or at two-year follow-up. Consequently, 601 (85%) patients were included in the present study (Fig. 1).

Fig. 1

Flowchart of the included surgically treated lumbar spinal stenosis patients from the NORDSTEN-trials. LSS: Lumbar spinal stenosis, NORDSTEN: NORwegian degenerative spondylolisthesis and spinal STENosis, ODI: Oswestry Disability Index, ZCQ: Zürich claudication questionnaire, GPE: Global perceived effect scale. * = included two dead and two withdrawn consents, ^=included three dead and one withdrawn consent

The demographic data and patient characteristics at baseline are shown in Table 1.

Table 1 The demographic data, general health condition, and clinical scores at baseline in the NORDSTEN-study for the surgically treated lumbar spinal stenosis patients

The outcome two years after surgery is shown in Table 2. Patients rated the outcome as ‘completely recovered’/‘much improved’ in 392 (65.3%) cases.

Table 2 Outcome after surgery for lumbar spinal stenosis assessed by patients at two-year follow-up in the NORDSTEN trials: global perceived effect scale (GPE scale), Oswestry disability index (ODI), and Zurich claudication questionnaire (ZCQ)

Figure 2 shows the distribution of preoperative scores and scores at two-year follow-up for ODI and ZCQ. It also shows the absolute and relative PROM changes from surgery until two-year follow-up. Data for the NORDSTEN-SST and NORDSTEN-DS separately are found in supplementary information Fig. 1.

Fig. 2

Histograms showing the distribution of preoperative scores, follow-up scores and changes (absolute and relative) in scores for Oswestry disability index (ODI) and Zurich claudication questionnaire (ZCQ) symptom severity and physical function in patients surgically treated for lumbar spinal stenosis in the NORDSTEN trials. ODI: Oswestry disability index (0-100) a higher score represents higher levels of disability, ZCQ-symptom: Zürich claudication questionnaire symptom severity (1–5) a higher score represents a higher level of disability, ZCQ-function: Zürich claudication questionnaire physical function (1–4) a higher score represents a higher level of disability

ODI and ZCQ scores after two-year follow-up, as well as absolute and relative changes from baseline, are summarized in Table 2 and graphically described within each patient group based on response to the GPE scale in Fig. 3. Data for the NORDSTEN-SST and NORDSTEN-DS separately are found in supplementary information Fig. 2.

Fig. 3

The follow-up scores, absolute change scores, and relative change scores at two-year follow-up for Oswestry disability index (ODI) and Zurich claudication questionnaire (ZCQ) symptom severity and physical function in each Global Perceived Effect scale (GPE) group for patients surgically treated for lumbar spinal stenosis in the NORDSTEN trials. GPE scale groups: CR (Completely recovered), MI (Much improved), SI (Slightly improved), NC (Not changed), SW (Slightly worse), MW (Much worse), WE (Worse than ever). ODI: Oswestry disability index (0-100), a higher score represents higher levels of disability; ZCQ-symptoms: Zürich claudication questionnaire symptom severity (1–5), a higher score represents a higher level of disability; ZCQ-physical function: Zürich claudication questionnaire physical function (1–4), a higher score represents a higher level of disability

Internal responsiveness

Effect sizes and SRMs were large (absolute value > 0.8) and similar across all outcome measures. Effect sizes ranged from −1.39 for ODI to −1.79 for ZCQ symptom severity score. In comparison, SRMs ranged from −1.19 for absolute change in ODI to −1.25 for absolute change in ZCQ symptom severity score (Table 3).

External responsiveness

All outcome measurements correlated significantly with the GPE scale. ODI, ZCQ symptom severity, and ZCQ physical function all had strong or close to strong Spearman’s correlation coefficients (≥ 0.67) for follow-up scores and relative change scores (Table 3), whereas the coefficients were moderate (≥ 0.50) for absolute change scores.

ODI, ZCQ symptom severity, and ZCQ physical function all had high test accuracy (AUC from 0.8 to 0.9) in ROC curves (Table 3; Fig. 4), with somewhat higher values for the follow-up and relative change scores than for the absolute change scores. Data for the NORDSTEN-SST and NORDSTEN-DS separately are found in supplementary information Fig. 3.

Table 3 Responsiveness for Oswestry disability index (ODI) and Zurich claudication questionnaire (ZCQ) for surgically treated lumbar stenosis patients. Internal responsiveness was assessed by effect size and standardized response mean (SRM), and external responsiveness by Spearman rank correlation coefficient and the receiver operating characteristics area under curve (ROC AUC) using the global perceived effect scale (GPE) as the external criterion

Fig. 4

The receiver operating characteristics (ROC) and area under the curve (AUC) using the Global Perceived effect scale (GPE) as the external criterion. The follow-up scores, absolute change scores, and relative change scores for Oswestry disability index (ODI) and Zurich claudication questionnaire (ZCQ) for patients surgically treated for lumbar spinal stenosis in the NORDSTEN-trials

Cut-off values for defining clinical “success”

The correct classification rate for all possible cut-off values and the corresponding sensitivity and specificity for defining “success”/“non-success” for follow-up scores, absolute change scores, and relative change scores are illustrated in Fig. 5. For relative change scores, all ODI cut-offs between 27.5% and 49% correctly classified 81 to 82 per cent of the patients, although with varying sensitivity and specificity. For ZCQ symptom severity, the correct classification rate was 78–79 per cent in the interval between 30 and 38%, and for ZCQ physical function, 80–81 per cent in the interval between 35 and 50% relative change. The figure illustrates that, regardless of the chosen combination of sensitivity and specificity, the maximum accuracy was lower for the absolute change score than for the follow-up score and the relative change score.
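The evaluation of candidate cut-offs described above can be sketched as a simple sweep. This is an illustration under stated assumptions rather than the analysis code of the study: the names `relative_improvement`, `evaluate_cutoffs` and `CutoffResult` are hypothetical, relative change is assumed to be expressed as percentage improvement from baseline, and a patient is classified as a “success” when the improvement is at or above the cut-off.

```python
from typing import List, NamedTuple, Sequence


class CutoffResult(NamedTuple):
    cutoff: float
    sensitivity: float
    specificity: float
    correct_rate: float


def relative_improvement(pre: float, post: float) -> float:
    """Percentage improvement from baseline (positive = better); assumed convention."""
    if pre == 0:
        raise ValueError("relative change is undefined for a baseline score of 0")
    return 100.0 * (pre - post) / pre


def evaluate_cutoffs(
    improvement: Sequence[float],
    success: Sequence[bool],
    cutoffs: Sequence[float],
) -> List[CutoffResult]:
    """For each cut-off, compare predicted success (improvement >= cut-off)
    with the anchor-based success and report sensitivity, specificity and
    the proportion of correctly classified patients."""
    n_pos = sum(1 for ok in success if ok)
    n_neg = len(success) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both outcome groups must be non-empty")
    results = []
    for c in cutoffs:
        tp = sum(1 for x, ok in zip(improvement, success) if ok and x >= c)
        tn = sum(1 for x, ok in zip(improvement, success) if not ok and x < c)
        results.append(CutoffResult(c, tp / n_pos, tn / n_neg, (tp + tn) / len(success)))
    return results
```

A sweep of this kind makes it easy to see when several neighbouring cut-offs give nearly the same correct classification rate while trading sensitivity against specificity.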

Fig. 5

Possible cut-off values for Oswestry disability index (ODI) and Zurich claudication questionnaire (ZCQ) symptom severity and physical function for defining “success” after surgical treatment of lumbar stenosis patients (“success” = Global Perceived Effect scale (GPE) “completely recovered” and “much improved”). The proportion of correctly classified patients is balanced against sensitivity and specificity. The curves for follow-up score, absolute change score and relative change score are shown

Discussion

In the present study, ODI and ZCQ showed good responsiveness for assessing clinical outcomes in patients treated surgically for LSS with or without degenerative spondylolisthesis. The internal responsiveness was good for both tested PROMs. The external responsiveness and the ability to discriminate between “success” and “non-success” were strong for follow-up and relative change scores, whereas they were moderate for absolute change scores. The 30% threshold for the ODI relative change used in the NORDSTEN trials was within the range of cut-off values giving a high correct classification rate.

The two PROMs, ODI and ZCQ, are specific for spinal conditions. Since ODI was originally developed for patients with low back pain, and ZCQ was developed for LSS patients, the finding of similar and good responsiveness for both in surgically treated LSS patients is interesting.

Follow-up score, absolute change score, and relative change score for the PROMs are alternative response parameters used in evaluations. The relative change (percentage) score has been recommended to account for the influence of the baseline score on the outcome score [17,18,19,20]. In the present study, the follow-up score and the relative change score performed better than the absolute change score, which is in accordance with a previous study from the Norwegian Spine Registry [20].

Internal responsiveness was good. We found a large effect size and SRM for both ODI and ZCQ, of the same magnitude as in Fujimori et al.’s investigation of patients operated for LSS at one-year follow-up [11]. In the present study, external responsiveness evaluated by Spearman correlation coefficients was higher than that reported by Fujimori et al. (coefficients around 0.50), whereas the ROC test accuracy was similar.

Contrary to a previously published article on this topic [11], we do not provide a definitive ranking of the instruments compared. Such rankings might be sensitive to random variation and misleading because one puts too much confidence in one instrument being better than others. We intentionally focus on the similarities between ODI and ZCQ, both in numerical results and concerning clinical relevance and usefulness.

Formerly published cut-off values for ODI and ZCQ defining a clinical, minimal, or substantial important difference have been calculated with various methods and based on follow-up, absolute, or relative change scores. Different external anchors have been used; however, the patient’s perceived global assessment of outcome or satisfaction (GPE-scales) is the most used. Patient response has been given on a five- or seven-point Likert scale, and the anchor has been the two or three best answer options. The calculations have been based on different study designs, such as clinical or register studies, and the time for follow-up has varied. In addition, most studies have used heterogeneous cohorts of various spine conditions. Therefore, a wide range of cut-off values have been proposed, and comparison is difficult. The present paper’s results demonstrate that a range of cut-off values gave similar results for the proportion of correctly classified patients, and these must be balanced against sensitivity and specificity.

Previously reported clinically important cut-off values for ODI in surgically treated LSS patients were within the range of possible cut-off values reported in the present study, both for follow-up scores [20, 21] and for absolute change scores [11, 17, 20]. A reported cut-off value for the absolute change score for ZCQ symptom severity and physical function [11] seemed to be lower than in the present study. The explanation might be that the authors used a five-point, not seven-point, Likert scale. In reports from others, relative change cut-offs for ODI between 10 and 40% have been suggested [19, 20]. The 30% threshold for defining treatment as a clinical “success”, recommended and predefined in the NORDSTEN trials, is based on a registry study with one-year follow-up [20] and a “gathering to consensus” paper [17], and it lies within the range of cut-offs with high accuracy. In the registry study [20], the anchor was completely recovered/much improved. Because of the strict anchor (not including slightly improved), high sensitivity was favoured to ensure the detection of true positive “successes”. The present study showed that the 30% ODI cut-off was within the interval giving a high correct classification rate. However, the percentage of correctly classified patients would also be high using cut-offs higher than 30%. When planning future comparative studies, one should consider not relying on a single cut-off but also performing sensitivity analyses using different cut-offs with high accuracy. It is also plausible that patients’ expectations are higher during a clinical study with follow-up than when they are part of a registry, so that trial patients may be more disappointed and answer “slightly improved”. In addition, a registry will have a more heterogeneous patient population.

Strengths and limitations

The present analyses were based on a large cohort of surgical patients with a high follow-up rate. International guidelines for outcome measures were followed, along with translated and validated PROMs.

There is limited consensus about the best anchor for measuring changes in disease severity by PROMs. We selected the GPE scale since it has been commonly used and recommended [18, 22]. It has been advised to use a seven-point rating scale of change and to set the cut-off for clinically relevant improvement between “much improved” and “slightly improved” [19, 22]. The GPE scale in the present paper concerned outcome, but some studies have also used satisfaction [21]. Despite being commonly used, the GPE scale also has some weaknesses [16, 19, 20]: there is a possible recall bias in responding two years after the surgical intervention; the scale is not domain specific; one does not know what kind of deterioration patients had in mind, whether other diseases were interfering, or whether patients were more satisfied with care than with treatment; and, in addition, variation in mood may influence the patient’s response. Another critical concern is that another subjective measurement (the GPE scale) was used to evaluate the PROMs (ODI, ZCQ). In the present study, both tested PROMs correlated moderately to strongly with the GPE scale, as should be expected. Fujimori et al. found a discrepancy between the questionnaires’ improvement and the GPE scale [11]. Since they used a five-point Likert scale for GPE, it might be harder to reveal improvement. Recall bias and individual expectations of outcome may also play a role in a cultural frame.

Strictly, we did not ask the patients if the change was clinically important or a “success”. Still, we considered the anchor answers “completely recovered” and “much improved” to be indicators of a significant improvement at follow-up. These concepts were discussed in some recent studies [23, 24].

Patients lost to follow-up represent a potential source of bias. In the present study, there was < 10% lost to follow-up, which made a high risk of bias unlikely. Also, a recent study based on data from the Norwegian Spine Registry showed that non-respondents had similar clinical outcomes [25].

The present analyses corroborate the responsiveness of two commonly used outcome measurements in a large sample of patients with LSS treated surgically. Even though the responsiveness was comparable, when choosing an instrument for a study, one should remember that these PROMs were developed for different purposes. Our results for surgically treated LSS patients may not be reproduced if the patients, for instance, received conservative treatment. ODI having as good responsiveness as ZCQ in the present study might be related to surgery preceding the observed change in scores for the included patients. Furthermore, ZCQ focuses on all symptoms in the lower limbs and walking trouble, whereas ODI measures the influence of back and leg pain on daily life function.

Conclusion

ODI and ZCQ demonstrate comparable responsiveness in evaluating clinical outcomes for surgically treated LSS patients. To reflect “success”, the follow-up and relative change scores seem more accurate than the absolute change score. The 30% ODI threshold defining treatment success in the NORDSTEN trials favoured sensitivity over specificity and was within the range of cut-offs giving a high correct classification rate.