Background

Lumbar degenerative conditions, including lumbar spinal stenosis, disc herniation and degenerative disc disease (DDD), are the most common reasons for elective lumbar spine surgery [1, 2]. Over the past two decades, the number of lumbar fusion operations has steadily increased worldwide [3,4,5,6].

The outcome of lumbar fusion surgery is often assessed with back-specific patient-reported outcome measures (PROMs) of disability, e.g. the Oswestry Disability Index (ODI). With these, patients rate their perceived limitations in performing various activities commonly affected by low back pain (LBP), such as walking, sitting and lifting [7, 8]. A benefit of back-specific PROMs is that they require little administration and let the patients convey their own view of their health status [9,10,11]. However, back-specific PROMs have shown low- to very low-quality evidence for content validity [12], meaning that it is not certain whether the activities in the PROMs are those that matter most to the patients themselves. Previous research and clinical experience also indicate discrepancies between patients’ scores on PROMs and how they actually perform activities when observed by others or as measured by wearable equipment (e.g. accelerometers) [13, 14].

Several authors have recommended the use of physical capacity tasks [13, 15,16,17,18,19], during which the patient performs a standardized activity in the clinic rather than self-reporting his/her ability to perform the activity [17]. An example of a physical capacity task is the timed up-and-go, which measures the time it takes for a person to rise from a chair, walk three meters, turn around, walk back to the chair and sit down [17]. Physical capacity tasks have been designed to measure what patients can do in a standardized environment, rather than what they think they can do, and, as such, they appear to capture important information about a patient’s functioning that PROMs do not [17, 20, 21]. Physical capacity tasks have also been suggested to be less influenced by language skills and education level than PROMs [10, 22, 23].

Outcome measures used in clinical practice and research should have sufficient evidence for reliability, validity, and responsiveness to avoid imprecise or biased results in the assessment of health interventions [24,25,26]. A recent systematic review showed that the physical capacity tasks 5-min walk, 50-ft walk, 1-min stair climbing, and timed up-and-go demonstrated moderate to strong evidence for reliability and validity [27]. However, the review also identified a lack of evidence concerning responsiveness. Responsiveness is one of the most important properties of an outcome measure since it signifies the ability to detect change over time [24]. It has been recommended that responsiveness is investigated by testing a priori hypotheses on expected associations with other instruments [24, 28]. The responsiveness hypotheses of the current study are presented in Table 1.

Table 1 A priori responsiveness hypotheses

It is also important to determine whether the change over time of an outcome measure is clinically relevant. The minimal important change (MIC), defined as “the smallest change score that patients perceive as important” [28], has been suggested to be a helpful parameter for this purpose [24, 29]. However, the MICs of physical capacity tasks for patients with chronic LBP have rarely been reported in the literature, and even less so for patients with chronic LBP who undergo lumbar fusion surgery.

The aim was to investigate the responsiveness and MIC of 5-min walk, 1-min stair climbing, 50-ft walk, and timed up-and-go in patients with chronic LBP undergoing lumbar fusion surgery.

Methods

This clinimetric study had a prospective design using data from a randomized controlled trial (RCT) [30].

Eligibility criteria

Eligible patients had motion-provoked chronic LBP with degenerative changes of 1–3 lumbar segments, were aged between 18 and 70 years, and were on the waiting list for lumbar fusion surgery [30]. The patients’ main surgical procedure was lumbar fusion surgery for back pain, but they could have minor radiating symptoms with or without a simultaneous surgery for isthmic spondylolisthesis, foraminal stenosis, or disc herniation. Patients with predominant radiculopathy, a rheumatic or neurological disorder, spinal malignancy, or thoracolumbar deformities (e.g. idiopathic scoliosis) were excluded. Patients who had undergone decompression surgery for spinal stenosis or those who had a poor understanding of Swedish were also excluded.

Procedure

Patients were recruited at one university hospital and two private spine clinics in Sweden [30]. An orthopedic surgeon examined the patients and made a diagnosis, based on radiological and clinical findings. The clinic coordinators informed the physiotherapist responsible for patient recruitment when patients were placed on the waiting list. Patients were then contacted by the physiotherapist who informed them of the study and invited them to participate. Patients who were interested in study participation were scheduled for an appointment with an independent observer at one of the private spine clinics, 8–12 weeks before surgery. The independent observer provided the patients with oral and written information about the study. Patients who agreed to participate signed an informed consent form. The independent observer then instructed the patients to fill out PROMs and perform four physical capacity tasks (described below). The patients were then randomized to participation in either a prehabilitation program or conventional care prior to surgery. The prehabilitation program was based on the principles of person-centered care and had a cognitive behavioral approach [30]. The prehabilitation program comprised four preoperative treatment sessions and one postoperative booster session. In accordance with regional procedure, conventional care comprised a single session with a physical therapist. In this session, the patient received information about the post-operative mobilization routine and was introduced to a core exercise program that was initiated the day after surgery. Both study groups received the same physical therapy treatment in the ward after surgery [30]. In the current study, the patients were studied irrespective of the preoperative intervention assigned to them.

Follow-up assessments of the physical capacity tasks for the RCT occurred at 3, 6, 12, and 24 months after surgery [30], but for the purpose of the present study, only the data from baseline and the 6-month follow-up were used.

Sociodemographic variables and fear-avoidance variables for descriptive statistics

Data on age, gender, education, height and weight, back pain duration, previous back surgery, work status, and comorbidity were collected with the preoperative questionnaire used in the Swedish National Quality Registry for Spine Surgery (Swespine) [2]. The type of surgical procedure and the number of fusion levels were obtained from the patients’ medical journals. Fear of movement, depressive symptoms, and pain catastrophizing were assessed with the Tampa Scale for Kinesiophobia [31], the Hospital Anxiety and Depression Scale [32], and the Pain Catastrophizing Scale [33], respectively.

Physical capacity tasks

  • 5-min walk: The patient was asked to walk as fast as possible (without running) for a 5-min period [17]. The circuit was 30 m long and octagonal. The distance covered was recorded in meters.

  • 1-min stair climbing: The patient was asked to climb up and down a flight of stairs for one minute [19]. The staircase was straight with ten steps (16 cm high) and with handrails on both sides which the patient was allowed to use. The handrails were positioned too far apart to be used at the same time. The total number of steps was recorded.

  • 50-ft walk: The patient was instructed to walk as fast as possible (without running) until he/she came back to the starting point [17]. The circuit was 15 m (approximately 50 ft) long and figure-of-eight-shaped. The time needed to complete the test was rounded to the nearest 0.1 s.

  • Timed up-and-go: The patient was asked to rise up from a chair (seat 45 cm high, without armrests) as fast as possible, walk (without running) 3 m to a marked line on the floor, turn around, and walk back to the chair and sit down [17]. The time needed to complete the test was rounded to the nearest 0.1 s.

Five-minute walk, 50-ft walk, and timed up-and-go have demonstrated moderate to strong evidence for adequate test-retest reliability and construct validity [27]. One-minute stair climbing has demonstrated moderate evidence for adequate test-retest reliability [27].

Anchors in the responsiveness and MIC analyses

  • The Oswestry Disability Index 2.0 (ODI) was used to assess patient-reported disability [34]. The ODI has shown a moderate level of evidence of good reliability and construct validity for patients with chronic LBP [35].

  • A 100-mm visual analog scale (VAS) was used to assess the intensity of back pain over the last week. The reliability and validity of VAS in patients with chronic pain are supported by previous research [36].

  • At the 6-month follow-up, the patient filled out three 7-point construct-specific global perceived effect (GPE) scales on how he/she perceived his/her walking ability, stair climbing ability and chair rise ability to have changed from the baseline assessment to the 6-month follow-up: “much worse,” “worse,” “somewhat worse,” “unchanged,” “somewhat better,” “better,” and “much better” (eAppendix 1). Similar GPE scales have been shown to have good reliability and validity for patients with chronic LBP [37, 38].

  • At the 6-month follow-up, the patient filled out a 5-point generic GPE scale on how he/she perceived his/her back pain to have changed from before surgery: “worse,” “unchanged,” “somewhat better,” “much better,” “pain-free.” The scale has shown good responsiveness for patients with chronic LBP undergoing lumbar fusion surgery [39].

Statistical analysis

Statistical analyses were performed with IBM SPSS, version 24.0 (IBM Corp., Armonk, USA) and R, version 3.5.1 (R Foundation for Statistical Computing, Vienna, Austria). Descriptive statistics were used to characterize demographics and score distributions of the physical capacity tasks and the anchors. Continuous variables were presented as means with standard deviations in case of normal distribution, or medians with interquartile range otherwise. Categorical variables were presented as frequencies with accompanying percentages.

If a patient had missing data for a physical capacity task, the patient was excluded from all the analyses on that particular task. Patients who did not fill out the ODI, VAS, or any of the GPE scales were excluded from the analyses of the responsiveness hypotheses that included that particular outcome measure. In the case of missing data on the GPE scales, patients were also excluded from MIC analyses.

Responsiveness analysis

Responsiveness was investigated with a hypothesis-testing approach as recommended by the Consensus-based Standards for the Selection of Health Measurement Instruments (COSMIN) initiative [24]. Responsiveness in the present study was evaluated by testing the five hypotheses presented in Table 1. According to recommendations, an outcome measure is usually considered to have adequate responsiveness if at least 75% of the hypotheses are confirmed [40]. Because five hypotheses were tested in this study, confirmation of at least 80% (four of five) was adopted as the criterion.

Hypothesis 1 was tested by calculating the area under the receiver operating characteristic (ROC) curve for improved and unchanged patients, as classified by the construct-specific GPE scales matched for each particular physical capacity task. The area under the ROC curve can be understood as the probability of correctly distinguishing improved patients from unchanged patients, with 0.5 indicating discrimination no better than chance and 1 indicating perfect discrimination [41]. For hypothesis 1, patients scoring “better” and “much better” on the construct-specific GPE scales (matched for each particular physical capacity task) were classified as improved and those scoring “somewhat worse,” “unchanged,” and “somewhat better” were classified as unchanged. Hypothesis 1 was accepted if the area under the ROC curve was ≥0.70 [40]. For timed up-and-go, hypothesis 1 was tested separately for the construct-specific GPE scales on walking and chair rise, since this task includes both of these activities.
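The dichotomization and the area under the ROC curve described above can be illustrated with a minimal sketch. It is not the study's analysis code (the analyses were run in SPSS and R); the function names and the patient data are hypothetical, and the area is computed through its rank-based interpretation, i.e. the share of improved/unchanged patient pairs in which the improved patient has the larger change score. For the timed tasks, where improvement means fewer seconds, the sign of the change score would need to be reversed first.

```python
def dichotomize(gpe):
    """Map a 7-point construct-specific GPE rating to a class label."""
    improved = {"better", "much better"}
    unchanged = {"somewhat worse", "unchanged", "somewhat better"}
    if gpe in improved:
        return "improved"
    if gpe in unchanged:
        return "unchanged"
    return None  # "worse" / "much worse": not part of this comparison


def auc(improved_scores, unchanged_scores):
    """Probability that a random improved patient has a larger change
    score than a random unchanged patient (ties count one half)."""
    wins = 0.0
    for a in improved_scores:
        for b in unchanged_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(improved_scores) * len(unchanged_scores))


# Hypothetical data: (GPE rating, change in 1-min stair climbing, steps)
patients = [
    ("much better", 30), ("better", 22), ("better", 12),
    ("somewhat better", 15), ("unchanged", 5), ("somewhat worse", -2),
    ("worse", -10),
]

groups = {"improved": [], "unchanged": []}
for rating, change in patients:
    label = dichotomize(rating)
    if label is not None:
        groups[label].append(change)

area = auc(groups["improved"], groups["unchanged"])
hypothesis_1_confirmed = area >= 0.70  # acceptance threshold used in the study
```

In this made-up sample, eight of the nine improved/unchanged pairs are ordered correctly, so the area is 8/9 and the hypothesis would be accepted.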

Hypothesis 2 concerned the area under the ROC curve for improved and unchanged patients, as classified by the generic GPE scale. Patients scoring “much better” and “pain-free” on this scale were classified as improved, and those scoring “unchanged” and “somewhat better” were classified as unchanged. Hypothesis 2 was accepted if the area under the ROC curve generated by the generic GPE scale was lower than the area under the ROC curve generated by the construct-specific GPE scales. For timed up-and-go, hypothesis 2 was tested separately for the construct-specific GPE scales on walking and chair rise.

Hypotheses 3–5 were investigated with Spearman’s rho [42].
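For readers unfamiliar with Spearman’s rho, the following sketch shows how a rank correlation between two sets of change scores can be computed: both variables are converted to ranks (tied values share their mean rank) and the Pearson correlation of the ranks is taken. The helper names and the change scores are hypothetical; in practice a statistical package would be used.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical change scores for five patients.
stair_change = [5, 12, 20, 22, 30]   # steps
walk_change = [10, 40, 35, 60, 80]   # meters in the 5-min walk

rho = spearman_rho(stair_change, walk_change)
```

With these values only one pair of neighboring ranks is swapped, which gives a rho of 0.9.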

MIC analysis

MIC for deterioration was not calculated for any physical capacity tasks since few patients reported deterioration on the construct-specific GPE scales (n = 2). MIC for improvement was determined by the optimal cut-off point of the ROC curve based on the classification of improved and unchanged patients according to the construct-specific GPE scales (same dichotomization as for responsiveness hypothesis 1, described above), matched for each specific physical capacity task. The optimal cut-off point of the ROC curve represents the change score of each physical capacity task that yields the smallest number of misclassifications between improved and unchanged patients [43]. Since MIC can be highly influenced by baseline scores [44, 45], relative values were calculated in addition to absolute values. Relative MICs were calculated based on the ROC curve plotted with the percentage of change from baseline of each physical capacity task, and absolute MICs for improvement were calculated based on the ROC curve plotted with the absolute change from baseline for each physical capacity task. The 95% confidence intervals of the absolute and relative MICs for improvement were generated by taking the 2.5 and 97.5 percentiles of the distribution of 10,000 bootstrap samples [46]. This procedure was performed with the R library pROC [47]. Absolute and relative values for MICs for improvement for timed up-and-go were calculated separately for the construct-specific GPE scales for walking and chair rise.
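The anchor-based MIC procedure can be sketched as follows. This is an illustrative approximation and not the pROC routine used in the study. Several details are assumptions: the optimal cut-off is taken as the threshold that maximizes sensitivity plus specificity (which minimizes the summed proportions of misclassified patients), candidate thresholds are midpoints between consecutive observed change scores, and the bootstrap resamples within the improved and unchanged groups separately. All data are hypothetical, and the example uses 2,000 rather than 10,000 resamples to keep it quick.

```python
import random


def optimal_cutoff(improved, unchanged):
    """Return the change-score threshold maximizing sensitivity + specificity.

    A patient is predicted 'improved' when change >= threshold. Candidate
    thresholds are midpoints between consecutive distinct observed values.
    """
    values = sorted(set(improved) | set(unchanged))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_score = None, -1.0
    for t in candidates:
        sensitivity = sum(c >= t for c in improved) / len(improved)
        specificity = sum(c < t for c in unchanged) / len(unchanged)
        if sensitivity + specificity > best_score:
            best_threshold, best_score = t, sensitivity + specificity
    return best_threshold


def relative_change(baseline, follow_up):
    """Percentage change from baseline, used for relative MICs."""
    return 100.0 * (follow_up - baseline) / baseline


def bootstrap_ci(improved, unchanged, n_boot=10_000, seed=0):
    """Percentile 95% interval of the cut-off over bootstrap resamples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        imp = rng.choices(improved, k=len(improved))
        unc = rng.choices(unchanged, k=len(unchanged))
        cut = optimal_cutoff(imp, unc)
        if cut is not None:  # None if a resample holds one distinct value
            estimates.append(cut)
    estimates.sort()
    lower = estimates[int(0.025 * (len(estimates) - 1))]
    upper = estimates[int(0.975 * (len(estimates) - 1))]
    return lower, upper


# Hypothetical change scores in steps (1-min stair climbing).
improved = [14, 18, 22, 25, 30, 34]
unchanged = [-2, 3, 5, 9, 11, 20]

mic = optimal_cutoff(improved, unchanged)
ci_low, ci_high = bootstrap_ci(improved, unchanged, n_boot=2000)
```

For a relative MIC, the same functions would be applied to percentage changes obtained with `relative_change` instead of absolute change scores. For the timed tasks, improvement corresponds to a negative change, so the comparison direction (or the sign of the scores) would have to be reversed.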

The adequacy of using the construct-specific GPE scales as anchors for the responsiveness and MIC analyses was determined by calculating the correlation (Spearman’s rho) between the construct-specific GPE scales and the change scores of the physical capacity tasks. Previous research suggests that a correlation of at least 0.30 between an anchor and a change score of a measurement instrument is adequate [48].

Patient characteristics

Of the 118 included patients, 10 did not go through surgery. Of those undergoing surgery, 15 did not perform physical capacity testing at the 6-month follow-up. The number of patients included in each analysis of responsiveness and MIC for improvement is presented in Fig. 1. Table 2 shows the baseline characteristics of patients who completed the follow-up, and of the drop-outs. Patients in the drop-out group reported significantly higher levels for depressive symptoms, fear of movement, and pain catastrophizing than those completing the follow-up. The frequency of patients who reported disorders that affect walking ability was significantly larger in the drop-out group (four patients) compared with the patients who completed follow-up (two patients). Patients classified as improved by the construct-specific GPE scales had, on average, more favorable changes from baseline of the physical capacity tasks than unchanged patients (Table 3). Average scores for patients classified as deteriorated are not presented in Table 3 due to small sample sizes (n = 2).

Fig. 1

Flowchart of patient inclusion and the number of patients included in the responsiveness and minimal important change analyses

*One patient did not perform 5-min walk at the baseline assessment.

**The number of patients concerns the responsiveness analysis for hypotheses 2–5. In the data analysis for responsiveness hypothesis 1, only 57 patients were included due to missing data on the construct-specific global perceived effect scales. Moreover, since one patient had missing baseline data for 5-min walk, the number of patients in the responsiveness analysis for all hypotheses was one less for this task than for the others.

***For 5-min walk, 54 patients were included in the minimal important change analysis since one patient had missing baseline data for that physical capacity task

Table 2 Patient characteristics at baseline
Table 3 Baseline, follow-up, and change scores for physical capacity tasks of improved and unchanged patients

Results

Responsiveness

Hypothesis 1 was confirmed for 1-min stair climbing, 50-ft walk, and timed up-and-go as the areas under the ROC curves generated with the construct-specific GPE scales were ≥0.70 for these tasks (Table 4). Hypothesis 2 was confirmed for 1-min stair climbing, 50-ft walk, and timed up-and-go as they had larger areas under the ROC curves generated by the construct-specific GPE scales than those generated by the generic GPE scale (Table 4). In contrast, Hypotheses 1 and 2 were rejected for 5-min walk. Hypothesis 3 was confirmed for all physical capacity tasks as the correlations among the tasks themselves were ≥0.50 (Table 5). Hypothesis 4 was confirmed for all physical capacity tasks as the correlations between the tasks and the ODI were consistently lower than the correlations among the tasks themselves. Hypothesis 5 was rejected for all tasks except for timed up-and-go.

Table 4 Area under the receiver operating characteristics curve and minimal important change for the physical capacity tasks
Table 5 Correlations between change scores of physical capacity tasks, Oswestry disability index, and visual analog scale on back pain intensity

In summary, 1-min stair climbing, 50-ft walk, and timed up-and-go displayed adequate responsiveness (80% of the hypotheses confirmed for 1-min stair climbing and 50-ft walk, and 100% for timed up-and-go), while 5-min walk did not (only 40% of the hypotheses confirmed) (Table 6).
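The percentages in this summary follow directly from the hypothesis results reported above and the 80% criterion. The short sketch below reproduces that arithmetic; the variable and function names are illustrative only.

```python
# Confirmation status of hypotheses 1-5 per task, as reported in the text.
results = {
    "5-min walk":           [False, False, True, True, False],
    "1-min stair climbing": [True, True, True, True, False],
    "50-ft walk":           [True, True, True, True, False],
    "timed up-and-go":      [True, True, True, True, True],
}
CRITERION = 0.80  # at least four of five hypotheses confirmed


def responsiveness(confirmed):
    """Return the share of confirmed hypotheses and whether it meets the criterion."""
    share = sum(confirmed) / len(confirmed)
    return share, share >= CRITERION


summary = {task: responsiveness(flags) for task, flags in results.items()}
for task, (share, adequate) in summary.items():
    print(f"{task}: {share:.0%} confirmed, adequate={adequate}")
```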

Table 6 Results of Hypothesis-Testing for Responsiveness

Minimal important change

Of the 57 patients who completed the construct-specific GPE scales, two reported deterioration on the scales and were excluded from the MIC analyses. Absolute MICs for improvement were 45.5 m for 5-min walk, 20 steps for 1-min stair climbing, −0.6 s for 50-ft walk, and −1.3 s for timed up-and-go (Table 4). The sensitivity and specificity of the absolute and relative MICs for improvement are presented in Table 4. As reference values to the MICs for improvement, Table 4 gives the mean change scores of the physical capacity tasks, indicating the change of the “average” patient.

Adequacy of using the construct-specific GPE scales in the responsiveness and MIC analyses

The correlations between the construct-specific GPE scales and the change scores of the physical capacity tasks were all above the recommended threshold value of 0.30 [48], which supports the adequacy of using the scales in the responsiveness and MIC analyses (Table 4).

Discussion

The present study was one of the first to assess responsiveness and MIC of physical capacity tasks for patients with chronic LBP undergoing lumbar fusion surgery. One-minute stair climbing, 50-ft walk, and timed up-and-go displayed adequate responsiveness with ≥80% of the responsiveness hypotheses being confirmed, while 5-min walk displayed inadequate responsiveness. The positive results of responsiveness for 1-min stair climbing, 50-ft walk, and timed up-and-go suggest that these physical capacity tasks have the ability to detect changes in physical capacity over time in patients who undergo lumbar fusion surgery. The absolute MICs for improvement for 5-min walk, 1-min stair climbing, 50-ft walk, and timed up-and-go were 45.5 m, 20 steps, −0.6 s, and −1.3 s, respectively.

In line with our results, Gautschi et al. found adequate responsiveness for timed-up-and-go [49]. Gautschi et al. investigated the responsiveness of timed up-and-go for a mixed study sample of patients with lumbar spinal stenosis, lumbar disc herniation, and chronic LBP due to DDD undergoing various types of lumbar spine operations. In concordance with our findings, Andersson et al. found that one-minute stair climbing had adequate responsiveness [50]. Furthermore, the authors of that study found that five-minute walk had inadequate responsiveness, also in line with our results. Andersson et al. reasoned that the finding might be a result of the possibility that the task was not challenging enough for patients with chronic LBP. Patients might therefore only show small improvements in this task after an intervention, which could limit the task’s responsiveness.

In contrast to our results, Andersson et al. [50] and Strand et al. [51] found that 50-ft walk had inadequate responsiveness. The differences in results might be explained by dissimilarities in patient characteristics. Andersson et al. [50] and Strand et al. [51] included patients with non-specific chronic LBP who underwent non-surgical interventions. Patients with chronic LBP undergoing lumbar fusion surgery in the current study had motion-elicited back pain and may therefore have had difficulties with quick movements of the spine. As such, 50-ft walk could be challenging for these patients, since the task requires them to make a quick turn after having walked 25 ft. In contrast, the patients in the two previous responsiveness studies [50, 51] may have found the task less challenging. In addition, Andersson et al. did not use a hypothesis-testing approach to evaluate responsiveness [50], which could also explain why their results differed from ours.

The MICs for improvement in the current study might be used by researchers and clinicians as reference values when interpreting patients’ postoperative change scores [24, 43]. In research, the MICs for improvement could, for example, be used to evaluate the proportion of “responders” to treatment, where patients with change scores larger than the MIC values are classified as responders [52]. It is, however, important to acknowledge that the MIC is a group-based statistic and that the value for MIC might not always reflect an individual patient’s view of the change [53]. Thus, when comparing an individual patient’s change score with the current study’s MICs for improvement in clinical practice, it is essential to interpret the change score in relation to the patient’s reported experience and not just the MIC. Comparing individual change scores with the MICs might, for instance, serve as a reference for what the “average” patient finds important and could possibly aid the shared decision-making process in the patient’s postoperative rehabilitation. However, the 95% confidence intervals of the MICs were wide, and they should therefore be viewed with some caution.
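As an illustration of the responder classification mentioned above, the sketch below applies two of the absolute MICs for improvement reported in this study to hypothetical change scores. Two choices are assumptions made for the example: a change exactly equal to the MIC is counted as a response, and for the timed tasks, where the MIC is negative, a response means a reduction in time of at least that size. As noted above, such group-based thresholds should not replace the individual patient's own account of change.

```python
def is_responder(change, mic, lower_is_better=False):
    """True when the change score reaches the MIC in the direction of improvement."""
    if lower_is_better:      # timed tasks: the MIC is negative (fewer seconds)
        return change <= mic
    return change >= mic


def responder_proportion(changes, mic, lower_is_better=False):
    """Share of patients whose change score reaches the MIC."""
    flags = [is_responder(c, mic, lower_is_better) for c in changes]
    return sum(flags) / len(flags)


# Absolute MICs for improvement reported in this study.
MIC_STAIRS_STEPS = 20
MIC_TUG_SECONDS = -1.3

# Hypothetical change scores (follow-up minus baseline).
stair_changes = [25, 8, 20, 31, -4]
tug_changes = [-2.0, -0.5, -1.3, 0.4]

stair_rate = responder_proportion(stair_changes, MIC_STAIRS_STEPS)
tug_rate = responder_proportion(tug_changes, MIC_TUG_SECONDS, lower_is_better=True)
```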

In order to detect changes as small as the MIC, it is important that the MIC is larger than the smallest detectable change (SDC), defined as “the smallest change that can be detected by the measurement instrument, beyond measurement error” [24]. The MIC of 1-min stair climbing in the present study is larger than the SDC (derived from the limits of agreement) in Smeets et al. [19], which suggests that when a patient scores change equal to or greater than the MIC this is indeed an important change and unlikely to be due to measurement error. In contrast, the MICs for improvement of 50-ft walk and timed up-and-go in the present study are below the smallest detectable change given in previous studies [17, 19], meaning that observed changes could be due to measurement error and not reflect important and real changes. As the SDCs have only been assessed in patients with chronic LBP who undergo conservative treatment [17, 19], future studies should investigate the SDC specifically for patients with chronic LBP who undergo lumbar fusion surgery.

A strength of the present study is that it is one of the first to investigate the responsiveness of physical capacity tasks by testing a priori hypotheses. Using a hypothesis testing approach in the assessment of responsiveness has been recommended by experts in clinimetrics since it minimizes bias in the interpretation of the results [24, 28]. Another strength of the study is that we used an anchor-based method (the optimal cut-off point of the ROC curve) to determine MICs. Anchor-based methods have been recommended over so-called distribution-based methods, such as the standardized response mean or other effect size parameters [24]. Moreover, we used construct-specific GPE scales rather than generic GPE scales in the anchor-based method since previous research implies that construct-specific GPE scales generate better approximations of MICs than do generic ones [37]. However, there is no consensus on the optimal method for determining MIC. For instance, the so-called predictive modeling approach has been shown to be a good alternative to the optimal cut-off point method [54]. Research also suggests that MIC estimates may be biased when the proportion of improved patients is higher than 50% (in our study, 60% of the patients were improved) [55]. Consequently, future studies using other methods for determining MIC and also adjusting for the proportion of improved patients might provide better estimates than in the current study.

A limitation of this study is that a large proportion (36 patients) of those who attended the follow-up visits did not fill out the construct-specific GPE scales. The reason for this is that MIC was first planned to be investigated with a generic GPE scale [39] instead of the construct-specific ones. However, during the course of this study, other studies showed that construct-specific GPE scales seemed to be more suitable for determining MIC [37, 56], and we therefore decided to use this type of scale instead. A natural consequence of this decision is that the results for Hypothesis 1 and the MICs for improvement had less statistical power than the other analyses, which is reflected in the wide confidence intervals of these MICs.

Another limitation could be potential selection bias since the patients were part of an RCT. Patients with higher preoperative levels of disability and pain intensity may have declined study participation as the RCT required patients to travel to one of the spine clinics to see a physical therapist before surgery [30]. This could be the reason why the study sample reported a slightly lower disability level and back pain intensity compared with patients in Swespine undergoing surgery for chronic LBP due to DDD [2]. However, our study sample had similar characteristics to patients in Swespine in terms of age, duration of symptoms and proportion of men and women. It is therefore reasonable to assume that our findings are generalizable to most patients undergoing lumbar fusion surgery for chronic LBP, but possibly not to those with the highest preoperative levels of disability and pain intensity.

Conclusions

The results of responsiveness imply that 1-min stair climbing, 50-ft walk, and timed up-and-go have the ability to detect changes in physical capacity over time in patients with chronic LBP who have undergone lumbar fusion surgery. In contrast, 5-min walk showed inadequate responsiveness for this patient group.