Introduction

Back pain is a leading cause of disability globally [1]; however, pain is first and foremost an individual problem. Patient-reported outcome measures (PROMs) are used both scientifically and clinically to assess patients’ subjective symptom burden. Generic PROMs generalize across groups and conditions, while condition-specific instruments are intended to target a defined condition. Hence, condition-specific instruments are assumed to be more responsive and more likely to detect clinical changes [2]. Around 200 PROMs for back pain exist; however, fewer questionnaires are warranted to allow comparison of results across studies and inclusion of results in meta-analyses [3].

Responsiveness is defined as the ability of a PROM to detect change in the measured construct over time [4]. This is important when assessing the benefits of interventions. To interpret change scores, a frequently used benchmark is the minimal important change (MIC) [5]. The MIC is the smallest change that patients perceive as important. Hence, the MIC can aid the interpretation of PROM scores and is a useful measure for both planning and evaluating clinical trials [5]. Reported responsiveness and MIC values for frequently used PROMs for back pain patients diverge [6]. This divergence may be caused by different target populations [7,8,9] and different calculation methods (anchor-based or distribution-based methods [10]).

A national register, the Norwegian Neck and Back Register (NNRR), was developed to improve health care services for neck and back pain patients in specialized care [11]. This register collects data prior to all initial outpatient consultations and includes 6- and 12-month follow-ups regarding pain and functional development. The register includes demographic characteristics of patients, in addition to the following PROMs, in accordance with the recommended core outcome measures for low back pain [6, 12]: pain in activity reported on an 11-point numeric rating scale (NRSa) [6], the Oswestry Disability Index (ODI) [12], and health-related quality of life measured by the EuroQol five-dimensional questionnaire (EQ-5D) [13].

The aim of this study was to calculate the responsiveness and MIC of NRSa (pain in activity), ODI and EQ-5D among patients with low back pain in a Norwegian specialist care setting, using both anchor-based and distribution-based methods.

Materials and methods

Design

The current study is a register-based cohort study based on data from the Norwegian Neck and Back Register (NNRR) from 2021. NNRR is a national quality register established in 2012. Outpatient clinics in all Departments of Physical Medicine and Rehabilitation (PMR) in Norway receiving neck and back pain patients are obliged to deliver data to NNRR. All patients attending the outpatient clinics are invited; hence, the register reflects an unselected population in a normal clinical setting. The register collects data digitally prior to the baseline PMR consultation and after 6 and 12 months. For the current study, we included back pain patients who completed both follow-up forms and the patient global impression of change (PGIC) (Figs. 1 and 2).

Fig. 1 Flowchart of participants responding to the Norwegian Neck and Back Register

Patient reported outcome measures

Recently, the first study based on NNRR was published, with a detailed description of the PROMs and the set-up of the register [11]. The current study includes pain intensity during activity during the past week (NRSa), reported on an 11-point numeric rating scale ranging from 0 (lowest pain) to 10 (highest pain). The Oswestry Disability Index (ODI) [12] consists of 10 items, each scored from 0 to 5. The summed score is presented as a percentage, where 0% indicates no pain-related disability and 100% maximum pain-related disability [12]. Health-related quality of life for the last 30 days was measured with EQ-5D, which includes five domains with five levels from no problem (1) to extreme problem (5). EQ-5D is reported on a scale from − 0.59 to 1, where 1 is the best health state, with a cut-off value of ≥ 0.75 considered normal health-related quality of life. The Norwegian Medicines Agency recommends utilizing the EQ-5D-3L algorithm for the UK [14].
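
As an illustration, the conversion of ODI item scores to a percentage can be sketched as follows (a minimal sketch; the function name and the handling of unanswered items are ours and follow the commonly used scoring rule, not a procedure specified by the register):

```python
def odi_percent(items):
    """Return the ODI summed score as a percentage of the maximum possible score."""
    answered = [score for score in items if score is not None]
    if not answered:
        return None
    # Summed score divided by the maximum for the answered items (5 points each), as a percentage
    return sum(answered) / (5 * len(answered)) * 100

print(odi_percent([2, 3, 1, 0, 4, 2, 1, 3, 2, 1]))  # 38.0
```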

Anchor for minimal important change

Patient Global Impression of Change (PGIC) scales are often used as an anchor to categorize patients’ improvement, deterioration or no important change [15]. At the 6- and 12-month follow-ups, the patients were asked to rate the effect of the treatment received on a 7-point Likert scale, where (1) was “I am all fine”, (2) “much improved”, (3) “minimally improved”, (4) “unchanged”, (5) “minimally worse”, (6) “much worse” and (7) “worse than ever”. For the statistical analyses, the PGIC was trichotomized into three ordinal categories: 1 to 3 as “improved”, 4 as “unchanged” and 5 to 7 as “worsened”. We also applied an alternative trichotomization: 1 and 2 as “much improved”, 3 to 5 as “unchanged” and 6 and 7 as “much worsened”.
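
The analyses in this study were run in SPSS; the recoding can nevertheless be expressed compactly as follows (a minimal sketch in Python/pandas, with a hypothetical column name):

```python
import pandas as pd

# Minimal sketch of the trichotomization; the column name "pgic" is hypothetical.
df = pd.DataFrame({"pgic": [1, 2, 3, 4, 5, 6, 7]})

# Primary trichotomization: 1-3 "improved", 4 "unchanged", 5-7 "worsened"
df["pgic_3cat"] = pd.cut(df["pgic"], bins=[0, 3, 4, 7],
                         labels=["improved", "unchanged", "worsened"])

# Alternative trichotomization: 1-2 "much improved", 3-5 "unchanged", 6-7 "much worsened"
df["pgic_3cat_alt"] = pd.cut(df["pgic"], bins=[0, 2, 5, 7],
                             labels=["much improved", "unchanged", "much worsened"])
print(df)
```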

Statistics

The statistical analysis was performed with IBM SPSS Statistics version 28. We present descriptive data as means (standard deviations).

Responsiveness

We calculated responsiveness for both improvement and worsening, with the “unchanged” group as the reference group. Responsiveness was calculated using the standardized response mean (SRM) and the area under the receiver operating characteristic (ROC) curve (AUC). The SRM was estimated by dividing the mean change score by the standard deviation (SD) of the change scores in the improved, unchanged, and worsened groups of patients. We used the cut-offs SRM > 0.8 indicating large responsiveness, 0.5 to 0.8 moderate, and 0.2 to < 0.5 low. The AUC for the ROC curve (ROCAUC) represents the ability of the PROM to discriminate between patients who have improved and patients who have not improved. A Spearman correlation coefficient > 0.30 between the anchor and the change score of the measurements was required [16]. Responsiveness can be considered adequate when the AUC is at least 0.7 [4]. The AUC was estimated with 95% confidence intervals.
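
Although the analyses were run in SPSS, the two responsiveness statistics can be illustrated as follows (a minimal sketch with simulated data; the variable names and the coding direction of the change score are assumptions, not taken from the register):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

# Simulated stand-in data: 'change' is the PROM change score coded so that higher
# values indicate more improvement, and 'improved' is the PGIC-based anchor (1/0).
rng = np.random.default_rng(0)
change = np.r_[rng.normal(10, 8, 80), rng.normal(1, 8, 80)]
improved = np.r_[np.ones(80, dtype=int), np.zeros(80, dtype=int)]

# Standardized response mean: mean change divided by the SD of the change scores
srm = change[improved == 1].mean() / change[improved == 1].std(ddof=1)

# ROCAUC: ability of the change score to separate improved from unchanged patients
auc = roc_auc_score(improved, change)

# Spearman correlation between the anchor and the change score (required > 0.30)
rho, _ = spearmanr(improved, change)
print(f"SRM = {srm:.2f}, ROCAUC = {auc:.2f}, rho = {rho:.2f}")
```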

Anchor-based minimal important change

We used the anchor-based ROC method to estimate the MIC values for each PROM [4]. ROC curves were plotted as the combination of sensitivity and 1 − specificity for each change score. Sensitivity is the proportion of importantly improved patients correctly identified by the PROM, and specificity is the proportion of patients with no important change correctly identified by the PROM. The three groups of patients according to the PGIC, “improved”, “unchanged” and “worsened”, overlap in their change scores, and the optimal cut-off point is where the sum of the proportions of misclassifications, (1 − sensitivity) + (1 − specificity), is smallest, or equivalently the point closest to the upper-left corner [5].
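
For clarity, the ROC-based cut-off selection can be sketched as follows (a minimal sketch with simulated data; the function and variable names are ours, and improvement is assumed to be coded as a positive change):

```python
import numpy as np
from sklearn.metrics import roc_curve

def roc_mic(improved, change):
    """Change-score cut-off minimizing (1 - sensitivity) + (1 - specificity)."""
    fpr, tpr, thresholds = roc_curve(improved, change)
    misclassification = (1 - tpr) + fpr
    return thresholds[np.argmin(misclassification)]

# Example with simulated data; improvement is coded as a positive change score.
rng = np.random.default_rng(1)
change = np.r_[rng.normal(10, 8, 100), rng.normal(0, 8, 100)]
improved = np.r_[np.ones(100, dtype=int), np.zeros(100, dtype=int)]
print("Anchor-based MIC (ROC cut-off):", round(roc_mic(improved, change), 1))
```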

Distribution-based minimal important change

We also calculated the MIC according to the distribution-based approach, using the two methods listed below; an illustrative calculation follows the list.

1. 0.5 SD: half the SD of the measured change in PROM scores [17].

2. The effect size: multiplying the SD of the baseline score by 0.2 (small effect size) [18].
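
The two distribution-based calculations reduce to simple formulas, sketched below with simulated data (a minimal sketch; the variable names are assumptions, not taken from the register):

```python
import numpy as np

# Simulated stand-in data for one PROM; variable names are hypothetical.
rng = np.random.default_rng(2)
baseline = rng.normal(35, 14, 500)   # baseline scores (e.g. ODI, 0-100)
change = rng.normal(8, 12, 500)      # change scores from baseline to follow-up

mic_half_sd = 0.5 * np.std(change, ddof=1)      # method 1: 0.5 x SD of the change scores
mic_small_es = 0.2 * np.std(baseline, ddof=1)   # method 2: 0.2 x SD of the baseline scores
print(f"0.5 SD MIC: {mic_half_sd:.1f}, small effect size MIC: {mic_small_es:.1f}")
```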

Ethics

The Norwegian Directorate of Health granted national status to the medical quality registry (NNRR) in 2011. The registry adheres to the General Data Protection Regulation, Article 6(1)(e) (public interest), and follows the regulation on medical quality registries. An updated Data Protection Impact Assessment (DPIA) was completed in 2022 to ensure data security and privacy (2020/5880). Participation is based on written consent. A national expert group reviews all applications for delivery of data for research purposes, and approval from the Data Protection Office is required for the specific research project. Hence, this study was approved by the Data Protection Officer (22/07399) at Oslo University Hospital, and the data were delivered and stored in accordance with the regulations of both Oslo University Hospital and NNRR.

Results

The patients responding to the PGIC and both follow-ups at 6 and 12 months had a similar gender distribution, baseline ODI, NRSa, duration of symptoms, education, and employment rate as the excluded patients (Table 1). However, the included patients were significantly older, had lower HSCL-10 scores, and more often registered Norwegian nationality. The means of the outcome measures are presented in Table 2.

Table 1 Baseline characteristics for included (n = 1617) and excluded patients (n = 2019)
Table 2 Baseline, 6-, and 12-months follow-up scores for the PROMs for included patients (patients with complete follow-ups)

Apart from EQ-5D at 6 months, the change scores from baseline to 6 and 12 months for all PROMs correlated with the anchor question above 0.3 (Table 3).

Table 3 Spearman correlation coefficients between the change scores for the instruments and the anchor question

Patient global impression of change

Based on the responses to the PGIC, 42% of participants considered themselves as “improved” at the 6-months follow-up and 43% considered themselves “improved” at 12 months (Table 4). Correspondingly, 17% and 20% considered themselves “much improved” at 6 and 12 months, respectively. At 6 months 15% perceived a worsening, with 18% reporting worsening at 12 months. Likewise, 5% considered themselves “much worsened” at 6 months and 8% at 12-months follow-up.

Table 4 Distribution of the responses on the PGIC at 6- and 12-months follow-up (n = 1617)
Fig. 2 Box plot with 95% confidence intervals for change in ODI and NRSa at 6- and 12-months follow-up. Oswestry Disability Index (ODI) (0–100), Pain in activity on a Numeric Rating Scale (NRSa) (0–10)

Responsiveness

ROCAUC estimates indicated adequate responsiveness for improvement for ODI (> 0.7) at both 6- and 12-months follow-up. A slightly higher ROCAUC was documented when using only patients reporting much improved as the anchor, versus the group comprising both improved and much improved (Table 5). NRSa was also responsive for both anchor groups at 6 months but not at 12-months follow-up (Table 5). ROCAUC for EQ-5D could only be calculated at 12-months follow-up. Responsiveness for EQ-5D was only documented when applying the “much improved” group as the anchor. Low responsiveness was found for all measurements regarding worsening of the condition, except for ODI in the much worse group. ROC curves for improved and much improved patients at 6- and 12-months follow-up are reported in Figs. 3 and 4.

Table 5 Responsiveness at 6- and 12-months follow-up. ROCAUC: area under the ROC curve
Fig. 3 ROC curves of Oswestry Disability Index (ODI) and Pain in activity on a Numeric Rating Scale (NRSa) at 6-months follow-up for patients with improvement and much improvement

Fig. 4 ROC curves of Oswestry Disability Index (ODI), Pain in activity on a Numeric Rating Scale (NRSa) and EuroQol five-dimensional questionnaire (EQ-5D) at 12-months follow-up for patients with improvement and much improvement

The SRM values in the improved group indicate large responsiveness for ODI and NRSa at both follow-ups, for both improvement groups (Table 6). For EQ-5D, the SRM indicated large responsiveness at 12 months, with moderate responsiveness for all improved patients. None of the SRM values for the worsened or much worsened groups indicated large responsiveness (Table 6).

Table 6 Responsiveness at 6- and 12-months follow-up

Minimal important change

The MIC values for improvement for NRSa varied from 0.4 to 2.5 depending on the calculation method (Table 7). Correspondingly, the MIC for ODI varied from 3.0 to 12.5 (Table 7).

Table 7 Minimal important change (MIC) calculations at different follow-ups

Discussion

To our knowledge, no previous studies have assessed responsiveness and MIC for NRSa, ODI and EQ-5D in a register-based cohort of patients with back pain. We found that the condition-specific outcome measure, ODI, was more responsive to change than the generic tools NRSa and EQ-5D. Additionally, we found variations in the calculated MIC values depending on the method applied.

Responsiveness

Overall, we found higher responsiveness for patients reporting the two highest improvement categories (“I am all fine” and “much improved”) than when “minimally improved” patients were included. Our results indicate adequate responsiveness for patients reporting improvement for ODI, consistent with previous research on low back pain [19]. NRSa only showed adequate responsiveness at 6-months follow-up. While frequently used for assessment of back pain, the reported responsiveness of NRS is inconsistent [20]. Lastly, EQ-5D had adequate responsiveness only at the 12-months follow-up, and specifically in the much improved group. A review of outcome measures in low back pain patients found inconsistent responsiveness of EQ-5D [6]. Hence, our findings correspond to prior research indicating that condition-specific PROMs are more responsive than generic tools [6].

NRSa was not responsive for worsening, while ODI and EQ-5D were only responsive for patients reporting “much worse”. Patients are referred to specialized care when they have considerable pain and disability, and their symptoms are assumed to decline as a function of treatment and time. Furthermore, the treatment of the patients in NNRR is mostly conservative, reducing the probability of complications compared to a surgical setting. Hence, fewer patients are expected to experience deterioration, making responsiveness for worsening less important than in a surgical context. We would therefore argue that the present cohort is well suited for assessing responsiveness for improvement, but less so for detecting declining function.

MIC

In the current study, we found that the distribution-based methods in general provided lower MIC values than the anchor-based method. In most cases the distribution-based MIC values were lower than the measurement error of the instruments, which questions their validity [5]. Therefore, for the remainder of the discussion we will, as suggested in the literature, focus on the anchor-based MIC [4]; as the MIC values for “all improved” participants were in the same range, we will focus on the “much improved” group.

We found that the NRSa MIC value for improvement was 2.5, which is in line with previously reported MIC values for NRSa in back pain populations [7, 10, 21]. The MIC for ODI varied from 9.5 to 12.5 at 6 and 12 months, which is in the lower range compared to other studies [7, 21]. The MIC for EQ-5D at 12 months was 0.12, which is also in the lower range of previous studies [7]. Hence, the MICs in our study are in the lower range compared to the existing literature. This might be caused by our longer follow-up period [7, 10] and by our MIC values reflecting a heterogeneous back pain population independent of spinal care. Another possible reason is the trichotomization of change (improved, unchanged, and worsened). Most other studies only apply a dichotomization (improved vs. non-improved), leading to an overestimation of responsiveness and MIC.

Nevertheless, MIC values should be interpreted with awareness of the measurement error of the PROM and of the studied population and context [22]. As stated, there is probably no “gold standard” for MIC. MIC changes according to the population and context; hence, MIC should potentially be regarded as an indicative range rather than a fixed estimate [23].

Finally, the presented MICs increased from six to twelve months. To our knowledge, few studies have assessed MIC over several measuring points in spinal populations. However, a similar pattern appeared in a study assessing patients with plantar fasciopathy [24]. Potentially this reflects “response shift”, with patients downscaling the importance of a change as a function of time [24]. This also illustrates the challenges of the anchor-based approach, where varying responses to the anchor may influence MIC values.

Over the years, different terms have been used for the difference patients perceive as important, e.g. MIC, minimal important difference (MID) and minimal clinically important difference (MCID) [23]. Common to all definitions is the word “important”; however, what constitutes an important improvement is open for discussion.

Strengths and limitations

A strength of the present study is the large number of participants, giving a sample size well above the general recommendations for estimating MIC [5]. The heterogeneity of the present back pain population increases the external validity. Measuring points at both six and twelve months provide MIC values applicable at different time points. Anchor-based MIC calculations were included, and all calculations are based on the recommended cut-off values for correlations between instruments [5]. A weakness of the study is the number lost to follow-up; only 45% of the patients responded to both follow-ups. Furthermore, the anchor question may represent a source of MIC variation. The psychometrics of the anchor have not been examined, nor has the trichotomization. As the anchor does not address the specific neck or back condition, patients may emphasize other constructs related to their health. Nevertheless, we believe that when this follow-up is part of a neck and back registry, the patients will interpret it as relating to their neck and back pain. However, as in all other studies assessing responsiveness and MIC, recall bias is important to consider when evaluating an anchor question [25].

Conclusion

Our estimated MIC thresholds will be helpful when evaluating the effects of interventions offered to back pain patients in a clinical setting. In a register-based twelve-month follow-up of back pain patients, we found that the condition-specific PROM ODI was more responsive to change than the generic tools for pain and health-related quality of life, NRSa and EQ-5D. The variations in responsiveness and MIC estimates also indicate that they should be regarded as indicative rather than fixed estimates.