ISSLS prize in clinical science 2020: the reliability and interpretability of score change in lumbar spine research

A statistically significant score change of a Patient-Reported Outcome Measure (PROM) can be questioned if it does not exceed the Minimal Important Change (MIC) or the Smallest Detectable Change (SDC) of the particular measure. The aim of the study was to define the SDC of three common PROMs in degenerative lumbar spine surgery, the Numeric Rating Scale (NRSBACK/LEG), the Oswestry Disability Index (ODI) and the EuroQol-5-Dimensions (EQ-5DINDEX), and to compare them to their MICs. The transition questions Global Assessment (GABACK/LEG) were also explored. Reliability analyses were performed on a test–retest population of 182 symptomatically stable patients, with characteristics similar to the Swespine registry population, who underwent surgery for degenerative lumbar spine conditions in 2017–2018. The MIC values were based on the entire registry (n = 98,732) using the ROC curve method. The ICC for absolute agreement was calculated in a two-way random-effects single-measures model. For categorical variables, weighted kappa and exact agreement were computed. For the NRS, the SDC exceeded the MIC (NRSBACK: 3.6 and 2.7; NRSLEG: 3.7 and 3.2, respectively), while both estimates were 18 for the ODI. The gap between the two estimates was remarkable for the EQ-5DINDEX, where the SDC was 0.49 and the MIC 0.10. The GABACK/LEG showed excellent agreement between the test and the retest occasions. For the tested PROM scores, changes must be considerable in order to distinguish a true change from random error in degenerative lumbar spine surgery research. These slides can be retrieved under Electronic Supplementary Material.

Commonly used PROMs in degenerative lumbar spine surgery include the Numeric Rating Scale (NRS, pain) and the EuroQol-5-Dimensions (EQ-5D, quality of life) [1]. When a PROM is used repeatedly on the same patient, a measurement error will be present because of natural fluctuations in symptoms, variation in the measurement process, or both. A useful way of presenting the measurement error is the Smallest Detectable Change (SDC), described by Polit and Yang as a change in score of sufficient magnitude that the probability of it being the result of random error is low [2]. In trials where a measurement of change is involved, it is practical to refer to a repeatability parameter such as the SDC, which is expressed in the units of the PROM in question.
The SDC is a measure of the reliability of a PROM, based on the measurement error and repeatability of each instrument. Recently published reviews found that studies exploring such measurement properties were few and of inadequate quality [3][4][5].
A statistically significant change in outcome does not necessarily mean that it is of interest in real life. A person's opinion about the smallest worthwhile score change is termed the Minimal Important Change (MIC) [6]. For many years, there has been conceptual confusion around the many measurement-of-change parameters defining the cut-off in a PROM score that distinguishes success from failure [7][8][9][10][11][12]. Terwee et al. [13] have emphasized the important link between SDC and MIC.
The aim of this study was to define the SDC in the most commonly used outcome measures in degenerative lumbar spine surgery and compare them to the MIC.

Outcome variables
The Numeric Rating Scale for back and leg pain, respectively (NRS BACK/LEG), the Oswestry Disability Index (ODI), version 2.1a, and the European Quality of Life questionnaire (EQ-5D INDEX) are well known and have been described in detail earlier [1].
The Global Assessment of back and leg pain, respectively, (GA BACK/LEG ) [14] assesses patients' retrospective perception of treatment effect. The question is worded: "How is your back/leg pain today as compared to before you had your back surgery?" with 6 response options: 0/Had no back/leg pain, 1/Completely pain free, 2/Much Better, 3/Somewhat Better, 4/Unchanged, 5/Worse.
The first question of the Short-Form 36 questionnaire (SF36 GH ) [15] was added to reveal changes in global health during the retest period. The question is worded: "In general, would you say your health is" with response options: Excellent/Very Good/Good/Fair/Poor.

The MIC population
MIC computations were based on the entire Swespine register [16]. Table 1 presents anthropometrics, baseline data and 1-year follow-up of the degenerative lumbar spine population operated on in 1998–2017 (n = 98,732). Adults with any of the three degenerative diagnoses (lumbar disk herniation, lumbar spinal stenosis or degenerative disk disease) were included.

The retest population
The study participants were recruited consecutively at Stockholm Spine Center and Spine Center Göteborg between November 2017 and May 2019. In order to cover as much of the range of each PROM scale as possible, they were recruited both from the waiting list (pre-op group) and from patients followed up 1 year after surgery (post-op group). At least 30 individuals from each of the three diagnosis groups were obtained. The pre-op group filled out the first booklet (T1) at the clinic on the day they were listed for surgery. The second booklet (T2) was sent by mail 1 week later, and the respondents were asked to return the form within 5 days. One reminder was sent after 1 week.
In the post-op group, a request for study participation was added to the 1-year Swespine follow-up booklet (T1). One week after the booklet was registered at the Swespine office, the second questionnaire (T2) was sent out by mail, with a request to return the form within 5 days. Inclusion in the pre-op group stopped when the total number of participants exceeded 30 in all three diagnoses. For the analyses, the pre-op and post-op groups, as well as the diagnoses, were merged.
The time interval between the two points of estimation, T1 and T2, ranged from 10 to 35 days. The difference in PROM score between T1 and T2 for each participant was plotted against the time interval and correlated in Spearman rank analyses to check whether the number of days between T1 and T2 influenced the PROM score.
The occurrence of systematic differences between T1 and T2 was examined using the Sign test for categorical data (i.e., GA BACK/LEG and SF-36 GH) and the Wilcoxon signed-rank test for continuous data (i.e., NRS BACK/LEG, ODI and EQ-5D INDEX).
A maximum of two missing items was accepted for the ODI and zero missing items for the remaining PROMs, according to published score algorithms [17,18].
The study was conducted according to the COSMIN checklist, boxes B, C, and J [6].
Descriptive data are presented as means (± SD) or numbers (%).

MIC
The MIC estimates were previously calculated for the diagnosis groups LDH, LSS, and DDD [19] using the anchor-based ROC curve method [20]. In the current study, MIC values without stratification for diagnosis were added. The measure used as gold standard was the GA, which has been shown to have an acceptable correlation to the instruments at issue [14]. Patients' self-assessments on the GA as either "pain free" or "much better" were considered an important improvement (i.e., equal to, or above, the MIC). The ability of each PROM to distinguish between improved and not improved patients was measured by the Area Under the ROC Curve (AUC), with an acceptable level of 0.70. The cut-off score defining the MIC represents the level where the sensitivity and specificity of the PROM are mutually maximized. The probability that a patient reaching the MIC will also express an important improvement on the GA is called the positive predictive value (PPV). The probability that a patient not reaching the MIC will express a non-important improvement on the GA is called the negative predictive value (NPV) [21].
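For illustration only, the anchor-based ROC approach described above can be sketched in Python (the authors' actual MIC computations were done in JMP; the function name `mic_roc` and the synthetic inputs below are assumptions, not part of the original analysis):

```python
import numpy as np

def mic_roc(change, improved):
    """Anchor-based MIC: the cut-off maximizing sensitivity + specificity.

    change   : array of change scores (higher = more improvement)
    improved : boolean array, True if the anchor (e.g., the GA) classifies
               the patient as importantly improved
    Returns (mic, auc, sensitivity, specificity, ppv, npv).
    """
    change = np.asarray(change, float)
    improved = np.asarray(improved, bool)
    pos, neg = change[improved], change[~improved]

    # AUC via the Mann-Whitney statistic: P(pos > neg) + 0.5 * P(ties)
    diff = pos[:, None] - neg[None, :]
    auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()

    # Scan candidate cut-offs; keep the one maximizing sens + spec
    best = None
    for c in np.unique(change):
        sens = (pos >= c).mean()   # improved patients at or above cut-off
        spec = (neg < c).mean()    # not-improved patients below cut-off
        if best is None or sens + spec > best[0]:
            best = (sens + spec, c, sens, spec)
    _, mic, sens, spec = best

    above = change >= mic
    ppv = improved[above].mean()       # P(improved | change >= MIC)
    npv = (~improved[~above]).mean()   # P(not improved | change < MIC)
    return mic, auc, sens, spec, ppv, npv
```

With real registry data, ties and overlapping distributions would of course make the AUC, PPV and NPV fall below 1.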

SDC
The reliability of change scores was expressed as the Smallest Detectable Change, calculated as SDC = 1.96 × √2 × SEM.

SEM
Agreement between T1 and T2 was expressed as the intra-individual standard deviation, also known as the Standard Error of Measurement (SEM) [13]. The SEM is the standard error in an observed score that obscures the true score and is given in the units of the PROM: SEM = √(intra-individual variance) from an ANOVA analysis. The difference between a subject's PROM score and the true value would be expected to be within ± 1.96 SEM for 95% of the individuals. The assumption that the size of the measurement error is unrelated to the magnitude of the measurement (i.e., absence of heteroscedasticity) was checked by plotting each patient's standard deviation against his or her mean.
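A minimal sketch of this computation, assuming two measurements (T1, T2) per patient: with k = 2 occasions, the within-subject variance of the ANOVA reduces to Σd²/(2n), where d is the T1 − T2 difference, and the SDC follows as 1.96 × √2 × SEM (the standard definition, cf. Terwee et al. [13]). The function name `sem_sdc` is illustrative:

```python
import numpy as np

def sem_sdc(t1, t2):
    """SEM and SDC from paired test-retest scores.

    With two measurements per subject, the within-subject (error)
    variance equals sum(d**2) / (2 * n), where d = t1 - t2.
    SDC = 1.96 * sqrt(2) * SEM: the smallest change distinguishable
    from random error with 95% confidence.
    """
    d = np.asarray(t1, float) - np.asarray(t2, float)
    sem = np.sqrt(np.sum(d ** 2) / (2 * len(d)))
    return sem, 1.96 * np.sqrt(2) * sem
```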

ICC
The reliability parameter was the Intra-class Correlation Coefficient (ICC). ICC estimates and their 95% CIs were calculated using an absolute-agreement, two-way random-effects, single-measures model. Based on the 95% CI of the ICC, estimates below 0.40 indicate poor reliability, 0.40–0.59 fair, 0.60–0.74 good and 0.75–1.00 excellent reliability [22]. The relation of the ICC to the SEM is SEM = SD × √(1 − ICC).
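As an illustration of the ICC variant used (two-way random effects, absolute agreement, single measures, often denoted ICC(A,1)), the estimate can be assembled from the standard two-way ANOVA mean squares. This is a sketch, not the SPSS procedure actually used in the study, and the function name `icc_a1` is an assumption:

```python
import numpy as np

def icc_a1(x):
    """ICC(A,1): two-way random-effects, absolute agreement, single measures.

    x : (n subjects) x (k occasions/raters) array of scores.
    """
    x = np.asarray(x, float)
    n, k = x.shape
    grand = x.mean()
    # Mean squares for rows (subjects), columns (occasions) and residual
    ms_r = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)
    ms_c = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)
    ss_e = np.sum((x - x.mean(axis=1, keepdims=True)
                     - x.mean(axis=0, keepdims=True) + grand) ** 2)
    ms_e = ss_e / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
```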

Kappa
The reliability measure weighted kappa was calculated for the categorical variables (i.e., GA BACK/LEG and SF-36 GH). An instrument is considered reliable when kappa is above 0.70 [6]. Since these instruments have several ordinal response options, kappa was calculated using quadratic weights, which is mathematically identical to an ICC of absolute agreement. Further, the overall agreement between T1 and T2, as well as the proportion of respondents indicating a better outcome at T1 than at T2 or vice versa, was calculated. IBM SPSS Statistics for Windows, Version 24.0 (IBM Corp., Armonk, NY) was used for all statistical analyses except the MIC computations, for which JMP®, Version 13.1 (SAS Institute Inc., Cary, NC, 1989–2019) was used.
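A sketch of quadratic-weighted kappa for two ratings on an ordinal scale, for illustration only (the study used SPSS); the function name `weighted_kappa` is an assumption:

```python
import numpy as np

def weighted_kappa(a, b, categories):
    """Quadratic-weighted kappa for two ratings on an ordinal scale."""
    cats = list(categories)
    m = len(cats)
    idx = {c: i for i, c in enumerate(cats)}
    # Observed joint distribution of the two ratings
    obs = np.zeros((m, m))
    for x, y in zip(a, b):
        obs[idx[x], idx[y]] += 1
    obs /= obs.sum()
    # Expected distribution under independence of the marginals
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    # Quadratic disagreement weights: (i - j)^2, scaled to [0, 1]
    w = np.array([[(i - j) ** 2 for j in range(m)]
                  for i in range(m)]) / (m - 1) ** 2
    return 1 - (w * obs).sum() / (w * exp).sum()
```

The quadratic weights penalize disagreements between distant response options far more than between adjacent ones, which is why a high weighted kappa indicates that misclassifications occur mainly between neighbouring categories.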

Ethical considerations
Informed consent was obtained from all participants in Swespine, and written consent was acquired from the participants in the retest study. This research project was approved by the regional ethical review board.

Descriptives
In total, 248 participants filled out the booklet at T1. Both questionnaires were returned by 182 (74.6%) participants, 83 from the pre-op group and 99 from the post-op group. Table 1 presents demographics and mean PROM scores at baseline and at the 1-year follow-up for the retest group and for the Swespine population.

Timing of measurements
The time interval between T1 and T2 was 20 ± 8 days. The number of days between T1 and T2 did not correlate with the PROM scores.

Measurement error and score change reliability
There were no statistically significant systematic differences between T1 and T2, as measured by the Wilcoxon signed-rank test (NRS, ODI, EQ-5D INDEX) and the Sign test (GA, Satisfaction, SF-36 GH), for any of the PROMs.
The data were not found to be heteroscedastic, meaning that the measurement error appeared to be uniform across scale values. Table 2 presents reliability measures for each PROM, demonstrating excellent or good reliability and large SDCs for all the PROMs. The influence of random error on the SDCs is illustrated in Fig. 2, with the ODI showing the typical pattern.
The MIC calculations were based on the lumbar Swespine register population, stratified for diagnosis [19]. In Table 3, the SDCs are compared to these MIC values. For NRS BACK/LEG and ODI the SDCs exceeded the MICs to some extent. As for the EQ-5D, the difference was more remarkable.
In Table 4, the SDCs are compared to the MIC values that were calculated for the entire lumbar Swespine population. The SDC for both NRS scales exceeded the corresponding MICs, while the SDC and MIC were equal for the ODI. The considerable gap between the SDC and MIC for the EQ-5D INDEX remained. The AUCs were all above 0.70. The ODI had the best ability to correctly classify patients as importantly improved according to GA with a sensitivity of 76%. The specificity was similar for all PROMs. NRS LEG reached the highest specificity (83%), indicating the best ability to correctly classify patients as not importantly improved.
The weighted kappa for the categorical variables GA BACK/LEG and SF-36 GH were above the level of  acceptance. The percentages of agreement are given in Table 5.

Discussion
This study found large SDCs, frequently exceeding the corresponding MIC cut-off values, for some of the most commonly used PROMs in spine surgery research. The error was mainly due to a large intra-individual variation between the two test occasions and not to systematic differences. This has important implications. For instance, consider a trial exploring a possible difference in outcome between two groups undergoing posterolateral fusion with or without interbody fusion, with NRS BACK as the outcome variable. Then, according to the present study, both groups need to reach a change of 3.6 before there is 95% certainty that the change from baseline is not mere chance. If, and only if, both groups reach this level of improvement can the research question be answered.
In other studies on low back pain populations, using the same definition of SDC as in this paper, the SDCs were also rather high: 2.4-4.7 for NRS BACK , 11-16.7 for ODI, and 0.28-0.58 for the EQ-5D [4,5,23].
The MIC corresponds to the minimal level of change that makes the efforts of the surgery worthwhile. A statistically detectable change does not reveal any information about its value in real life. That estimation has to be based on the opinions of the persons undergoing the treatment. Accepting the opinion-based MIC does not, however, allow for the exclusion of the SDC.

Fig. 2 A Bland–Altman plot of the ODI test–retest scores. The horizontal line close to "0" illustrates the mean difference in score between the test and the retest occasions. The upper dotted line is of interest if the concern is an improvement, and the lower line if the research question is about a deterioration. Values within these limits are, with 95% confidence, due to random error (n = 172).

If we recycle the example above but change the research question to whether there is a clinically important difference between the groups or not, a MIC in NRS BACK of 2.9 must be reached by both groups before the question can be answered. Note that the answer should not be given in terms of a mean difference between the groups, but rather as the percentage in each group reaching the MIC cut-point. However, as the SDC was 3.6, a change of 2.9 may be no more than measurement error, no matter the importance of personal opinions.
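The limits shown in the Bland–Altman plot of Fig. 2 follow directly from the test–retest differences; a minimal sketch (the function name `bland_altman_limits` is illustrative) could look like this:

```python
import numpy as np

def bland_altman_limits(t1, t2):
    """Mean test-retest difference and its 95% limits of agreement
    (mean difference +/- 1.96 * SD of the differences)."""
    d = np.asarray(t1, float) - np.asarray(t2, float)
    mean_d = d.mean()
    sd_d = d.std(ddof=1)   # sample SD of the paired differences
    return mean_d, mean_d - 1.96 * sd_d, mean_d + 1.96 * sd_d
```

Differences falling inside these limits cannot, with 95% confidence, be distinguished from random error, which is exactly why a score change smaller than the SDC is inconclusive.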
As long as it can be shown that the MIC estimate exceeds the SDC, the MIC can be used separately. But as soon as the opposite holds, both the SDC and the MIC should be presented in such a manner that the reader gets a clear picture of the true degree of change. This simultaneous use of a distribution-based cut-off value and an anchor-based estimate has earlier been advocated by Terwee and colleagues [13].
If the SDC far exceeds the MIC, as was the case for the EQ-5D, the use of that PROM should not be accepted, simply because the size of the error is too large to allow sound inferences. Why this was the case for the EQ-5D in the current study is not clear. Variations in measurement-of-change estimates for this particular PROM range from 0.15 to 0.45 [24]. In this study, the SDC was 0.48 and the MIC was 0.10–0.18, depending on which diagnosis group the calculations were based on. A possible explanation is that the preference-based summary index systematically divides the population into two, making it difficult to define an SDC, which is based on dispersion.
Based on the large Swespine database, the MIC values in this study may be considered credible. However, it must be remembered that the MIC is anchored to a retrospective single-item transition question, requiring each patient to remember his or her health state prior to the operation. Also demanded is an honest response about the degree of improvement or deterioration, in which the patient disregards factors such as disappointment, gratitude, insurance, sick leave or work-related issues. Human nature probably ensures that recall bias and response shift will always have an impact on the responses to these types of questions.
The PPV of 0.88 for NRS LEG indicates the probability that patients with a change exceeding the MIC also classified themselves as importantly improved on the anchor. The NPV of 0.64 is the probability that patients with a change smaller than the MIC reported a non-important improvement on the anchor.
The reliability of the retrospective single-item questions, interpreted by their weighted kappa values, was almost perfect (above 0.8) or substantial (0.75) according to Landis and Koch [25]. A high weighted kappa also indicates that misclassifications mainly occurred between adjacent response options.

Conclusion
A consequence of large measurement errors in PROMs is that considerable changes in outcome are needed in order to distinguish true change from random error.