Introduction

The most prevalent symptoms leading to an indication for lumbar spinal fusion are back and leg pain (BP, LP), along with compromised functionality [1]. Nevertheless, when all of a patient's characteristics are considered, the data on how effectively surgical intervention alleviates these symptoms in the individual patient remain unclear [2, 3]. Especially among patients with degenerative disease of the lumbar spine, and even more so among those with discogenic chronic low back pain, some patients profit massively from fusion surgery, while others experience no difference at all or even worsen. On average, there is evidence that spinal fusion in this population is no better than conservative management [4, 5]. To address this issue and to better distinguish which patients may benefit more from lumbar spinal fusion, Khor et al. [6] introduced the SCOAP-CERTAIN tool in 2018. This tool aims to determine the probabilities of improvement in function, back pain, and leg pain for lumbar fusion candidates one year after surgery [6]. These models demonstrated good accuracy in both the development and internal validation cohorts, making them potentially suitable for integration into everyday clinical practice. Still, the importance of rigorous validation of clinical prediction models (CPMs) on multicenter data from different populations (external validation) cannot be stressed enough: only through proper external validation can the reliability and clinical applicability of CPMs be ensured [7, 8]. Up to now, the SCOAP-CERTAIN tool has only been validated in a single Dutch center with 100 patients, revealing good discrimination but rather poor calibration [9]. As predictive probabilities hold more significance for clinicians and patients than binary classifications in making decisions about surgery, it might be premature to apply the current prediction tool in clinical practice.
Hence, we aimed to conduct a comprehensive external validation study involving 1115 patients from multiple centers to reevaluate the predictive ability of the Khor et al. [6] model regarding improvement in function and pain following lumbar spinal fusion for degenerative disease.

Materials and methods

Overview

A dataset comprising 1115 patients who underwent elective lumbar spinal fusion for degenerative disease from a multinational study (FUSE-ML) [10] was utilized to externally validate the machine learning-based model published by Khor et al. [6]. This model predicts improvement in functional outcome (Oswestry Disability Index, ODI), back, and leg pain. We compared the values predicted by their model with the true outcomes at 12 months after lumbar fusion in our cohort, providing a rigorous multicenter external validation of this model. Approval for the utilization of patient data in research was granted by individual local institutional review boards (IRBs) of FUSE-ML centers. Patients either gave informed consent, or the requirement for informed consent was waived as per the local IRB’s stipulations.

Patient population

Data were extracted from a prospective registry that included patients undergoing elective thoracolumbar pedicle screw placement for up to 6 levels, addressing degenerative pathologies such as spinal stenosis, spondylolisthesis, disc herniation, failed back surgery syndrome (FBSS), radiculopathy, or pseudarthrosis. Patients were excluded when the primary surgical indication was one of the following: infection, spinal tumor, fracture (traumatic or osteoporotic), or deformity surgery for scoliosis. Additionally, patients with moderate or severe scoliosis (coronal Cobb angle > 30°/Schwab classification sagittal modifier + or ++), those with missing outcome data at 12 months, a lack of informed consent, or those younger than 18 years were excluded. Our manuscript has been developed in line with the guidelines outlined in the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement [11].

Data collection

Data preparation adhered to the guidelines established by Khor et al. [6]. Baseline clinical and radiological information was obtained during the initial outpatient appointment. Subsequently, patients underwent a comprehensive clinical examination and magnetic resonance imaging (MRI). Collected parameters included the Oswestry Disability Index (ODI) and the numeric rating scales for back (NRS-BP) and leg (NRS-LP) pain, as well as gender, age, smoking status, ethnicity, American Society of Anesthesiologists (ASA) grade, opioid consumption, presence of asthma, and prior spine surgery. In one center, functional outcome was assessed using the Core Outcome Measures Index (COMI), which was converted into the ODI according to a validated mapping function [12]. The clinical outcome parameters ODI, NRS-BP, and NRS-LP were collected again at the 12-month follow-up.

Outcome measures

ODI, NRS-BP, and NRS-LP were collected in the form of standardized questionnaires with values ranging from 0 to 100 [13] and 0–10 [14], respectively, with higher values representing increasing severity of functional disability or pain. In line with Khor et al. [6], we defined clinical improvement as achievement of the minimum clinically important change (MCIC), i.e., a ≥ 15-point reduction for ODI and a ≥ 2-point reduction for NRS-BP and -LP [15, 16].
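The MCIC definitions above reduce to a simple threshold on the baseline-to-follow-up change. As a minimal illustration (Python is used here purely for exposition; the study's analyses were performed in R, and the function and constant names below are ours, not from the original tool):

```python
# Illustrative only: MCIC thresholds as defined in the text.
ODI_MCIC = 15  # minimum reduction on the 0-100 ODI scale
NRS_MCIC = 2   # minimum reduction on the 0-10 NRS scales

def achieved_mcic(baseline: float, followup: float, threshold: float) -> bool:
    """True if the score improved (dropped) by at least `threshold` points."""
    return (baseline - followup) >= threshold

# An ODI improving from 48 to 30 (an 18-point reduction) achieves the MCIC;
# an NRS-BP improving from 6 to 5 (a 1-point reduction) does not.
print(achieved_mcic(48, 30, ODI_MCIC))  # True
print(achieved_mcic(6, 5, NRS_MCIC))    # False
```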

Statistical analysis

Missingness is reported in Supplementary Table 1. To address missing data in the predictor variables, which were presumed to be missing at random, we conducted imputation using a k-nearest-neighbor approach [17]. Patients with a baseline ODI lower than 15 or NRS lower than 2 were removed from the respective analyses as specified by Khor et al. [6], as these patients cannot achieve the MCIC in the respective outcome. The three CPMs of the SCOAP-CERTAIN tool were then reconstructed using the reported model parameters and intercepts. We calculated the area under the receiver operating characteristic curve (AUC) by comparing predicted probabilities with the actual MCIC outcome at the 12-month mark. Calibration was evaluated using both visual inspection of calibration curves and quantitative analysis, including the calibration intercept and slope (optimal calibration intercept: 0; optimal calibration slope: 1). Calibration assesses the extent to which a model's predicted probabilities, spanning from 0 to 100%, align with the observed incidence of the binary endpoint, which represents the true posterior [18]. Additionally, in terms of calibration, we examined expected/observed event ratios (E/O ratios), which describe the overall calibration of a prediction model [7], the Brier score [19], and the estimated calibration index (ECI) [20]. Likewise, the Hosmer–Lemeshow (HL) test was employed to assess goodness-of-fit, which gauges whether the observed event rates align with the expected event rates within different population subgroups [21]. The binary classification threshold was set at 0.5, as this cutoff is the one most likely used by Khor et al. [6] and also appears suitable for the dataset based on the "closest-to-(0,1)" criterion.
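The main calibration quantities named above can be sketched in a few lines. The following is an illustrative Python reimplementation, not the R code used in the study: the calibration intercept is the intercept of a logistic refit with logit(p) as a fixed offset (calibration-in-the-large), the calibration slope is the coefficient of a logistic refit on logit(p), and the E/O ratio and Brier score are direct summaries of predictions versus outcomes. All function names are ours; the demo data are simulated.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def calibration_intercept(lp, y, n_iter=25):
    """Calibration-in-the-large: intercept-only logistic fit with logit(p)
    as a fixed offset, solved by Newton-Raphson. Ideal value: 0."""
    a = 0.0
    for _ in range(n_iter):
        pr = [expit(a + li) for li in lp]
        grad = sum(yi - pi for yi, pi in zip(y, pr))
        hess = sum(pi * (1.0 - pi) for pi in pr)
        a += grad / hess
    return a

def calibration_slope(lp, y, n_iter=25):
    """Logistic refit y ~ a + b*logit(p); b is the calibration slope.
    Ideal value: 1 (b < 1 indicates overfitted / too-extreme predictions)."""
    a, b = 0.0, 1.0
    for _ in range(n_iter):
        pr = [expit(a + b * li) for li in lp]
        g0 = sum(yi - pi for yi, pi in zip(y, pr))
        g1 = sum((yi - pi) * li for yi, pi, li in zip(y, pr, lp))
        w = [pi * (1.0 - pi) for pi in pr]
        h00 = sum(w)
        h01 = sum(wi * li for wi, li in zip(w, lp))
        h11 = sum(wi * li * li for wi, li in zip(w, lp))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return b

# Demo on simulated, perfectly calibrated predictions (y ~ Bernoulli(p))
rng = random.Random(0)
p = [rng.uniform(0.1, 0.9) for _ in range(4000)]
y = [1 if rng.random() < pi else 0 for pi in p]
lp = [logit(pi) for pi in p]

intercept = calibration_intercept(lp, y)   # ideal: 0
slope = calibration_slope(lp, y)           # ideal: 1
eo_ratio = sum(p) / sum(y)                 # expected/observed events, ideal: 1
brier = sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(y)
```

On calibrated data as simulated here, the intercept is close to 0, the slope close to 1, and the E/O ratio close to 1, which is the behavior the metrics quantify.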
Following this, we compared the binary classifications to the actual observed MCIC outcome in confusion matrices and calculated various performance metrics, including accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and the F1 Score. All continuous data are reported as mean ± standard deviation (SD). Whenever relevant, we offer bootstrapped 95% confidence intervals (CIs) using 1000 resamples with replacement. All analyses were performed using R Statistical Software (v 4.3.0; R Core Team 2023) [22].
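The percentile bootstrap used for the confidence intervals resamples patients with replacement and recomputes the statistic on each resample. A minimal sketch (again an illustrative Python version, not the R code used in the study; `auc` is computed via the Mann–Whitney U statistic):

```python
import random

def auc(probs, labels):
    """AUC via the Mann-Whitney U statistic (ties count as 0.5)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(probs, labels, stat=auc, n_boot=1000, seed=42):
    """Percentile bootstrap: resample patients with replacement, recompute
    the statistic, and return the 2.5th and 97.5th percentiles."""
    rng = random.Random(seed)
    n = len(labels)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:   # skip degenerate resamples with one class only
            continue
        values.append(stat([probs[i] for i in idx], ys))
    values.sort()
    return values[int(0.025 * len(values))], values[int(0.975 * len(values))]

# Toy example: mostly higher predicted probabilities for actual improvers
probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
lo, hi = bootstrap_ci(probs, labels)
```

With a real cohort of 1115 patients the resamples are rarely degenerate; the guard matters only for tiny toy samples like the one above.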

Results

Overview

A total of 1115 patients were included in this study, with a mean (SD) age of 60.8 (12.5; range 19–89) years; 455 (40.8%) were male. Patient characteristics and surgical parameters are presented in Table 1. Lumbar spinal stenosis accounted for 55.4% of indications for surgery, followed closely by spondylolisthesis with 53.7%. The proportion of patients with a high ASA grade was 29.5%.

Table 1 Summary of patient characteristics and outcome measures

Patient‑reported outcome

At the 12-month postoperative mark, there was a notable improvement in ODI scores, with a mean change of −21.8 ± 16.7 from baseline. Additionally, NRS-BP and NRS-LP showed improvements of −3.1 ± 2.4 and −2.5 ± 2.5, respectively. The MCIC was achieved by 68% of patients for ODI, by 77% for NRS-BP, and by 71% for NRS-LP. Table 1 provides a summary of the outcome measures for our external validation cohort.

Performance evaluation

Calibration

Table 2 shows a detailed list of calibration metrics of the external validation. In predicting the MCIC in ODI at 12 months, we observed a calibration intercept of 1.01 (95% CI 0.87–1.16) and a slope of 0.84 (95% CI 0.68–1.01), along with an HL p-value < 0.001 (refer to Fig. 1A). The low E/O ratio of 0.75 (95% CI 0.71–0.79) suggests that the model tended to underestimate the likelihood of a favorable outcome. In our multicenter cohort, the Brier score of 0.22 (95% CI 0.21–0.23) indicated moderate accuracy of the probability forecasts.

Table 2 Calibration performance metrics of the three prediction models on external data
Fig. 1
figure 1

A–C Calibration plots for prediction of improvement in 12-month ODI (Panel A), back (Panel B), and leg pain (Panel C) according to the NRS. Calibration intercept and slope were 1.01 and 0.84 for ODI, 0.97 and 0.87 for NRS-BP, and 0.04 and 0.72 for NRS-LP, respectively. Red lines depict ideal calibration, black lines show the flexible calibration according to LOESS, and triangles represent the grouped observations. ODI Oswestry Disability Index, BP back pain, LP leg pain, NRS numerical rating scale, LOESS locally estimated scatterplot smoothing

Similarly, when predicting MCIC in NRS-BP, we identified a calibration intercept of 0.97 (95% CI 0.80–1.15), a slope of 0.87 (95% CI 0.70–1.08), and an E/O ratio of 0.97 (95% CI 0.93–1.00), with a corresponding HL p-value < 0.001 (as depicted in Fig. 1B). For the prediction of MCIC in NRS-LP within our cohort, we found a calibration intercept of 0.04 (95% CI −0.14–0.22) and a slope of 0.72 (95% CI 0.55–0.90). The HL p-value was < 0.001. Notably, when examining the calibration plot (refer to Fig. 1C), the model appeared to overestimate the likelihood of a favorable outcome in our cohort, which was further supported by the high E/O ratio of 1.16 (95% CI 1.13–1.20). Brier scores for back and leg pain indicated higher accuracy than for the ODI, at 0.16 (95% CI 0.15–0.18) and 0.15 (95% CI 0.13–0.16), respectively.

Discrimination

Table 3 provides a comprehensive overview of discrimination measures, while Fig. 2A–C illustrate the AUC values for each individual center for the three models during external validation. When predicting MCIC for ODI, we achieved an AUC of 0.70 (95% CI 0.67–0.74), with a sensitivity of 0.63 (95% CI 0.59–0.66) and specificity of 0.68 (95% CI 0.62–0.73). Similarly, in predicting MCIC for NRS-BP, we obtained AUC values of 0.72 (95% CI 0.68–0.76), sensitivity at 0.84 (95% CI 0.82–0.87), and specificity at 0.45 (95% CI 0.38–0.52) during external validation. Finally, for predicting NRS-LP, the AUC reached 0.70 (95% CI 0.66–0.74), with a very high sensitivity of 0.96 (95% CI 0.94–0.97) and low specificity of 0.15 (95% CI 0.10–0.20).

Table 3 Discrimination performance metrics of the three prediction models on external data
Fig. 2
figure 2

A–C Forest plots of the area under the curve (AUC) for all three prediction models: ODI (Panel A), back (Panel B), and leg pain (Panel C) according to the NRS. Listed is the overall summary in addition to all centers individually. ODI Oswestry Disability Index, BP back pain, LP leg pain, NRS numerical rating scale, CI confidence interval

Discussion

To address the problem of significant variability in postoperative outcome after lumbar fusion surgery due to a wide range of patient characteristics [5], CPMs were developed to assist in the decision-making process [23]. Khor's model demonstrated good calibration and performance in its own internal validation cohort [6], with comparable values in a small single-center external validation cohort [9]. Here, we performed a rigorous, multicenter external validation of Khor's models (coined the SCOAP-CERTAIN tool) for predicting achievement of the MCIC in three different clinical outcomes at 12 months after lumbar fusion for degenerative disease. With data from the FUSE-ML study, we assessed the generalization of these CPMs and found that, while the models generalize moderately well in terms of discrimination (binary prediction), their calibration (continuous risk assessment) lacks robustness, although the cohorts appear comparable.

It is notoriously difficult to predict treatment response for patients undergoing lumbar spinal fusion for degenerative disease. While some pathologies such as isthmic spondylolisthesis represent a relatively clear indication for fusion, for others, such as low-grade degenerative spondylolisthesis with stenosis, it is less clear whether the addition of fusion confers a benefit [24, 25]. The most extreme example certainly is chronic low back pain with concomitant discopathy [26]. While some individual patients with this pathology do profit from fusion, an unselected population does not: randomized studies consistently indicate that, on the whole, fusion surgery does not yield significantly superior outcomes compared to conservative treatment for chronic low back pain [27]. Although surgery may not exhibit a clear advantage over conservative approaches in unselected patients with chronic low back pain, specific subsets of patients can genuinely experience benefits [28]. The critical factor for success in degenerative spine surgery lies in meticulous patient selection.

In the past, different methods were established to help select the best treatment option for the individual patient. From discography to pantaloon casting to radiological modifiers such as Modic-type endplate changes, many potential predictors of surgical success have been evaluated, but often with very limited predictive ability [26, 28]. Initially, mostly radiological or physician-based outcomes were assessed, but over time, patient-reported outcome measures (PROMs) such as the ODI [29] were implemented and validated, aiming to quantify and weigh symptoms and thus to justify the risks and benefits of a potential surgery [30]. This opened up the possibility of truly personalized medicine: the aim of current medical decision-making is to consider every aspect of a patient's physical and mental characteristics to choose the treatment that best fulfills a wide range of demands, such as symptom relief for the patient, healing or preventing progression of disease, and balancing healthcare costs by avoiding unnecessary diagnostics, treatments, and complications [31, 32]. Another delicate aspect complicating medical decision-making is the wide range of symptoms that can be present in patients with degenerative lumbar spine disease, e.g., facet-mediated pain, discogenic pain, or myofascial pain [33], among others. Ideally, we could pinpoint specific symptoms or patient characteristics for which lumbar fusion is known to provide relief. Combined with further patient information, e.g., comorbidities, to weigh the general risks of surgery against the expected benefit, this could lead to improved risk–benefit counseling in the clinic [34].

Thus, the aim of CPMs in the surgical field is to identify which patients benefit from a certain intervention and which do not. Khor et al. [6] published an internally validated CPM tool (SCOAP-CERTAIN) that aims to assist surgical decision-making by providing predictive analytics on which patients scheduled for lumbar spinal fusion for degenerative disease are most likely to show significant 12-month improvement in functional outcome and pain severity. Rigorous multicenter/multicultural external validation is a crucial process necessary before clinical implementation of CPMs [7, 8, 35]. To assess generalization of a CPM, calibration and discrimination need to be quantified [36]. Discrimination refers to a model's capacity to correctly categorize patients in a binary way, namely into those experiencing the MCIC and those who do not see a clinically relevant improvement. The model's capability to generate accurate predicted probabilities (between 0 and 1) that closely align with the true posterior (observed frequency), on the other hand, is termed calibration. The SCOAP-CERTAIN tool had previously been evaluated in a small single-center external validation study of Dutch patients, demonstrating adequate discrimination but only fair calibration [9]. In a previous study by the FUSE-ML study group, a second, simpler CPM for the same outcomes was developed, with the goal of achieving similar predictive power with fewer input variables [10]. This goal was broadly achieved, and within that study, a small external validation of the SCOAP-CERTAIN tool (in three centers with a total of 298 patients, with the goal of comparing the performance of both CPMs) was carried out, again showing relatively robust discrimination but only fair calibration of both models [10].

Although CPMs in degenerative spine surgery could in theory be highly beneficial if added into the clinical context, rigorous external validation is necessary first to make sure that models are not "let loose too early" [8, 34]. It is especially necessary to test models not only in one or two small cohorts, but in a wide range of patient populations from multiple countries and continents. If performance then proves robust, it can be safely assumed that the CPM will achieve the expected predictive performance in real-world patients, and the model can be safely rolled out. In the present study, we performed such an extensive external validation. With AUCs between 0.70 and 0.72 for ODI, NRS-BP, and NRS-LP, we were able to show good discrimination metrics, comparable with those reported in Khor et al.'s initial internal validation study (0.66–0.79) [6]. Yet calibration, evaluated through diverse metrics, again demonstrated only moderate performance, as in the previous small external validation studies. In their internal validation, Khor et al. had documented calibration intercepts ranging from −0.02 to 0.16, along with slopes spanning 0.80–1.05, whereas we observed a wider range of intercepts (0.04 to 1.01) and less well calibrated slopes (0.72–0.87), even though the outcome distribution was similar to that of the development cohort (calibration intercepts are known to be highly dependent on differences in outcome distribution) [37]. In summary, there was substantial heterogeneity in the observed calibration slopes, along with a higher ECI, a measure of overall calibration defined as the average squared difference between the predicted probabilities and their grouped estimated observed probabilities [18], and clearly worse goodness-of-fit according to the method of Hosmer and Lemeshow [21].
The HL method divides the sample into (usually 10) groups of predicted probabilities and compares the resulting statistic to a chi-square distribution, with a p-value > 0.2 usually seen as an indication of fair calibration/goodness-of-fit [18, 21]. Of course, as is the goal of external validation, our external validation cohort represents a much more heterogeneous population than the development cohort, now including European and Asian individuals, which explains some of the lack of generalization in terms of calibration. In the realm of CPMs, calibration arguably carries a more significant role than discrimination alone [37]. This is because clinicians and patients are typically more concerned with predicted probabilities of a specific endpoint than with a binary classification; individual patients, after all, are not binary, but carry a spectrum of expected risks and benefits [7]. Hence, insufficient calibration poses a significant obstacle to the clinical and external applicability of prediction models. Another potential explanation for the poor generalization in terms of calibration lies in differing definitions of input variables: although our data collection adhered strictly to the definitions provided by Khor et al. [6], institutional protocols and inter-rater assessments still vary. This is one of the general limitations of CPMs based on tabulated medical data: because data must first undergo multiple stages of summarization and simplification by human healthcare providers, the overall predictive power can quickly reach "ceiling effects" due to input heterogeneity. This is another reason why external validation is so crucial: to test whether CPMs work just as well when applied in a real-world environment (effectiveness vs. efficacy).
In the future, direct inclusion of source data (such as MRI) without human coding, or automated data collection through natural language processing, might somewhat alleviate this bottleneck [38].
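The grouping logic of the HL test described above can be sketched in a few lines. This is an illustrative Python sketch, not the implementation used in this study; the closed-form chi-square tail probability shown is valid only for an even number of degrees of freedom, which holds for the usual 10 groups (df = 8) and for the toy example below (df = 2).

```python
import math

def chi2_sf(x, df):
    """Chi-square survival function; closed form valid for even df only."""
    m = df // 2
    return math.exp(-x / 2.0) * sum((x / 2.0) ** i / math.factorial(i)
                                    for i in range(m))

def hosmer_lemeshow(probs, labels, g=10):
    """HL statistic: sort by predicted probability, split into g groups,
    compare observed vs expected events and non-events; df = g - 2."""
    pairs = sorted(zip(probs, labels))
    n = len(pairs)
    chi2 = 0.0
    for k in range(g):
        grp = pairs[k * n // g:(k + 1) * n // g]
        exp_ev = sum(p for p, _ in grp)   # expected events in the group
        obs_ev = sum(y for _, y in grp)   # observed events in the group
        chi2 += (obs_ev - exp_ev) ** 2 / exp_ev
        chi2 += (obs_ev - exp_ev) ** 2 / (len(grp) - exp_ev)
    return chi2, chi2_sf(chi2, g - 2)

# Toy example: four perfectly calibrated risk strata (g = 4, df = 2),
# where observed event counts match the predicted probabilities exactly.
probs, labels = [], []
for p, events in [(0.2, 2), (0.4, 4), (0.6, 6), (0.8, 8)]:
    probs += [p] * 10
    labels += [1] * events + [0] * (10 - events)
chi2, pval = hosmer_lemeshow(probs, labels, g=4)   # chi2 near 0, p near 1
```

Since observed and expected counts agree in every stratum, the statistic is essentially zero and the p-value essentially one, i.e., no evidence of miscalibration, which is the pattern a p-value > 0.2 indicates.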

Still, even if not perfectly calibrated in a rigorous external validation study, the models published by Khor et al. [6] are admirable and show good generalization overall, especially in terms of discrimination performance; no signs of overfitting can be observed here. Overfitting manifests as a relevant difference between training and testing performance in terms of discrimination [35]. It is common for out-of-sample performance to be comparable to or slightly worse than the training performance for a well-fitted model, and the discrimination performance observed in our external validation study fits this norm well. It can be concluded that the SCOAP-CERTAIN model can safely be applied in clinical practice, although it must be kept in mind that predicted probabilities (calibration) should only be used as rough estimates, and that the binary predictions, while generalizing well (discrimination), still reach an AUC of only around 0.70.

In the end, in the realm of degenerative spine surgery, well-validated CPMs such as the SCOAP-CERTAIN [6] or FUSE-ML [10] models should only be used cautiously as rough estimates to offer an objective “second opinion” in the risk–benefit counseling of patients, but never as absolute red or green lights for surgical indications. We suggest that a future model should also be capable of predicting longer-term prognosis, as longer-term outcomes will improve the robustness of outcome data in lumbar patients. This could be achieved by incorporating more extended follow-up data and reducing short-term variability. These measures will lead to a more comprehensive understanding of patient trajectories, which is essential for effective clinical decision-making and enhanced calibration.

Additionally, it is crucial that future external validation studies report key metrics such as sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of the models assessed. Reporting these metrics enables better differentiation and validation of the predicted values, thereby enhancing the reliability and applicability of clinical prediction models in practice.

Limitations

Regarding the primary surgical indication, our cohort mostly comprised lumbar spinal stenosis, spondylolisthesis, and discogenic low back pain, whereas in Khor's cohort radiculopathy was the leading diagnosis, followed by stenosis and spondylolisthesis [6]. Of course, surgical indication and especially the chosen technique may vary between centers, which is exactly why multicenter external validation is important. Compared to the development cohort, we also included lateral techniques, which in turn results in a broader range of included patients. We used a mixed cohort (FUSE-ML) of partially prospectively and partially retrospectively collected data. It is known that the difference between these two strategies has a relevant influence on collected data – especially on complications, which fortunately are not a topic here – as well as on missingness, and could therefore affect the final analysis, too [39]. Still, on the other hand, the fact that the models generalized relatively well on these heterogeneous data is the point of external validation and further demonstrates the robustness of the Khor et al. [6] models. Due to the lack of long-term (> 2 years follow-up) data, even with good calibration and discrimination performance, we are only able to predict short- and mid-term outcomes. More long-term evaluation of CPMs is necessary. The validated models also do not predict surgical risks such as perioperative complications or long-term adjacent segment degeneration, information which would be particularly useful in risk–benefit discussions. The fact that the FUSE-ML and SCOAP-CERTAIN models also cannot provide a prognosis of the natural history or of conservative treatment in these degenerative conditions means that they only provide half of the answer when deciding between surgical and conservative treatment strategies.

Conclusion

Utilizing data from a multinational registry, we externally validated the SCOAP-CERTAIN prediction tool. While the models demonstrated good discrimination, the calibration of predicted probabilities was only fair, necessitating caution in applying them in clinical practice. We propose that future CPMs consider predicting longer-term prognosis for this patient population, and we emphasize the importance of rigorous external validation, robust calibration, and sound reporting.