Introduction

For the patient, the most important question to be answered before surgery is: “How much better will I be?” This is often a difficult question for the surgeon to answer reliably, as patient-reported outcome after surgery for degenerative spinal conditions demonstrates major heterogeneity. At follow-up in Swespine, between 10 and 40% of the patients, depending on the preoperative diagnosis, may still suffer from some spine-related disability and pain [1].

With the development of national quality registers for spine surgery, such as the Swedish “Swespine”, surgeons and patients now have access to aggregated outcome data, which serve as a rough indication of what an individual patient may achieve after surgery. This is, for example, demonstrated in three Nordic cooperation studies where data from Swespine, NorSpine (Norway) and Danespine (Denmark) are presented [2,3,4].

It is obvious, however, that translating these data, which are presented on a group level, to individualized assessment of surgical success, may be difficult. This is because several socio-demographic characteristics and other baseline variables, not always known, will modify the outcome.

In 2013, using Swespine data, an analytic project was initiated, together with Region Stockholm (https://www.sll.se/om-regionstockholm/Information-in-English1/). The initial aim was to present case-mix-adjusted outcome data publicly for a fair comparison and benchmarking of individual spine centres; https://vardenisiffror.se/jamfor/kallsystem (only in Swedish). This led to the development of a tool for prediction of individual outcome after surgery for lumbar and cervical degenerative conditions. In 2017, we could present the Dialogue Support for members of the Swedish Society of Spinal Surgeons, and it was later made publicly available; http://www.4s.nu/4s-f%C3%B6rening/dialogst%C3%B6d-44852774 (only in Swedish).

The Dialogue Support is an interactive web-based instrument to be used in shared decision-making with the patient when discussing surgery for different spinal disorders. After being translated into English and discussed with the Eurospine board, it was made publicly available on the Eurospine home page in October 2020; https://app.molnify.com/app/7wqw6owgrznr76bkaqc6l4bs7q.

In Fig. 1a and b, two examples of prediction of outcome using the Dialogue Support are demonstrated. To the left on the screen picture, the patient’s individual values of predictor variables are recorded. The right side demonstrates the predicted outcomes for that patient based on the patient’s characteristics combined with algorithms describing the relationship between patient characteristics and outcomes based on large amounts of historical data. The pie chart shows predicted probabilities for the five alternatives of pain change and in the banner at the top of the screen, the predicted pain change and satisfaction with outcome are dichotomously summarized into a percentage of success and satisfaction (for description of outcome variables see Methods). The tool can be used by the reader using the following link to the Eurospine Home page: http://www.eurospine.org

Fig. 1

a Exemplifying case: The Dialogue Support used for a 40-year-old woman with a lumbar disc herniation; leg pain for 3 months, NRS (Numeric Rating Scale) = 7/10, no back pain, quality of life (EQ-5D) = 0.3/1.0, function (Oswestry Disability Index) = 50/100. She has had no previous spine surgery and is a non-smoker. Assessment: Based on the prediction model, there is an 89% probability for a patient with these characteristics to report a successful outcome (dark and light green sectors added). b The “same patient” is now a smoker, has had previous spine surgery, is 60 years old and also has back pain (NRS) 6/10 and comorbidities. Assessment: Based on the prediction model, there is a 56% probability for a patient with these characteristics to report a successful outcome

Patient-centred outcome prediction is a growing focus in spine research, producing several reports annually. The majority discuss prediction in terms of Patient-Reported Outcome Measures (PROMs) [5,6,7,8,9,10,11,12,13,14,15]; a few deal with adverse events [16, 17], length of stay [18], revision surgery [19] or return to work [20]. Among the PROM analyses, the outcome measure is usually dichotomized. The predictive modelling differs, but the most frequently used approach is based on multivariate logistic regression algorithms [5, 6, 9, 11, 13, 14]. In recent years, machine learning has gained interest as an alternative [9, 12, 21, 22]. Further details of available reports related to degenerative spine surgery are presented in Table 1.

Table 1 Compilation of recently published prediction models

The aim of the current study is to evaluate the predictive precision of the Dialogue Support.

Methods

The dialogue support, www.eurospine.org

The Dialogue Support predicts outcome 1 year after surgery for patients with selected spinal disorders. The underlying prediction models have been trained on a sizable body of data collected throughout Sweden over a 10-year period and are updated every year. The data thus always include outcomes that are no more than 1 year old.

The prediction is presented as the proportion of a specific patient group achieving a certain outcome after surgery, as reported in PROMs after 1 year, here global assessment (“Totally pain free-Much better-Somewhat better-Unchanged-Worse”) [23] and satisfaction. Each prediction algorithm (one per diagnostic group) is based on approximately 2000–12,000 individuals, depending on the diagnosis and baseline profile. The included diagnosis groups are lumbar disc herniation (LDH), lumbar spinal stenosis (LSS), lumbar degenerative disc disease (DDD) and cervical radiculopathy (CR), which is caused by either disc herniation or foraminal stenosis.

Data source

Swespine in its current form was started in 1998 and to date includes approximately 155 000 operated patients (i.e. index procedures used for predictive evaluation) with degenerative conditions in the lumbar and cervico-thoracic spine. National coverage is 95%, completeness 85% and 1-year follow-up over 70%.

Participants in the current study

Patients who had their index surgery between 2007-01-01 and 2019-04-01, and who had one-year follow-up, were included in the analyses, resulting in a total of 87 494 patients: 23 087 with LDH, 51 390 with LSS, 5 872 with DDD and 7 154 with CR. All patients have given consent prior to registration in Swespine, including information that their data will be used in clinical studies and that they can withdraw their individual data from the registry at any time. The evaluation procedure follows the TRIPOD recommendations (Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) [24].

All analyses were performed separately for the four different subgroups. Patients with missing data on the outcome variable or any of the explanatory variables were excluded. No imputations for missing data were performed.

Outcome

The two outcome measures used were:

A. Global Assessment (GA) is an ordinal Likert-type scale with six response alternatives to the question “How is your back/leg pain today as compared to before the surgery?”, where 0 = no back/leg pain before the surgery, 1 = completely pain free, 2 = much better, 3 = somewhat better, 4 = unchanged, 5 = worse [23]. Patients responding with the 0-alternative are excluded from the analyses. For the cervical spine, the question relates to neck/arm pain. Leg pain is the outcome in the LSS and LDH groups, back pain in the DDD group and arm pain in the CR group. In the summarized presentation, GA was dichotomized into success (response alternatives 1 and 2) and failure (response alternatives 3–5).

B. Satisfaction (SAT) with treatment outcome is an ordinal Likert-type scale with three response alternatives (satisfied, hesitant and dissatisfied). In the analysis, this variable was dichotomized into satisfied and not satisfied (hesitant or dissatisfied).
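The dichotomization of the two outcome measures can be illustrated with a minimal sketch. The response codes follow the description above; the function names and the string labels used for satisfaction are hypothetical and are not taken from Swespine.

```python
from typing import Optional

# GA response codes as described in the text.
GA_SUCCESS = {1, 2}      # completely pain free, much better
GA_FAILURE = {3, 4, 5}   # somewhat better, unchanged, worse

# Assumed string labels for the three satisfaction alternatives.
SAT_LEVELS = {"satisfied", "hesitant", "dissatisfied"}


def dichotomize_ga(response: int) -> Optional[bool]:
    """Return True for success, False for failure, None if excluded (code 0)."""
    if response == 0:
        return None  # no pain before surgery: excluded from the analyses
    if response in GA_SUCCESS:
        return True
    if response in GA_FAILURE:
        return False
    raise ValueError(f"unknown GA response code: {response}")


def dichotomize_sat(response: str) -> bool:
    """Return True for satisfied, False for hesitant or dissatisfied."""
    if response not in SAT_LEVELS:
        raise ValueError(f"unknown satisfaction response: {response}")
    return response == "satisfied"
```

Returning `None` for the 0-alternative keeps excluded patients distinguishable from failures, so they can be dropped before any proportion is calculated.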

Predictors

A large number of socio-demographic and clinical baseline factors deemed relevant for predicting outcome were evaluated for inclusion in the models. Eligible predictors were all demographic and baseline data in Swespine. Link (in Swedish); http://www.4s.nu/swespine-formul%C3%A4r-44871294.

Statistical analysis

Development of algorithms underlying the prediction model was carried out in three steps.

1. Model development and variable selection

For the analysis of GA, an ordered probit model was estimated to account for the five levels in the outcome variable. The analysis of satisfaction was based on a logistic regression model. Backward variable selection based on Akaike information criterion was employed to select variables for inclusion in the final model. Model selection was performed out of sample by randomly splitting the data into a training data set (80% of sample) and a test data set (20% of sample).
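To illustrate how an ordered probit model turns a single patient score into probabilities for the five GA alternatives, the following is a minimal sketch. The cutpoints and the linear predictor are invented for illustration and are not the estimated Swespine parameters; the sign convention (a higher score shifts probability towards worse categories) is likewise an assumption.

```python
import math
from typing import List, Sequence


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ordered_probit_probs(linear_predictor: float,
                         cutpoints: Sequence[float]) -> List[float]:
    """Category probabilities P(Y = k) = Phi(c_k - eta) - Phi(c_{k-1} - eta).

    With K - 1 increasing cutpoints this returns K probabilities,
    ordered from the first category (completely pain free) to the last (worse).
    """
    if list(cutpoints) != sorted(cutpoints):
        raise ValueError("cutpoints must be in increasing order")
    cumulative = [normal_cdf(c - linear_predictor) for c in cutpoints] + [1.0]
    probs = []
    previous = 0.0
    for value in cumulative:
        probs.append(value - previous)
        previous = value
    return probs


# Hypothetical values: four cutpoints give five GA categories.
example_probs = ordered_probit_probs(0.0, [-1.0, 0.5, 1.2, 2.0])
# "Success" as summarized in the banner: pain free + much better.
example_success = example_probs[0] + example_probs[1]
```

Under these assumptions, the five values correspond to the sectors of the pie chart, and the sum of the first two corresponds to the dichotomized probability of success.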

2. Model validation

To evaluate the accuracy of the model, calibration plots and receiver operating characteristic (ROC) curves were used [25]. In the calibration plots, patients were divided into 20 equally sized groups based on the predicted value of their outcome, and the average actual outcome within each group was then calculated. The ROC curves assess the ability to classify patients into the correct group as the discrimination threshold is varied. In the ROC plots, the dotted line indicates a model that predicts no better than chance, described by an area under the curve (AUC/c-statistic) of 0.5. A line extending to the upper left corner indicates perfect separation, with an AUC/c-statistic of 1.0. In these analyses, GA was dichotomized into success (response alternatives 1, 2) and failure (response alternatives 3, 4, 5).
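The two validation measures can be sketched as follows. This is an illustrative pure-Python version, not the R code used in the study; the function names are hypothetical, and the AUC is computed by direct pairwise comparison, which is simple but slow for large samples.

```python
from typing import List, Sequence, Tuple


def calibration_groups(predicted: Sequence[float],
                       observed: Sequence[int],
                       n_groups: int = 20) -> List[Tuple[float, float]]:
    """Rank patients by predicted probability, split them into n_groups
    nearly equal groups and return (mean predicted, observed proportion)
    for each group."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    if n < n_groups:
        raise ValueError("need at least one patient per group")
    result = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(o for _, o in chunk) / len(chunk)
        result.append((mean_pred, obs_rate))
    return result


def auc(predicted: Sequence[float], observed: Sequence[int]) -> float:
    """Probability that a randomly chosen success has a higher predicted
    value than a randomly chosen failure; ties count as one half."""
    successes = [p for p, o in zip(predicted, observed) if o == 1]
    failures = [p for p, o in zip(predicted, observed) if o == 0]
    if not successes or not failures:
        raise ValueError("need at least one success and one failure")
    wins = 0.0
    for s in successes:
        for f in failures:
            if s > f:
                wins += 1.0
            elif s == f:
                wins += 0.5
    return wins / (len(successes) * len(failures))
```

Plotting the pairs returned by `calibration_groups` against the line of identity gives a calibration plot, while `auc` returns 0.5 for a model that cannot separate successes from failures and 1.0 for perfect separation.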

3. Model estimation on entire sample

Once variable selection had been performed in step 1 and validated in step 2, the final models were re-estimated on the entire data set, i.e. both the training data and the test data, to obtain the largest possible number of observations for the final model parameter estimation.

A separate model is estimated for each of the 4 diagnoses of patients and for each of the 2 outcomes, resulting in 8 ROC and 8 calibration plots.

Data management and statistical analysis were conducted using R Statistical Software version 4.0.0.

Results

Description of the study population

The eligible study population was reduced because of dropouts at follow-up and missing data of predictor variables as shown in Table 2. In general, the differences of baseline data between the diagnostic groups were small or moderate. The DDD group had the longest duration of pain and the highest frequency of earlier spine surgery (Table 3).

Table 2 Study population stratified by diagnosis group and outcome measure
Table 3 Descriptive statistics of the study population

Assessment of predictive ability

I. Calibration plots

The plots demonstrate how well predicted probabilities agree with actual outcome for subgroups with different case mix. Observations were ranked according to the predicted value and grouped into 20 categories in the lumbar diagnostic groups and into 5 categories in the cervical group. The proportion with actual successful outcome (y-axis) and the predicted value (x-axis) were calculated for each category and plotted against each other. The solid line represents perfect calibration and the dotted line represents the actual results. The concordance between prediction and actual outcome, measured with GA for success and with satisfaction, was high on a group level, with small differences between diagnostic groups. Satisfaction in the DDD group was least concordant. The findings are demonstrated in Fig. 2.

II. ROC curves

The ROC curve demonstrates the ability of the model to separate successful cases from failures. As shown in Fig. 3, the ability of the prediction models to discriminate between successes and failures on an individual level is fair, with AUC ranging from 0.67 to 0.68. There were slight differences in model fit between the diagnostic groups. For satisfaction, the AUC value varied more between diagnostic groups, from 0.60 for DDD to 0.67 for CR.

III. Model estimation on entire sample

Fig. 2

Calibration plots of success in a the LDH group, b the LSS group, c the DDD group and d the CR group; of satisfaction in e the LDH group, f the LSS group, g the DDD group and h the CR group

Fig. 3

ROC curves of satisfaction in a the LDH group, b the LSS group, c the DDD group and d the CR group; of success in e the LDH group, f the LSS group, g the DDD group and h the CR group

Table 4 presents the effect of each predictor on the two outcomes for the four different patient groups. Indicators of lower socio-economic status, such as smoking, disability pension and unemployment, were consistently associated with lower satisfaction and less pain improvement. Previous spine surgery was a negative predictor for all diagnostic groups. Short duration of back/neck pain was associated with more pain improvement and higher satisfaction in all diagnostic groups. Short duration of leg/arm pain was associated with more pain improvement and higher satisfaction in the LDH, LSS and CR groups. Age and gender were of minor importance, as was ODI, whereas a higher quality of life (EQ-5D) at baseline predicted higher satisfaction at follow-up for all but the CR group.

Table 4 Impact of predictors on the two outcomes in the four diagnostic groups

Discussion

The Dialogue Support, based on the national Swedish quality register “Swespine”, presents the predicted outcome after surgery for degenerative spinal disorders. The outcome is presented as a percentage of outcome according to global assessment of pain (GA) and satisfaction with outcome based on the patient’s characteristics combined with algorithms describing the relationship between patient characteristics and outcomes based on large amounts of historical data.

The calibration plots of subgroups demonstrate a high degree of concordance, with minor differences between diagnostic groups. The message to the patient can be expressed as follows: in the group of previously operated individuals with a baseline profile similar to yours, a certain percentage reported “complete relief of pain” after 1 year, another percentage reported “much better”, a third percentage “somewhat better”, etc. This is visualized in the pie diagram.

On an individual level, as estimated with ROC curves, the precision of the predictive model was fair. For global assessment of pain, the AUC value ranged from 0.67 to 0.68. For satisfaction, it ranged from 0.60 (DDD) to 0.66 (LDH). Other reports with PROMs as outcome measure and logistic regression as analytic method describe AUC values and c-indexes ranging from 0.64 to 0.79, mostly tested on single-centre cohorts [5, 6, 9, 11, 14].

In recent years, new computer-based analytic methods, often called “machine learning” and “deep learning”, have been proposed as possibly more powerful methods of data acquisition and analysis. The suggested advantage of these techniques has not yet been established [26]. Reported AUC values or c-indexes with these analytic methods range from 0.59 to 0.90 [8, 12, 21, 22]. Probably the more critical aspect of outcome prediction, and of the possibilities of improving precision, is the addition of more predictor variables. This appears to be more important than focusing on analytic techniques, although different analytic models may change the predictive potential of a particular data set, as has been demonstrated in a hip and knee arthroplasty cohort [27].

The limitation of the current model as such is related to the baseline and socio-demographic variables at hand. Evaluation and development of the Dialogue Support is a continuous process. Changes in surgical techniques, processes, training, support, equipment and medication may affect outcome. This is accounted for by yearly updating of the database. The yearly updating of the database and the introduction of new baseline variables are also expected to increase the precision of the model on the individual level over time. Dropouts may tend to have worse outcome than responders, also in registers [28], so there is a possible risk of overestimating success in the model. However, this is still an unresolved question [29].

A possible selection bias could be caused by the proportion of patients not having their index surgery recorded. However, there are only exceptional occurrences of patients opting out from Swespine. The major cause of loss to registration is deficient routines in some of the participating centres, not dependent on individual patients. In our interpretation, this does not cause any bias in the national register. A second possible cause of selection bias would be loss to follow-up. This has been assessed in a recent publication [28].

When it comes to application of the Dialogue Support in other countries, the Swedish reference database can be a limitation to generalizability. Spine patients in different countries have different cultural and social conditions, which may affect the predictive ability of the model. Interpretation of predictions should be done with this in mind, until validation tests have been performed. Ideally the model should be tested on a national, or other large, database in the country in question.

The strength of the Dialogue Support is related to the large reference population in the national Swespine Registry, which has a coverage of 95%, a completeness of 85% and a one-year follow-up of more than 70%. Thus, the data on which the prediction model is based represent the “degenerative spine population” in Sweden over the last ten years well. Predictions can thus be generalized to the entire nation and applied to all spine centres. Dropouts and missing data may imply a limitation to the generalizability, which we hope to reduce with increasing web-based registration.

In the aggregated perspective, the Dialogue Support, acting as “one piece of the puzzle”, can support the clinician’s clinical experience and judgement. In popular terms, using the Dialogue Support, it is possible to describe outcome as a certain probability of having a successful and satisfactory outcome after a proposed surgical intervention.

Thus, the Dialogue Support offers both patient and surgeon the opportunity to contemplate and discuss the probability of benefit and risk based on more substantial evidence than the experience and conjecture of an individual surgeon. It is also an interesting opportunity to start international research cooperation based on the Dialogue Support.

Our ambition is to validate these models further once new data become available (data are continuously collected in Swespine, approximately 10 000 patients per year), and we can also foresee validating models in other countries where similar (albeit not identical) data are collected.

Conclusion

The Dialogue Support is a useful prediction tool with an accuracy which is high on a group level and moderate on an individual level. It can serve as an aid to both patient and surgeon when discussing a surgical treatment of degenerative conditions in the lumbar and cervical spine.