Prediction of outcome after spinal surgery—using The Dialogue Support based on the Swedish national quality register

To evaluate the predictive precision of the Dialogue Support, a tool for additional help in shared decision-making before surgery of the degenerative spine. Data in Swespine (the Swedish national quality registry) on patients operated between 2007 and 2019 formed the basis for the development of prediction algorithms based on logistic regression analyses, in which socio-demographic and baseline variables were included. The algorithms were tested in four diagnostic groups: lumbar disc herniation, lumbar spinal stenosis, degenerative disc disease and cervical radiculopathy. By random selection, 80% of the study population was used for the prediction of outcome, and the predictions were then tested against the actual outcome of the remaining 20%. Outcome measures were global assessment of pain (GA) and satisfaction with outcome. Calibration plots demonstrated a high degree of concordance on a group level. On an individual level, ROC curves showed moderate predictive capacity with AUC (area under the curve) values of 0.67–0.68 for global assessment and 0.6–0.67 for satisfaction. The Dialogue Support can serve as an aid to both patient and surgeon when discussing and deciding on surgical treatment of degenerative conditions in the lumbar and cervical spine.


Introduction
The most important question for the patient to have answered before surgery is: "how much better will I be?" This is often a difficult question for the surgeon to answer reliably, as patient-reported outcome after surgery for degenerative spinal conditions demonstrates major heterogeneity. At follow-up in Swespine, between 10 and 40% of the patients, depending on the preoperative diagnosis, may still suffer from some spine-related disability and pain [1].
With the development of national quality registers for spine surgery, such as the Swedish "Swespine", surgeons and patients have access to aggregated outcome data, which serve as a rough indication of what an individual patient may achieve after surgery. This is, for example, demonstrated in three Nordic cooperation studies where data from Swespine, NorSpine (Norway) and Danespine (Denmark) are presented [2][3][4].
It is obvious, however, that translating these data, which are presented on a group level, to individualized assessment of surgical success, may be difficult. This is because several socio-demographic characteristics and other baseline variables, not always known, will modify the outcome.
In 2013, using Swespine data, an analytic project was initiated together with Region Stockholm (https://www.sll.se/om-regionstockholm/Information-in-English1/). The initial aim was to present case-mix-adjusted outcome data publicly for a fair comparison and benchmarking of individual spine centres; https://vardenisiffror.se/jamfor/kallsystem (only in Swedish). This led to the development of a tool for prediction of individual outcome after surgery for lumbar and cervical degenerative conditions. In 2017, we could present the Dialogue Support to members of the Swedish Society of Spinal Surgeons, and it was later made publicly available; http://www.4s.nu/4s-f%C3%B6rening/dialogst%C3%B6d-44852774 (only in Swedish).
The support is an interactive web-based instrument to be used in shared decision-making with the patient when discussing surgery for different spinal disorders. After being translated to English and discussed with the Eurospine board, it was made publicly available at the Eurospine home page in October 2020; https://app.molnify.com/app/7wqw6owgrznr76bkaqc6l4bs7q.
In Fig. 1a and b, two examples of outcome prediction using the Dialogue Support are shown. On the left of the screenshot, the patient's individual values of the predictor variables are recorded. The right side shows the predicted outcomes for that patient, obtained by combining the patient's characteristics with algorithms that describe the relationship between patient characteristics and outcomes in large amounts of historical data. The pie chart shows predicted probabilities for the five alternatives of pain change, and in the banner at the top of the screen the predicted pain change and satisfaction with outcome are dichotomously summarized into a percentage of success and satisfaction (for a description of the outcome variables, see Methods). The reader can try the tool via the Eurospine home page: http://www.eurospine.org.
Patient-centred outcome prediction is a growing focus in spine research, producing several reports annually. The majority discuss prediction in terms of Patient-Reported Outcome Measures (PROMs) [5][6][7][8][9][10][11][12][13][14][15]; a few deal with adverse events [16,17], length of stay [18], revision surgery [19] or return to work [20]. Among the PROM analyses, the outcome measure is usually dichotomized. The predictive modelling differs, but the most frequently used approach is based on multivariate logistic regression algorithms [5,6,9,11,13,14]. In recent years, machine learning has gained interest as an alternative [9,12,21,22]. Further details of available reports related to degenerative spine surgery are presented in Table 1.
The aim of the current study is to evaluate the predictive precision of the Dialogue Support.

Methods
The Dialogue Support, www.eurospine.org
The Dialogue Support predicts outcome 1 year after surgery for patients with selected spinal disorders. The underlying prediction models have been trained on a sizable body of data collected throughout Sweden during a 10-year period and are updated every year. The data thus always include outcomes that are no more than 1 year old.
The prediction is demonstrated as a proportion of a specific patient group achieving a certain outcome after surgery and answering PROM after 1 year, here global assessment ("Totally pain free-Much better-Somewhat better-Unchanged-Worse") [23] and satisfaction. Each prediction algorithm (one per diagnostic group) is based on approximately 2000-12,000 individuals, depending on the diagnosis and baseline profile. The included diagnosis groups are lumbar disc herniation (LDH), lumbar spinal stenosis (LSS), lumbar degenerative disc disease (DDD) and cervical radiculopathy (CR), which is caused by either disc herniation or foraminal stenosis.

Data source
Swespine in its current form was started in 1998 and to date includes approximately 155 000 operated patients (i.e. index procedures used for predictive evaluation) with degenerative conditions in the lumbar and cervico-thoracic spine. National coverage is 95%, completeness 85% and 1-year follow-up over 70%.

Participants in the current study
Patients who had their index surgery between 2007-01-01 and 2019-04-01, and who had one-year follow-up, were included in the analyses, resulting in a total of 87 494 patients: 23 087 with LDH, 51 390 with LSS, 5 872 with DDD and 7 154 with CR. All patients have given consent prior to registration in Swespine, including information that their data will be used in clinical studies and that they can withdraw their individual data from the registry at any time. The evaluation procedure follows the TRIPOD recommendations (Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) [24].
All analyses were performed separately for the four different subgroups. Patients with missing data on the outcome variable or any of the explanatory variables were excluded. No imputations for missing data were performed.

Outcome
Fig. 1 a Exemplifying case: The Dialogue Support used for a woman, 40 years, with a lumbar disc herniation; leg pain for 3 months, NRS (Numeric Rating Scale) = 7/10, no back pain, quality of life (EQ-5D) = 0.3/1.0, function (Oswestry Disability Index) = 50/100. She has had no previous spine surgery and is a non-smoker. Assessment: Based on the prediction model, there is an 89% probability for a patient with these characteristics to report a successful outcome (dark and light green sectors added). b The "same patient" is now a smoker, has had previous spine surgery, is 60 years old and also has back pain (NRS) 6/10 and comorbidities. Assessment: Based on the prediction model, there is a 56% probability for a patient with these characteristics to report a successful outcome
The two outcome measures used were:
A. Global assessment (GA) of pain, an ordinal Likert-type scale with six response alternatives: "How is your back/leg pain today as compared to before the surgery?", where 0 represents no back/leg pain before the surgery, 1 = completely pain free, 2 = much better, 3 = somewhat better, 4 = unchanged, 5 = worse [23]. Patients responding with the 0-alternative were excluded from the analyses. For the cervical spine, the question relates to neck/arm pain. Leg pain is the outcome in the LSS and LDH groups, back pain in the DDD group and arm pain in the CR group. In the summarized presentation, GA was dichotomized into success (response alternatives 1 and 2) and failure (response alternatives 3-5).
B. Satisfaction (SAT) with treatment outcome, an ordinal Likert scale with three response alternatives (satisfied, hesitant and dissatisfied). In the analysis, this variable was dichotomized into satisfied and not satisfied (hesitant or dissatisfied).
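The dichotomization of the two outcome measures can be illustrated with a minimal Python sketch. The function names and the helper `success_rate` are hypothetical and are not part of the Dialogue Support; only the coding of the response alternatives is taken from the description above.

```python
from typing import Iterable, Optional

# GA response alternatives as described in the text (0 is excluded from analysis).
GA_LABELS = {
    0: "no pain before surgery",
    1: "completely pain free",
    2: "much better",
    3: "somewhat better",
    4: "unchanged",
    5: "worse",
}
SAT_LEVELS = {"satisfied", "hesitant", "dissatisfied"}


def dichotomize_ga(response: int) -> Optional[bool]:
    """True = success (1-2), False = failure (3-5), None = excluded (0)."""
    if response not in GA_LABELS:
        raise ValueError(f"unknown GA response: {response!r}")
    if response == 0:
        return None
    return response <= 2


def dichotomize_sat(response: str) -> bool:
    """True = satisfied, False = not satisfied (hesitant or dissatisfied)."""
    key = response.strip().lower()
    if key not in SAT_LEVELS:
        raise ValueError(f"unknown satisfaction response: {response!r}")
    return key == "satisfied"


def success_rate(ga_responses: Iterable[int]) -> float:
    """Share of successes among the responses that are not excluded."""
    kept = [d for d in map(dichotomize_ga, ga_responses) if d is not None]
    if not kept:
        raise ValueError("no analysable GA responses")
    return sum(kept) / len(kept)
```

For example, `success_rate([0, 1, 2, 3, 4, 5])` drops the 0-response and counts two successes among the remaining five responses.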

Predictors
A large number of socio-demographic and clinical baseline factors deemed relevant for predicting outcome were evaluated for inclusion in the models. Eligible predictors were all demographic and baseline data in Swespine. Link (in Swedish); http:// www. 4s. nu/ swesp ine-formul% C3% A4r-44871 294.

Statistical analysis
Development of algorithms underlying the prediction model was carried out in three steps.

Model development and variable selection
For the analysis of GA, an ordered probit model was estimated to account for the five levels in the outcome variable. The analysis of satisfaction was based on a logistic regression model. Backward variable selection based on the Akaike information criterion (AIC) was employed to select variables for inclusion in the final model. Model selection was performed out of sample by randomly splitting the data into a training data set (80% of the sample) and a test data set (20% of the sample).
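To illustrate how an ordered probit model yields one probability per GA level, the following Python sketch computes the five category probabilities from a linear predictor and four cutpoints. The cutpoint values and the linear predictor used in the example are invented for illustration and are not estimates from Swespine; the parameterization (category 1 = best, a higher linear predictor shifts probability towards worse categories) is an assumption chosen to match the GA coding.

```python
import math
from typing import List, Sequence


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ordered_probit_probs(eta: float, cutpoints: Sequence[float]) -> List[float]:
    """P(Y = k) = Phi(c_k - eta) - Phi(c_{k-1} - eta), with c_0 = -inf, c_K = +inf."""
    cuts = list(cutpoints)
    if cuts != sorted(cuts):
        raise ValueError("cutpoints must be in increasing order")
    bounds = [-math.inf] + cuts + [math.inf]
    cdf = [normal_cdf(b - eta) for b in bounds]
    return [cdf[i + 1] - cdf[i] for i in range(len(bounds) - 1)]


# Hypothetical values, for illustration only.
example_cutpoints = [-0.8, 0.4, 1.2, 2.0]   # four cutpoints -> five GA levels
example_probs = ordered_probit_probs(eta=0.0, cutpoints=example_cutpoints)

# Dichotomized "success" = completely pain free (1) or much better (2).
example_success = example_probs[0] + example_probs[1]
```

Under this parameterization the dichotomized success probability equals the normal CDF evaluated at the second cutpoint minus the linear predictor, so a single fitted model can supply both the five-sector pie chart and the summarized percentage.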

Model validation
To evaluate the accuracy of the model, calibration plots and receiver operating characteristic (ROC) curves were used [25]. In the calibration plots, patients were divided into 20 equally sized groups based on the predicted value for their outcome, and the average actual outcome within each group was then calculated. Diagnostic ability was evaluated using ROC curves, which assess the ability to classify patients into the correct group as the discrimination threshold is varied. In the ROC plots, the dotted line indicates a model that predicts no better than chance, described by an area under the curve (AUC/c-statistic) of 0.5. A line extending to the upper left corner indicates perfect separation, with an AUC/c-statistic of 1.0. In these analyses, GA was dichotomized into success (response alternatives 1, 2) and failure (response alternatives 3, 4, 5).
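The AUC/c-statistic described above can be read as the probability that a randomly chosen success received a higher predicted value than a randomly chosen failure. The Python sketch below computes it directly from that definition; the function name `roc_auc` is hypothetical, and the pairwise loop is chosen for clarity rather than speed (it is not the routine used in the study, which was carried out in R).

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the share of (success, failure) pairs ranked correctly.

    Ties between a success and a failure count as half a correct pair.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    successes = [s for s, y in zip(scores, labels) if y]
    failures = [s for s, y in zip(scores, labels) if not y]
    if not successes or not failures:
        raise ValueError("both successes and failures are required")
    correct = 0.0
    for s in successes:
        for f in failures:
            if s > f:
                correct += 1.0
            elif s == f:
                correct += 0.5
    return correct / (len(successes) * len(failures))
```

With this definition, identical predictions for every patient give 0.5 (chance level) and a model that ranks every success above every failure gives 1.0.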

Model estimation on entire sample
Once variable selection had been performed in step 1 and validated in step 2, the final models were re-estimated on the entire data set, i.e. both the training data and the test data, to get the largest possible number of observations for the final model parameter estimation.
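The backward variable selection used in step 1 can be sketched generically as follows. This is a simplified illustration, not the study's R code: the function `backward_select` and the scoring callable `aic_of` are hypothetical, and the toy scoring function stands in for fitting a real model (AIC = 2k - 2 log-likelihood) with invented log-likelihood gains per predictor.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def backward_select(
    candidates: Iterable[str],
    aic_of: Callable[[FrozenSet[str]], float],
) -> Tuple[List[str], float]:
    """Drop one predictor at a time as long as doing so lowers the AIC."""
    current = frozenset(candidates)
    best_aic = aic_of(current)
    while current:
        # Score every model that has exactly one predictor fewer.
        trials = [(aic_of(current - {v}), v) for v in sorted(current)]
        trial_aic, dropped = min(trials)
        if trial_aic >= best_aic:
            break  # no single removal improves the criterion
        current = current - {dropped}
        best_aic = trial_aic
    return sorted(current), best_aic


# Toy scoring function with invented log-likelihood gains (illustration only).
TOY_GAIN = {"smoking": 5.0, "previous_surgery": 4.0, "age": 0.2, "sex": 0.1}


def toy_aic(variables: FrozenSet[str]) -> float:
    log_likelihood = sum(TOY_GAIN[v] for v in variables)
    return 2 * len(variables) - 2 * log_likelihood


selected, selected_aic = backward_select(TOY_GAIN, toy_aic)
```

In the toy example the two weak predictors are removed because each costs more in the penalty term than it adds in fit, while the two strong predictors are retained.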
A separate model is estimated for each of the 4 diagnoses of patients and for each of the 2 outcomes, resulting in 8 ROC and 8 calibration plots.
Data management and statistical analysis were conducted using R Statistical Software version 4.0.0.

Description of the study population
The eligible study population was reduced because of dropouts at follow-up and missing data on predictor variables, as shown in Table 2. In general, the differences in baseline data between the diagnostic groups were small or moderate. The DDD group had the longest duration of pain and the highest frequency of earlier spine surgery (Table 3).
Fig. 2 The plots demonstrate how well predicted probabilities agree with actual outcome for subgroups with different case mix. Observations were ranked according to the predicted value and grouped in 20 categories in the lumbar diagnostic groups and in 5 categories in the cervical group. The proportion with actual successful outcome (y-axis) and the predicted value (x-axis) were calculated for each category and plotted against each other. The solid line represents perfect calibration and the dotted line represents the actual results
I. Calibration plots
The concordance between prediction and actual outcome, measured with GA for success and with satisfaction, was high on a group level, with small differences between diagnostic groups. Satisfaction in the DDD group was least concordant. The findings are demonstrated in Fig. 2.
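The grouping behind a calibration plot can be sketched in a few lines of Python. The function name `calibration_groups` is hypothetical, and the way group boundaries are drawn (near-equal sizes after sorting by predicted value) is an assumption, since the exact binning routine is not described in the text.

```python
from typing import List, Sequence, Tuple


def calibration_groups(
    predicted: Sequence[float],
    observed: Sequence[int],
    n_groups: int = 20,
) -> List[Tuple[float, float]]:
    """Return (mean predicted probability, observed success rate) per group.

    Observations are ranked by predicted value and cut into n_groups groups
    of (nearly) equal size, e.g. 20 for lumbar and 5 for cervical data.
    """
    n = len(predicted)
    if n != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not 1 <= n_groups <= n:
        raise ValueError("n_groups must be between 1 and the sample size")
    order = sorted(range(n), key=lambda i: predicted[i])
    rows = []
    for g in range(n_groups):
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        mean_pred = sum(predicted[i] for i in idx) / len(idx)
        success_rate = sum(1 for i in idx if observed[i]) / len(idx)
        rows.append((mean_pred, success_rate))
    return rows
```

Plotting the second element of each pair against the first gives the calibration curve; points on the diagonal correspond to perfect calibration.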

II. ROC curves
The ROC curve demonstrates the ability of the model to separate successful cases from failures. As shown in Fig. 3, the ability of the prediction models to discriminate between successes and failures on an individual level is fair, with AUC for GA ranging from 0.67 to 0.68. There were slight differences in model fit between the diagnostic groups. For satisfaction, the AUC value varied more between diagnostic groups, from 0.6 for DDD to 0.67 for CR.
Table 4 presents the effect of each predictor on the two outcomes for the four different patient groups. Indicators of lower socio-economic status, such as smoking, disability pension and unemployment, were consistently associated with lower satisfaction and less pain improvement. Previous spine surgery was a negative predictor for all diagnostic groups. Short duration of back/neck pain was associated with more pain improvement and higher satisfaction in all diagnostic groups. Short duration of leg/arm pain was associated with more pain improvement and higher satisfaction in the LDH, LSS and CR groups. Age and gender were of minor importance, as was ODI, whereas a higher quality of life (EQ-5D) at baseline predicted higher satisfaction at follow-up for all but the CR group.

Discussion
The Dialogue Support, based on the national Swedish quality register "Swespine", presents the predicted outcome after surgery for degenerative spinal disorders. The outcome is presented as a percentage of outcome according to global assessment of pain (GA) and satisfaction with outcome based on the patient's characteristics combined with algorithms describing the relationship between patient characteristics and outcomes based on large amounts of historical data.
The calibration plots of subgroups demonstrate a high degree of concordance, with minor differences between diagnostic groups. The message to the patient can be expressed as follows: in the group of earlier operated individuals with a similar baseline profile as you, a certain percentage reported after 1 year "complete relief of pain", another percentage reported "much better", a third percentage "somewhat better", etc. This is visualized in the pie diagram.
On an individual level, as estimated with ROC curves, the precision of the predictive model was fair. For global assessment of pain, the AUC value ranged from 0.67 to 0.68. For satisfaction, it ranged from 0.60 (DDD) to 0.66 (LDH). Other reports with PROMs as outcome measure and logistic regression as analytic method describe AUC values and c-indexes ranging from 0.64 to 0.79, mostly tested on single-centre cohorts [5,6,9,11,14].
In recent years, new computer-based analytic methods, often called "machine learning" and "deep learning", have been proposed as possibly more powerful methods of data acquisition and analysis. Whether these techniques offer an advantage has not yet been determined [26]. Reported AUC values or c-indexes with these analytic methods range from 0.59 to 0.90 [8,12,21,22]. The more critical aspect of outcome prediction, and of the possibilities for improving precision, is probably the addition of more predictor variables. This appears to be more important than focusing on analytic techniques, although different analytic models may change the predictive potential of a particular data set, as has been demonstrated on a hip and knee arthroplasty cohort [27].
The limitation of the current model as such is related to the baseline and socio-demographic variables at hand. Evaluation and development of the Dialogue Support is a continuous process. Changes in surgical techniques, processes, training, support, equipment and medication may affect outcome over time. This is accounted for by yearly updating of the database. The yearly updates and the introduction of new baseline variables are also expected to increase the precision of the model on the individual level over time. Dropouts may tend to have worse outcomes than patients who respond at follow-up, also in registers [28], so there is a possible risk that the model overestimates success. However, this is still an unresolved question [29].
A possible selection bias could be caused by the proportion of patients not having their index surgery recorded. However, there are only exceptional occurrences of patients opting out from Swespine. The major cause of loss to registration is deficient routines in some of the participating centres, not dependent on individual patients. In our interpretation, this does not cause any bias in the national register. A second possible cause of selection bias would be loss to follow-up. This has been assessed in a recent publication [28].
When it comes to application of the Dialogue Support in other countries, the Swedish reference database can be a limitation to generalizability. Spine patients in different countries have different cultural and social conditions, which may affect the predictive ability of the model. Interpretation of predictions should be done with this in mind, until validation tests have been performed. Ideally the model should be tested on a national, or other large, database in the country in question.
The strength of the Dialogue Support is related to the large reference population in the national Swespine registry, which has a coverage of 95%, a completeness of 85% and a one-year follow-up of more than 70%. Thus, the data on which the prediction model is based represent the "degenerative spine population" in Sweden over the last ten years well. Predictions can thus be generalized across the entire nation and applied to all spine centres. Dropouts and missing data may limit the generalizability, which we hope to reduce with increasing web-based registration.
In the aggregated perspective, the Dialogue Support, acting as "one piece of the puzzle", can support the clinician's clinical experience and judgement. In popular terms, the Dialogue Support makes it possible to describe outcome as a certain probability of a successful and satisfactory result after a proposed surgical intervention.
Thus, the Dialogue Support offers both patient and surgeon the opportunity to contemplate and discuss the probability of benefit and risk based on more substantial evidence than the experience and conjecture of an individual surgeon. It is also an interesting opportunity to start international research cooperation based on the Dialogue Support.
Our ambition is to validate these models further once new data become available (data are continuously collected in Swespine, approximately 10 000 patients per year), and we can also foresee validating the models in other countries where similar (albeit not identical) data are collected.

Conclusion
The Dialogue Support is a useful prediction tool with an accuracy which is high on a group level and moderate on an individual level. It can serve as an aid to both patient and surgeon when discussing a surgical treatment of degenerative conditions in the lumbar and cervical spine.
Funding No funding was obtained.
For ease of interpretation, model coefficients are presented on the log scale (log odds ratios). As opposed to the odds scale, the log odds scale is symmetrical around zero. GA is coded so that 1 is the best assessment and 5 is the worst assessment, while satisfaction is coded so that 1 is satisfied. Hence, a negative/positive coefficient implies that a higher value on the predictor is associated with lower/higher probability of worse global assessment and higher satisfaction. An empty cell implies that the variable was not included in the final model after the variable selection procedure
*CR (cervical radiculopathy): Duration back pain corresponds to Duration neck pain, Duration leg pain to Duration arm pain, VAS back pain to VAS neck pain, VAS leg pain to VAS arm pain and ODI to NDI (Neck Disability Index)
Data Availability Individual data are not publicly available, but group-level data are available through yearly reports by the steering committee of Swespine (www.4s.se).

Declarations
Conflict of interest JM holds stock in LOGEX Healthcare Analytics, a company specialized in analysis of health care data. The other authors declare no conflict of interest.
Consent for publication All patients in Swespine are given written information about the registry, including information that their data can be used for research and publication if they accept to participate in the register and that they can withdraw their data at any time.
Ethical Approval Retrospective data are used in the current article. All data are anonymized. There are no individual data in the Dialogue Support; all data are on a group level. Data are national, cannot be traced to a specific location or to an individual, and are not included in the medical record. According to Swedish law, this implies that ethical approval is not needed.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.