Introduction

Patients today are, correctly, much more involved in decision-making regarding their treatment than they used to be [1]. Physicians and surgeons treating spinal disorders should be able to make evidence-based predictions regarding the outcome of their treatments, based on reliable prognostic information. In the last few years, a number of statistical prediction models have been developed to predict the outcome of spine surgery [2,3,4,5,6,7,8], mainly focusing on the benefits of the intervention regarding pain relief, quality of life improvement and/or return to work.

In view of recent developments in shared decision-making, not only the benefits but also the risks associated with different treatment modalities must be clearly communicated to the patient. For this reason, risk calculators have also been developed to predict complication rates [4, 9,10,11,12,13].

Lee et al. were the first to create a predictive model assessing the risk of medical complications following spine surgery and to develop an online tool for its clinical use [10]. Their study was based on a population of 1476 patients, split into two subsets for internal and cross-validation. Although successful, they acknowledged that the accuracy of such predictive models would be improved with greater power. Also, their model only evaluated the risk for general medical complications and lacked a surgical complication counterpart. Later on, they developed a model for surgical site infection—one of many possible surgical complications—on the same patient population [9]. The model showed reasonable success as far as its discriminative capacity was concerned, but no information was provided regarding how well the predictions were aligned with the observed outcomes (i.e. "calibration").

Kasparek et al. [14] sought to validate the general-medical complication model of Lee et al. [10]. In total, 44 patients developed a medical complication in a population of 273 patients undergoing spinal surgery. The model demonstrated adequate prediction of the medical complication risk group (low, medium, high), but the authors conceded that, with only 273 patients, the analysis may have been underpowered. Janssen et al. [15] investigated the validity of the surgical site infection model of Lee et al. [9] in a population of 898 patients undergoing thoracolumbar spine surgery. They demonstrated a Nagelkerke’s R2 of 0.01, indicating poor external predictive strength.

Kim et al. [16, 17] used artificial neural networks in addition to classic logistic regression methods to identify risk factors for various types of complication in two subsets of spine patients: those undergoing elective adult spinal deformity surgery [16] and those undergoing posterior lumbar spine fusion [17]. They trained their multivariable models using data from the American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP) database, and model performance was compared with that of a model containing just the American Society of Anesthesiologists (ASA) score as predictor. The areas under the receiver operating characteristic curves (AUC) were better for the multivariable models (0.54–0.84, depending on the complication in question) than for ASA alone (0.37–0.52). These results were promising for the advancing field of artificial neural network models. However, no online tool was developed for further use of the models or for the assessment of their validity in the clinical setting. Indeed, neural networks are notoriously more challenging to use for the development of decision-support systems, since the most important input variables are more difficult to identify than they are in regression models.

The aforementioned studies focused on medical complications based on small databases, were conducted as single-centre studies, had poor external validity, or used novel statistical/machine-learning approaches to produce models that do not easily lend themselves to further validation by others.

In the present study our aim was to combine previously identified individual predictors of outcome, e.g. number of previous spine surgeries [18, 19], age [20], ASA score [21], complexity of the surgery [22], BMI [23], and smoking status [24, 25] in a multivariable model to predict complications, using a large multicentre dataset from the EUROSPINE “Spine Tango” registry [26]. We sought to develop two models and a web-based tool for predicting the likelihood of incurring a perioperative complication in connection with spine surgery: one focusing on general medical complications and the other, on surgical complications.

Methods

Source of data

This was a retrospective multicentre registry data-based study of prospectively collected data within the EUROSPINE “Spine Tango” Registry [26]. The analysis of these routinely collected, anonymised data was approved by the Swiss Ethics Committees for research involving humans [27].

The registry currently includes over 120,000 surgical patient cases from spine centres in more than 20 countries. Medical history and surgical details are documented by the surgeon using the Spine Tango surgery form, as are surgical and general medical complications arising between admission and discharge.

Participants

Inclusion and exclusion criteria

Patients included in the present study were operated on between January 2012 and December 2017. All included patients had spine surgery for degenerative disorders of the lumbar spine in one of the participating EUROSPINE Spine Tango centres. Patients aged between 18 and 95 were identified on the basis of a "main pathology" documented as "degenerative disease", and "level of intervention" as "thoracolumbar", "thoraco-lumbo-sacral", "lumbar", "lumbo-sacral" or "sacral" on the Spine Tango 2011 Surgery form. All patient cases with missing values for the variables indicated below were excluded from the analysis (see later).

Outcome

We defined two independent binary (presence/absence) outcome variables describing perioperative complications. First, the occurrence of a surgical complication arising either intraoperatively (nerve root damage, spinal cord damage, dura lesion, vascular injury, fractures of vertebral structures) and/or postoperatively before discharge (epidural or other hematoma, radiculopathy, CSF leak/pseudomeningocele, motor/sensory dysfunction, bowel/bladder dysfunction, wound infection, implant malposition, implant failure, wrong level). Second, the occurrence of a general complication arising either intraoperatively (anaesthesiological, cardiovascular, pulmonary, thromboembolism, death) and/or postoperatively before discharge (cardiovascular, pulmonary, cerebral, kidney/urinary, liver/gastrointestinal, thromboembolism, death).

Predictors

The selection of predictors was based on evidence of their predictive capacity in the current literature [18,19,20,21,22,23,24,25, 28]. All the predictor variables are collected systematically in the registry and are typically known or decided upon preoperatively; they are recorded in the "admission" section of the Spine Tango Form. They included: age; sex (male, female); whether patients had had previous spine surgery (if so, whether at the same or at a different spinal level); body mass index (BMI, < 20, 20–25, 26–30, 31–35, > 35 kg/m2); smoking status (no, yes); morbidity state [using the American Society of Anaesthesiologists Physical Status Score (ASA; scored 1–5)]; the modified Mirza Invasiveness Index (representing the invasiveness of the planned surgical procedure) [29, 30]; the presence of an additional pathology (other than degenerative disease, such as fracture/trauma, non-degenerative deformity, infection, tumour, etc.); planned preoperative prophylaxis (infection, thromboembolism, ossification); and planned intraoperative technology [conventional or minimally invasive spine surgery (MISS)/less invasive spine surgery (LISS)].

Sample size

Sample size considerations for prediction models need to take into account whether the outcome number of events is large enough for fitting multiple predictor models, taking all relevant predictors into account simultaneously. The generally accepted rule is to have at least ten events per variable [30,31,32]. Given that, in the planned study, 10 independent predictor variables were to be evaluated simultaneously, this implied at least 100 events (occurrence of a complication) observed for each of the outcomes, a condition that was met for the generation of separate models to predict general medical complications and surgical complications (see later).
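The events-per-variable check described above amounts to simple arithmetic, sketched below in Python. The function names are hypothetical, and the event counts (763 general medical and 2534 surgical complications) are those reported in the Results. Treating the model as having exactly 10 predictors follows the text; categorical predictors with several levels contribute more than one regression parameter, so a stricter count would give a lower ratio.

```python
def events_per_variable(n_events: int, n_predictors: int) -> float:
    """Return the number of outcome events available per candidate predictor."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    return n_events / n_predictors


def meets_epv_rule(n_events: int, n_predictors: int, minimum_epv: float = 10.0) -> bool:
    """Check the 'at least ten events per variable' rule of thumb."""
    return events_per_variable(n_events, n_predictors) >= minimum_epv


# Event counts as reported in the Results; 10 predictors as stated in the text.
for outcome, n_events in [("general medical", 763), ("surgical", 2534)]:
    epv = events_per_variable(n_events, 10)
    print(f"{outcome}: {epv:.1f} events per variable, rule met: {meets_epv_rule(n_events, 10)}")
```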

Missing data

A complete case analysis was used for the development of the prediction models. As such, cases with any missing data regarding the previously noted predictors or outcomes were excluded from the analysis. The results were reanalysed at a later stage in a sensitivity analysis based on imputed data.

Statistical analysis methods

Descriptive statistics included median and interquartile ranges or mean and standard deviations (SD) for continuous variables and counts and percentages of total for categorical variables. The two binary outcome variables, general medical perioperative complications and surgical perioperative complications, were addressed with multiple logistic regression models fitted to each outcome.

Predictor models with many prognostic indicators tend to fit the data used in the study optimally, but their predictions for new subjects perform less well. This problem is known as overfitting. To address overfitting in our models, we used shrinkage, a technique in which the regression coefficients of the prediction models are multiplied by a global shrinkage factor (a real number < 1), leading to a reduction in their values ("shrinkage") towards zero. We used the dfbeta method [33] to derive a separate global shrinkage factor for each of the two models. This method approximates leave-one-out cross-validation (LOOCV), a common but computationally less efficient approach that is difficult to apply to large datasets [33] such as the present one. Original and shrunken regression coefficients were presented as odds ratios (OR) with 95% confidence intervals (CI).
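The effect of a global shrinkage factor on the coefficients can be illustrated with a minimal Python sketch. This is not the dfbeta estimation procedure of the shrink package; it only shows the final multiplication step. The coefficient names, their values and the factor of 0.98 are assumptions chosen for illustration, and the handling of the intercept is left out.

```python
import math


def apply_global_shrinkage(coefficients: dict, shrinkage_factor: float) -> dict:
    """Multiply each (non-intercept) log-odds coefficient by one global factor."""
    if not 0.0 < shrinkage_factor <= 1.0:
        raise ValueError("shrinkage_factor is expected to lie in (0, 1]")
    return {name: beta * shrinkage_factor for name, beta in coefficients.items()}


def to_odds_ratios(coefficients: dict) -> dict:
    """Convert log-odds coefficients to odds ratios."""
    return {name: math.exp(beta) for name, beta in coefficients.items()}


# Hypothetical unshrunken coefficients (log odds ratios), not values from the study.
unshrunken = {"age_per_year": math.log(1.03), "asa_3": math.log(3.0)}
shrunken = apply_global_shrinkage(unshrunken, 0.98)

for name in unshrunken:
    print(name, round(math.exp(unshrunken[name]), 3), "->", round(math.exp(shrunken[name]), 3))
```

Because the multiplication happens on the log-odds scale, every odds ratio moves towards 1: ratios above 1 decrease and ratios below 1 increase.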

Model performance

To assess model performance, we evaluated how well the predicted probability of a complication corresponded to the actual observed complication rate, by assessing the model's discriminative ability and calibration. Discrimination was examined using the area under the receiver operating characteristic curve (AUROC, hereafter AUC; also known as the c-statistic) with 95% CI. In the ROC curve, sensitivity is plotted against 1 − specificity. In general, an AUC of 0.5 suggests no better discrimination than tossing a coin, 0.6–0.7 is considered possibly helpful, 0.7–0.8 is considered acceptable, 0.8–0.9 is considered excellent, and more than 0.9 is considered outstanding [34].
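The AUC has a direct interpretation as the probability that a randomly chosen patient with a complication receives a higher predicted risk than a randomly chosen patient without one (ties counting one half). The Python sketch below computes it in that pairwise form on toy data. It is an illustration only, not the pROC routine used in the study, and its quadratic run time makes it unsuitable for large datasets.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and non-cases."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes must be present")
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0  # case ranked above non-case
            elif p == n:
                concordant += 0.5  # ties count one half
    return concordant / (len(positives) * len(negatives))


# Toy predicted risks and observed outcomes (1 = complication), invented for illustration.
toy_scores = [0.02, 0.05, 0.10, 0.20, 0.30, 0.40]
toy_labels = [0, 0, 1, 0, 1, 1]
print(round(auc(toy_scores, toy_labels), 3))
```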

Calibration of a prediction model measures the agreement between observed outcomes and predictions. In the present study, internal calibration of the two models was assessed using calibration plots. Internal calibration refers to agreement between observed and predicted probabilities in the sample in which the model was developed, showing how well the model represents the observed reality and whether it tends to over- or underestimate the probability of an event [35,36,37].
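One simple way to inspect calibration numerically is to group cases by predicted risk and compare the mean prediction with the observed event rate in each group. The Python sketch below does this with equal-width bins on toy data. The binning scheme and the function name are assumptions made for illustration; the calibration plots in this study may have been produced differently (for example, with smoothed curves).

```python
def calibration_table(predicted, observed, n_bins=5):
    """Return (mean predicted risk, observed event rate, n) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, observed):
        if not 0.0 <= p <= 1.0:
            raise ValueError("predicted risks must lie in [0, 1]")
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_predicted = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        rows.append((mean_predicted, observed_rate, len(members)))
    return rows


# Toy data, invented for illustration: four low-risk and two high-risk predictions.
toy_predicted = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9]
toy_observed = [0, 0, 0, 1, 1, 1]
for mean_predicted, observed_rate, n in calibration_table(toy_predicted, toy_observed, n_bins=2):
    print(f"predicted {mean_predicted:.2f}  observed {observed_rate:.2f}  n={n}")
```

A well-calibrated model gives rows in which the two rates are close; a bin whose mean prediction clearly exceeds the observed rate indicates overestimation of risk in that range.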

Sensitivity analysis

A sensitivity analysis was performed to assess the potential bias in the results as a consequence of using complete case analysis. For this purpose, a single dataset was imputed based on the assumption that data were missing at random (MAR). The imputed dataset was obtained by single imputation, in which missing values for our predictors (e.g. missing smoking status, BMI or ASA score) were filled in using multivariate imputation by chained equations [38, 39]. After single imputation, the coefficients of the two models, the corresponding AUCs and the calibration plots were re-estimated using the aforementioned methods.

All analyses were conducted using R for Windows [40], using the packages openxlsx, tableone, pROC, tidyverse, shrink [33], boot, biostatUZH, mice [39] and gbm. The work was carried out following the concept of reproducible research, and the R-code is available upon request [41]. The results of the study were reported according to the TRIPOD guidelines [42].

Results

Participants

Figure 1 shows the flowchart for patient inclusion in the study. In total, 68,111 cases were registered in the database at the time of data export. Of these, 54,452 were degenerative cases, of which 43,557 included the lumbar spine (according to the definition above). Selecting patients between age 18 and 95 years resulted in 43,461 cases. Using a complete case approach to the analysis, the final sample size was N = 23,714, with the total number of cases excluded being 19,747 (45.4%). The variables with the most missing data were smoking status (39.4%), morbidity status (18.2%) and BMI (15.4%). The baseline characteristics of the final study group are shown in Table 1. The mean age was 58.9 (15.7 SD) years, and 11,450 (48.3%) were males. In total, 16,921 (71.4%) patients had had no previous surgery. Patients were most frequently (9176; 38.7%) in the BMI category 26–30 kg/m2. Most participants, 18,799 (79.3%), did not smoke. The most common morbidity state was ASA-2 (12,941 (54.6%) patients). The median Mirza score was 2 (interquartile range 1 to 7). Most participants had infection prophylaxis (22,832; 96.3%), and the majority had thromboembolism prophylaxis (17,754; 74.9%). There were more conventional technologies used (9437; 39.8%) than MISS/LISS (2502; 13.7%); the most commonly used technology was microscope (14,268; 60.2%).

Fig. 1 Patient selection flowchart

Table 1 Baseline characteristics of the study group

Details regarding the incidence of the different types of complication are shown in Table 2 for general medical complications and Table 3 for surgical complications. Overall, 763/23,714 (3.2%) patients had a general medical complication and 2534/23,714 (10.7%) had a surgical complication, indicating that sufficient events were observed to fit the multiple prediction models. The most common intraoperative general complication was of a cardiovascular nature (25; 0.11%), and the most common postoperative general complication was kidney/urinary problems, reported in 200 (0.84%) cases. The most common intraoperative surgical complication was dural tear, reported in 1638 (6.91%) cases, and the most common postoperative surgical complication was motor dysfunction (168; 0.71%).

Table 2 General complication counts and percentages in the group of 23,714 patients
Table 3 Surgical complication counts and percentages in the group of 23,714 patients

Model development

Model predicting general medical complications

The calculated shrinkage factor for the general medical complication model was 0.98, indicating that not much overfitting was present. The odds ratios, their 95% confidence intervals (CI) and the p values for the shrunken prediction model for general medical complications are shown in Table 4. Higher age (OR 1.03, 95% CI 1.03–1.04 per year) was associated with greater odds of having a complication. An ASA score of 2 or more was also associated with greater odds of having a complication, with the effect being more marked the higher the ASA score (ASA 2, OR 1.6, 95% CI 1.2–2.12; ASA 3, OR 2.98, 95% CI 2.19–4.07; ASA 4, OR 5.62, 95% CI 3.04–10.41), as were more complex procedures according to the modified Mirza score (OR 1.03, 95% CI 1.02–1.04 per point increase) and conventional surgical technology having been used (OR 1.32, 95% CI 1.12–1.54). Using infection prophylaxis (OR 0.59, 95% CI 0.37–0.92) was associated with reduced odds of a general medical complication.

Table 4 Shrunken regression coefficients for the general complications model

Model predicting surgical complications

The calculated shrinkage factor for the surgical complication model was 0.97. The odds ratios, their confidence intervals (95% CI) and the p values for the shrunken prediction model for surgical complications are shown in Table 5. Higher age (OR 1.02, 95% CI 1.01–1.02), previous spine surgery at the same level (OR 1.9, 95% CI 1.71–2.12), BMI over 35 (OR 1.29, 95% CI 1.00–1.67), an ASA score of 3 (OR 1.23, 95% CI 1.06–1.43), a higher modified Mirza score (OR 1.01, 95% CI 1.00–1.01), any additional spine pathology (OR 1.3, 95% CI 1.14–1.49), ossification prophylaxis (OR 2.18, 95% CI 1.38–3.46) and conventional surgical technology having been used (OR 1.12, 95% CI 1.02–1.22) were each associated with increased odds of having a surgical complication. Male sex (OR 0.81, 95% CI 0.75–0.89) was associated with decreased odds of incurring a surgical complication, as was using thromboembolism prophylaxis (OR 0.85, 95% CI 0.77–0.94).

Table 5 Shrunken regression coefficients for the surgical complications model

Model performance

The ROC curves for the models can be seen in Fig. 2 (general medical complications) and Fig. 3 (surgical complications). The AUC for the model for general complications was 0.74 (95% CI: 0.72–0.76), while that for surgical complications was 0.64 (95% CI: 0.62–0.65) after shrinkage. The calibration plots for the two models are shown in Fig. 4 (general complications) and Fig. 5 (surgical complications). In the calibration curve for the general complications model, the observed values agreed well with the predicted values up to a predicted probability of 0.3. However, beyond this point, higher predicted probability values corresponded to much lower observed values, and the confidence intervals widened markedly. The same pattern was seen for the surgical complications model, but there the inflection point was reached at a predicted probability of about 0.25.

Fig. 2 ROC curve for the general complication model. AUC 0.74 (95% CI: 0.72–0.76)

Fig. 3 ROC curve for the surgical complication model. AUC 0.64 (95% CI: 0.62–0.65)

Fig. 4 Calibration plot for the general complication model. The y-axis shows the observed average probability of complications; the x-axis shows the model's corresponding predicted values. The red line indicates optimal calibration; the black line represents the model's calibration, with confidence limits given by the yellow area

Fig. 5 Calibration plot for the surgical complication model. The y-axis shows the observed average probability of complications; the x-axis shows the model's corresponding predicted values. The red line indicates optimal calibration; the black line represents the model's calibration, with confidence limits given by the yellow area

Sensitivity analysis

The estimated coefficients for the predictors in the dataset with single imputation are shown in "Appendix" (Tables 8 and 9). Shrinkage factors calculated for the single imputation method were 0.985 for general medical complications and 0.980 for surgical complications. Recalculating the AUC for the single imputation dataset resulted in an AUC of 0.75 (95% CI: 0.74–0.76) for general medical complications and an AUC of 0.64 (95% CI: 0.63–0.65) for surgical complications. These results were then compared to the complete case dataset. The ROC curves and calibration plots for the sensitivity analysis can also be found in "Appendix" (Figs. 6, 7, 8, 9).

Discussion

Summary and Interpretation

We developed two models to predict general medical and surgical complications during spine surgery, based on the data collected within the EUROSPINE Spine Tango registry over a period of 6 years. The issue of overfitting was addressed by using shrinkage.

The ASA grade had the largest odds ratio for the probability of incurring a general complication, with a higher grade increasing the odds. The effect of the ASA grade on the incidence of general and surgical complications in spine surgery has been shown in numerous previous studies [21, 44,45,46,47].

Previous spine surgery at the same vertebral level had the largest per point odds ratio for an increased probability of a surgical complication. Nonetheless, very high values in continuous variables such as age or the modified Mirza score, which also increased the odds of a complication, could potentially exceed the per point odds ratio of previous spine surgery. The effect of previous spine surgery is already known and has been quantified before in data from the same registry [18, 19], and higher age and/or more invasive procedures have also been shown in previous studies to be associated with a greater likelihood of incurring a surgical complication [22, 43]. Our model also showed an increase in odds for surgical complications when additional non-degenerative spine pathologies are present. A possible explanation for this could be that the additional pathology makes the intervention more difficult and non-standard methods have to be used.
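The comparison between a per-point odds ratio and the odds ratio for previous surgery can be made concrete. Because per-unit odds ratios multiply, the number of units at which a continuous predictor matches a target odds ratio is the ratio of the two log odds ratios. The Python sketch below applies this to the rounded surgical-model odds ratios reported in the Results (1.02 per year of age, 1.01 per Mirza point, 1.9 for previous surgery at the same level). The function name is hypothetical, and the results are approximate because they are derived from rounded values.

```python
import math


def units_to_match(or_per_unit: float, target_or: float) -> float:
    """Units of a continuous predictor whose combined odds ratio equals target_or."""
    if or_per_unit <= 1.0 or target_or <= 1.0:
        raise ValueError("this sketch assumes odds ratios greater than 1")
    return math.log(target_or) / math.log(or_per_unit)


# Rounded odds ratios from the surgical complication model (Results section).
years_of_age = units_to_match(1.02, 1.9)
mirza_points = units_to_match(1.01, 1.9)
print(f"age: about {years_of_age:.0f} years; Mirza score: about {mirza_points:.0f} points")
```

On these rounded figures, an age difference of a little over three decades, or a difference of more than 60 Mirza points, carries roughly the same increase in odds as previous surgery at the same level.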

The models developed in the present study can be used when discussing a possible surgical intervention in a shared decision-making situation, along with models predicting the possible benefits or expected average outcomes, where models/baseline variables to predict individual outcome are not available. By using the regression coefficients from the model, one can estimate the risk of having either a surgical or medical complication in an individual patient. For example, a male patient, age 40 y, body mass index between 20 and 25 kg/m2, non-smoker, without previous spinal surgery, ASA score of 1, undergoing L5–S1 posterior discectomy without posterior fusion (Mirza score = 1), no additional pathologies, using prophylaxis for infection and thromboembolism, has a calculated risk of 5.11% of having a surgical complication. The same patient has a calculated risk of 0.87% of suffering a medical complication.

In comparison, a female patient, age 59 y, body mass index between 26 and 30 kg/m2, smoker, previous surgery at the same level, ASA score of 2, undergoing L4–S1 posterolateral fusion with pedicle screws and no decompression (Mirza score = 6), no additional pathologies, using infection and thromboembolism prophylaxis, has a calculated risk of 16.8% of experiencing a surgical complication and a calculated risk of 2.76% of developing a medical complication. This preoperative knowledge might help patients and surgeons alike to decide upon steps that could be taken pre- or perioperatively to minimise the risk of a complication, e.g. smoking cessation, weight loss, less invasive surgery. Based on the predictor models, a freely available, web-based prediction tool has been developed (https://sst.webauthor.com/go/fx/run.cfm?fx=SSTCalculator).
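The arithmetic linking odds ratios to individual risks can be sketched in a few lines of Python. In a logistic model, changing one predictor multiplies the odds (not the risk) by that predictor's odds ratio, with all other predictors held fixed. The example below starts from the 5.11% surgical complication risk of the first example patient and applies the odds ratio of 1.9 for previous surgery at the same level. It illustrates the principle only and is not the output of the web-based tool; the function names are hypothetical.

```python
def risk_to_odds(risk: float) -> float:
    """Convert a probability into odds."""
    if not 0.0 < risk < 1.0:
        raise ValueError("risk must lie strictly between 0 and 1")
    return risk / (1.0 - risk)


def odds_to_risk(odds: float) -> float:
    """Convert odds back into a probability."""
    return odds / (1.0 + odds)


def apply_odds_ratio(baseline_risk: float, odds_ratio: float) -> float:
    """Risk after multiplying the baseline odds by one odds ratio."""
    return odds_to_risk(risk_to_odds(baseline_risk) * odds_ratio)


# Baseline risk of the first example patient; OR for previous surgery at the same level.
adjusted_risk = apply_odds_ratio(0.0511, 1.9)
print(f"{adjusted_risk:.1%}")
```

The adjusted risk comes out at roughly 9%, which is less than 1.9 times the baseline risk, because the odds ratio acts on the odds scale.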

To assess the performance of our models in terms of discrimination, we calculated ROC curves and their AUCs. The model predicting medical complications showed an AUC of 0.74, which can be considered acceptable and which was similar to the AUC of 0.76 reported for the model of Lee et al. [10]. The discriminative ability of the model predicting surgical complications was lower, with an AUC of just 0.64, but might still be considered possibly helpful. A possible explanation for the lower AUC for the surgical complications model might be that surgical complications in general are less predictable and depend more on the surgeon's skill and experience than on the patient's baseline characteristics included in our model, although previous studies have shown that the chosen variables do have a predictive value for complications in degenerative spine surgery. In comparison, general complications could depend more on the physical status of the patient and hence be easier to predict.

Internal calibration was assessed using calibration plots. In the calibration plot for the general complications model we saw that the observed probabilities agreed well with the estimated probabilities for low predicted probabilities, and thus, the model shows a relatively satisfactory goodness-of-fit in that region. Beyond a predicted probability of approximately 0.3, however, the predicted values clearly overestimated the observed values and thus led to an overprediction of complication risks. The same applied to the model for surgical complications, which tended towards overestimation from a probability of about 0.25 onwards. One possible explanation for this could be that the complication rates in our dataset were low and therefore the risk of, and accuracy in predicting, an unfavourable event was also very low. In other words, in reality there were no cases with high risk and thus high-risk cases could not be used to train the model. Either way, this phenomenon should be borne in mind whenever these models return a comparatively high predicted probability (above 0.25–0.30) during risk assessment.

The aforementioned findings were confirmed in sensitivity analyses using single imputation for missing data. In these we found similar tendencies regarding the influence of the different predictor variables, the calculated AUCs for the ROC analyses and the calibration curves.

Tendencies for overfitting of our models appeared to be very small, as our shrinkage factors were close to 1. This suggests that the models may be applicable to populations outside of the given study group, although this should be verified by external validation.

Reported complication rates in spine surgery for degenerative disease range from 3.7 to 16%, depending on the definition of complication used and the technique being focused on [48,49,50]. In our population we found 3.2% medical complications and 10.7% surgical complications. A problem in the current literature is the reliability of complication reporting since there are no generally acknowledged reporting standards [51]. Overall, we found no notable discrepancy with the previously reported complication rates in our population group. When the number of cases in the registry has increased sufficiently to include a variable reflecting, for example, the type or geographical region of the contributing centres, it may be possible to calibrate/customise the model to accommodate differences in the thresholds for reporting.

Strengths and limitations

We view the use of a large multicentre database for model development as a relative advantage regarding the prediction of complications in lumbar spine surgery, when compared with other models previously developed. The model developed by McGirt et al. [4] used a single-centre database. The ACS-NSQIP [13] was developed using a heterogeneous patient population, and by only accepting one Current Procedural Terminology (CPT) code for each risk calculation, it may have underestimated the complexity of the surgery [52, 53], since many spine surgery cases comprise multiple procedures in one operation. In our model, complex spine surgery is taken into account using the modified Mirza index [29, 54]. Ratliff et al. [12] used the large MarketScan dataset, which included mostly younger patients. As the authors themselves conceded, this can be seen as a major limitation, since an increasing number of spine patients are elderly, due to the effects of degenerative disease on the aging spine [55]. A drawback of Lee et al.’s predictive model [9, 10] is that it was developed using only 1476 patients, which may not be a large enough sample. In addition to the above, our prediction of surgical complications included a range of different types, both intraoperatively and prior to discharge, rather than only surgical site infection [9], although our model was not able to specifically predict the likelihood of incurring any particular surgical complication or its severity. Including the modified Mirza index as a measure of surgical invasiveness obviated the need to sub-divide the data to produce separate predictor models for different spine surgical procedures of varying complexity and hence improved the overall power of the results.

A limitation of our study was that the prediction models were not validated using an external dataset. Although we tried to avoid overfitting of our models by using shrinkage, this is not a substitute for an actual external validation.

Another limitation of the study was the complete case analysis approach (i.e. we only included patients for whom there were no missing data), which could have introduced selection bias. The missing data are the result of the incomplete filling out of the Spine Tango surgery forms by the participating surgeons. We tried to address this limitation by performing a sensitivity analysis using single imputation and found our results to be robust. In the present study there was no analysis of complications occurring after hospital discharge, since the collected data were focused on complications during the hospital stay, as recorded on the Tango Surgery form. Finally, surgeons and patients using these models should keep in mind that the developed models tend to overestimate risks when assessing higher risk situations, as discussed in detail above.

We also emphasise that not only the risks, but also the benefits of surgery should be communicated adequately to the patient, such that both can be taken into account when making decisions about surgery. As such, any risk calculations using our model should ideally go hand in hand with estimates of pain and functional improvement when planning surgery.

Implications for future research

To ensure their broader applicability, our models should be externally validated. Also, separate prediction models could be developed for different spine pathologies, which would further increase their applicability. Analogous models could also be built looking at the risk of complications after discharge using the data from the Spine Tango Follow-Up form, which documents complications arising after hospitalisation up to many years' follow-up. Similarly, models predicting the likelihood of clinically relevant improvements in patient-oriented outcomes should be developed.

Conclusion

We were able to build two predictor models that can be used to predict the probability of incurring a complication during or shortly after spine surgery (before discharge). Of the two, general complications could be predicted with greater discriminative ability than surgical complications. Reoperations at the same level were a predominant predictive factor for surgical complications, and a higher ASA score showed the highest odds for general complications. Complication rates were in the expected range, as reported in the literature. A freely available, web-based prediction tool has been developed for the purposes of further testing and validation (https://sst.webauthor.com/go/fx/run.cfm?fx=SSTCalculator).

The findings of this study are relevant for patient counselling and informed and shared decision making.