Background

Anterior cervical discectomy and fusion (ACDF) is a common surgical procedure for the treatment of cervical spine conditions, such as spondylosis or stenosis, that cause radiculopathy and/or myelopathy [1, 2]. The anterior approach enables direct decompression of the spinal cord and reconstruction of the anterior column of the spine while providing access to the cervical spine along anatomic planes [3, 4]. According to recent literature, the frequency of ACDF has increased by up to 400% since 2011 [5]. The increasing use of ACDF underscores the need to anticipate adverse postoperative outcomes preoperatively [6,7,8,9,10].

In an effort to control healthcare costs, emphasis is being placed on the use of registries and databases to track and establish risk-adjusted estimates for these outcomes. This has required clinicians to manage extensive volumes of complex data, sparking the need for robust analytical techniques [11]. Machine learning (ML) algorithms, capable of leveraging high-dimensional clinical data, are increasingly employed to develop accurate patient risk assessment models, contribute to the development of guidelines, and tailor care to individual patient needs, thereby influencing healthcare decisions. These algorithms present several advantages over traditional prognostic models, which typically employ some form of linear or logistic regression. First, ML seldom requires prior knowledge of primary predictors [12]. Second, advanced ML algorithms often impose fewer constraints on the number of predictors used for a given dataset than logistic regression, which is beneficial when handling large datasets with numerous predictors, where associations between predictors and outcomes are not always obvious. Last, these algorithms can identify complex, nonlinear relationships within datasets, which are often overlooked by regression-based models [13]. Owing to these advantages, ML algorithms frequently outperform regression methods in terms of reliability and accuracy when applied to identical datasets [14, 15].

Several studies have demonstrated the predictive potential of ML models for various spinal procedures and pathologies, including ACDF [11, 16,17,18,19,20,21,22,23,24,25]. Yet, the vast majority of these investigations exist as feasibility studies, contributing little towards the practical application of these models in clinical environments. Our study seeks to address this gap by developing ML models focused on the prediction of short-term adverse postoperative outcomes after ACDF for degenerative cervical disease. We focus on short-term outcomes because they have critical implications for hospital reimbursements, surgeon evaluations, and patient recovery and satisfaction. Following model development, we incorporate these models into an accessible web application, thereby demonstrating their pragmatic value.

Methods

The methodology employed is summarized with a flowchart in Fig. 1.

Fig. 1 Methodology flowchart

Data source

Data for this study were obtained from the American College of Surgeons (ACS) National Surgical Quality Improvement Program (NSQIP) database, which was queried to identify patients who underwent ACDF from 2014 to 2020. Detailed information about the database and data collection methods has been provided elsewhere [26].

Guidelines

We followed the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement [27] and the Journal of Medical Internet Research (JMIR) Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research [28].

Study population

We queried the NSQIP database to identify patients who met the following inclusion criteria: (1) Current Procedural Terminology (CPT) codes for ACDF surgery (22551, 22552, 22554, and 22585), (2) elective surgery, (3) operation under general anesthesia, and (4) surgical subspecialty of neurosurgery or orthopedics. We excluded patients with the following criteria: (1) emergency surgery, (2) any unclean wounds (defined by wound classes 2 to 4), (3) sepsis, shock, or systemic inflammatory response syndrome within 48 h before surgery, (4) American Society of Anesthesiologists (ASA) physical status classification score of 4, 5, or not assigned, (5) still in hospital after 30 days, since the NSQIP database captures postoperative outcomes up to 30 days after surgery, and (6) discharge to hospice, departure against medical advice, or death. We excluded patients with 30-day mortality because our preliminary analysis of the cohort yielded only 19 such patients, too few to investigate mortality as an outcome of interest. Additionally, we excluded patients who underwent concomitant posterior cervical spinal surgery or total disc arthroplasty, identified by the relevant CPT codes (22590, 22595, 22600, 22614, 22856, 22858, 22861, and 22864). We reviewed the International Classification of Diseases, Tenth Revision (ICD-10) codes assigned to the patients as principal diagnoses to further identify those undergoing surgery for degenerative diseases. Using ICD codes, patients with diagnoses of a fracture, neoplasm, or infection were also excluded. To avoid confounding from rare pathologies and ICD-10 coding errors, we excluded cases with ICD codes that were used fewer than 50 times in the total patient population.
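
For illustration, this selection logic can be expressed as a single filtering step. The sketch below uses hypothetical, simplified column names (cpt, wound_class, icd10, and so on); the actual NSQIP participant use file field names differ:

```python
import pandas as pd

# Hypothetical, simplified NSQIP-style column names for illustration only.
ACDF_CPT = {"22551", "22552", "22554", "22585"}
CONCOMITANT_CPT = {"22590", "22595", "22600", "22614",
                   "22856", "22858", "22861", "22864"}

def select_cohort(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the inclusion/exclusion logic described above to a raw table."""
    mask = (
        df["cpt"].isin(ACDF_CPT)
        & (df["elective"] == "Yes")
        & (df["anesthesia"] == "General")
        & df["specialty"].isin({"Neurosurgery", "Orthopedics"})
        & (df["emergency"] == "No")
        & (df["wound_class"] == 1)            # exclude wound classes 2-4
        & df["asa_class"].isin({1, 2, 3})     # exclude ASA 4, 5, unassigned
    )
    cohort = df.loc[mask]
    # Exclude concomitant posterior surgery / disc arthroplasty codes.
    cohort = cohort[~cohort["concurrent_cpt"].isin(CONCOMITANT_CPT)]
    # Drop principal diagnoses used fewer than 50 times in the cohort.
    counts = cohort["icd10"].value_counts()
    return cohort[cohort["icd10"].isin(counts[counts >= 50].index)]
```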

Predictor variables and outcomes of interest

Variables from the NSQIP database that would have been known preoperatively were included as predictors. These comprised (1) demographic information such as age, sex, race, Hispanic ethnicity, height, weight, and transfer status; (2) comorbidities and disease burden such as current smoking within one year, diabetes mellitus requiring therapy, dyspnea, ventilator dependency, history of severe chronic obstructive pulmonary disease (COPD), ascites within 30 days prior to surgery, congestive heart failure within 30 days prior to surgery, hypertension requiring medication, acute renal failure, currently requiring or on dialysis, disseminated cancer, presence of open wounds, steroid or immunosuppressant use for a chronic condition, malnourishment, bleeding disorders, preoperative transfusion of ≥ 1 unit of whole/packed RBCs within 72 h prior to surgery, the ASA classification, and functional status prior to surgery; (3) preoperative laboratory values such as serum sodium, blood urea nitrogen (BUN), serum creatinine, serum albumin, total bilirubin, serum glutamic-oxaloacetic transaminase (SGOT), alkaline phosphatase, white blood cell (WBC) count, hematocrit, platelet count, partial thromboplastin time (PTT), International Normalized Ratio (INR), and prothrombin time (PT); and (4) operative variables such as surgical specialty and single- versus multiple-level surgery.

The outcomes under investigation included prolonged LOS, non-home discharges, 30-day readmissions, and major complications. We defined prolonged LOS as a total LOS exceeding the 90th percentile of the entire patient population, which equated to ≥ 3 days. The discharge destination variable was dichotomized to delineate non-home discharges. Patients who required further levels of care post-discharge were classified as non-home discharges; this category incorporated destinations such as ‘Rehab’, ‘Skilled Care, Not Home’, ‘Separate Acute Care’, ‘Unskilled Facility Not Home’, and ‘Multi-level Senior Community’. Discharges to a ‘Facility Which Was Home’ were categorized as home discharges, in addition to discharges to ‘Home’. Patients were considered to have experienced major complications if they developed one or more of the following after surgery: deep incisional or organ/space surgical site infections, wound dehiscence, reintubation, pulmonary embolism, prolonged mechanical ventilation beyond 48 h, renal dysfunction or outright failure necessitating dialysis, cardiac arrest, myocardial infarction, hemorrhage requiring transfusion, deep venous thrombosis, sepsis, or septic shock. The NSQIP database also contained data on less serious postoperative complications, such as superficial surgical site infection, pneumonia, and urinary tract infection, but these were not classified as major for the purposes of this analysis. Patients with missing data for any of the four primary outcome measures were omitted from the related analyses.
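
A minimal sketch of how these outcome definitions could be derived in code is shown below, again with hypothetical column names (los_days, discharge_dest) standing in for the NSQIP fields:

```python
import pandas as pd

NON_HOME = {"Rehab", "Skilled Care, Not Home", "Separate Acute Care",
            "Unskilled Facility Not Home", "Multi-level Senior Community"}

def derive_outcomes(df: pd.DataFrame) -> pd.DataFrame:
    """Add binary outcome columns; column names here are illustrative."""
    out = df.copy()
    # Prolonged LOS: above the cohort's 90th percentile (>= 3 days here).
    cutoff = out["los_days"].quantile(0.90)
    out["prolonged_los"] = (out["los_days"] > cutoff).astype(int)
    # Non-home discharge: any destination implying a further level of care.
    out["non_home_discharge"] = out["discharge_dest"].isin(NON_HOME).astype(int)
    return out
```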

Data preprocessing and partition

We employed imputation to avoid any bias that might arise from excluding patients with missing data. After discarding variables with more than 25% missing data, the k-nearest neighbor imputation algorithm was used to fill in missing values in continuous variables [29]. For categorical variables, missing values were filled in with ‘Unknown’ or ‘Unknown/Other’.
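
The following sketch shows one way to implement this step with scikit-learn's KNNImputer; the neighbor count k is illustrative, as it is not reported in the text:

```python
import pandas as pd
from sklearn.impute import KNNImputer

def impute_missing(df: pd.DataFrame, continuous: list, categorical: list,
                   max_missing: float = 0.25, k: int = 5) -> pd.DataFrame:
    """k = 5 is an assumption; the paper does not report the neighbor count."""
    out = df.copy()
    # Discard continuous variables with more than 25% missing data.
    kept = [c for c in continuous if out[c].isna().mean() <= max_missing]
    out[kept] = KNNImputer(n_neighbors=k).fit_transform(out[kept])
    # Categorical variables: treat missingness as its own level.
    out[categorical] = out[categorical].fillna("Unknown")
    return out
```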

To provide adequate data for the phases of model development, validation, and testing, we divided the 2014 to 2020 data into three subsets in a 60:20:20 ratio for training, validation, and test sets, respectively. The training set was used for training the ML models, the validation set for fine-tuning hyperparameters and calibration, and the test set for evaluating the models’ performance.
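
In scikit-learn, this 60:20:20 partition can be achieved with two successive calls to train_test_split, as sketched below on placeholder data (stratification on the outcome is our assumption, not stated in the text):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))        # placeholder feature matrix
y = rng.integers(0, 2, size=1000)      # placeholder binary outcome

# 60:20:20 split; stratifying on the outcome preserves event rates
# across the training, validation, and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
```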

To address potential class imbalance in the training data, we employed the Synthetic Minority Over-sampling Technique (SMOTE) prior to model training. SMOTE counteracts skewed class distributions by artificially generating new examples belonging to the minority class, rather than duplicating existing samples [30]. This approach grows the number of instances from the under-represented class and has been shown to improve model performance compared to simply replicating minority samples. Applying SMOTE ensured adequate representation of all classes and avoided learning bias towards majority groups during the training process.
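
A minimal sketch of this step using the imbalanced-learn implementation of SMOTE, continuing from the split above:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# SMOTE is fit on the training split only, so no synthetic samples can
# leak into the validation or test sets used for tuning and evaluation.
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_train_bal))
```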

Model development and performance evaluation

We built our prediction models using five different ML algorithms: a transformer-based algorithm named TabPFN [31], a neural network-based approach called TabNet [32], two gradient boosting algorithms, XGBoost [33] and LightGBM [34], and a decision-tree-based ensemble algorithm, Random Forest [35]. To maximize the models’ discriminatory ability, we utilized the Optuna optimization library [36], employing the area under the receiver operating characteristic curve (AUROC) as the optimization criterion. We used the Tree-Structured Parzen Estimator sampler (TPESampler), a Bayesian optimization algorithm, to provide AUROC estimates that would guide the optimization process. The finalized prediction models were developed using the training sets and the hyperparameters optimized with Optuna; the optimized hyperparameters can be found in Supplementary Table 3. We applied Platt scaling for model calibration [37]. All analyses were performed in Python version 3.7.15 on the Google Colab platform.
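
Continuing the sketch, the Optuna loop below shows the general pattern for one of the algorithms (Random Forest); the search space and trial budget are assumptions for illustration, while the actual tuned values are in Supplementary Table 3:

```python
import optuna
from optuna.samplers import TPESampler
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def objective(trial):
    # Hypothetical search space; actual tuned ranges are not shown here.
    model = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 100, 1000),
        max_depth=trial.suggest_int("max_depth", 3, 20),
        min_samples_leaf=trial.suggest_int("min_samples_leaf", 1, 20),
        n_jobs=-1, random_state=42)
    model.fit(X_train_bal, y_train_bal)
    # AUROC on the validation set is the optimization criterion.
    return roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

study = optuna.create_study(direction="maximize", sampler=TPESampler(seed=42))
study.optimize(objective, n_trials=100)   # trial budget is an assumption

# Refit with the best hyperparameters, then apply Platt scaling
# (sigmoid calibration) using the validation set.
best = RandomForestClassifier(**study.best_params, n_jobs=-1, random_state=42)
best.fit(X_train_bal, y_train_bal)
calibrated = CalibratedClassifierCV(best, method="sigmoid", cv="prefit")
calibrated.fit(X_val, y_val)
```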

We conducted a thorough evaluation of our models’ performance, both visually and numerically. Visual assessment was accomplished through receiver operating characteristic (ROC) curves and precision-recall curves (PRC), while the numerical metrics used for classification performance included AUROC, balanced accuracy, weighted area under the PRC (AUPRC), weighted precision, and weighted recall. Calibration was evaluated using the Brier score.
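
These metrics can be computed as follows on the held-out test set; computing the weighted AUPRC as per-class average precision weighted by class prevalence is one plausible reading of “weighted AUPRC”, and the 0.5 threshold is illustrative:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             brier_score_loss, precision_score, recall_score,
                             roc_auc_score)

proba = calibrated.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)   # default threshold, for illustration

# Prevalence-weighted AUPRC: average the per-class average precision,
# weighted by each class's share of the test set.
prevalence = np.bincount(y_test) / len(y_test)
auprc_w = (prevalence[1] * average_precision_score(y_test, proba)
           + prevalence[0] * average_precision_score(1 - y_test, 1 - proba))

print({
    "AUROC": roc_auc_score(y_test, proba),
    "Balanced accuracy": balanced_accuracy_score(y_test, pred),
    "Weighted AUPRC": auprc_w,
    "Weighted precision": precision_score(y_test, pred, average="weighted"),
    "Weighted recall": recall_score(y_test, pred, average="weighted"),
    "Brier score": brier_score_loss(y_test, proba),
})
```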

We chose models for web application deployment based on their AUROC values. AUROC, a widely used performance metric for ML models, is particularly useful in binary classification tasks [38]. This measure assesses a model’s ability to differentiate between positive and negative samples across various classification thresholds. We selected AUROC as the primary measure for several reasons. First, it is not affected by class imbalance, making it a suitable choice for datasets with uneven class distributions. Second, it considers the complete range of classification thresholds, providing a thorough evaluation of model performance across diverse operating points. Third, AUROC quantifies the model’s ability to correctly rank instances irrespective of the chosen classification threshold. By distilling a model’s performance into a single value, AUROC simplifies comparison among different models or algorithms. As a result, it offers a reliable reflection of a model’s discriminative power and is thus an appropriate metric for model evaluation and selection across various applications.

To enhance our models’ interpretability, we used SHapley Additive exPlanations (SHAP) to determine the relative importance of predictor variables [39]. In addition, we used partial dependency plots (PDPs) to display the effect of individual variables on the predictions of the top-performing models.
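
A brief sketch of the SHAP computation for the fitted Random Forest model follows; note that the per-class indexing convention depends on the SHAP version:

```python
import shap

# TreeExplainer gives exact, fast SHAP values for tree ensembles.
explainer = shap.TreeExplainer(best)
shap_values = explainer.shap_values(X_test)

# For binary classifiers, older SHAP versions return one array per class;
# index 1 selects the positive (adverse-outcome) class.
shap.summary_plot(shap_values[1], X_test, plot_type="bar", max_display=15)
```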

Web application

We developed a web application to allow users to make individual patient predictions. The top-performing models for each outcome were incorporated into this application. The source code for implementing these models online can be found on the Hugging Face platform, which is a community-friendly site for sharing ML models. We have also included Supplementary Video 1 to demonstrate the web application’s functionality. The web application can be accessed via this link: https://huggingface.co/spaces/MSHS-Neurosurgery-Research/NSQIP-ACDF.
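
Hugging Face Spaces commonly serve such models through lightweight frameworks like Gradio. The sketch below is an illustrative, stripped-down interface, not the deployed application’s actual code; the model file name and the three inputs are placeholders:

```python
import gradio as gr
import joblib

# 'model.joblib' and the three inputs are placeholders; the deployed app
# collects the full preoperative feature set described in the Methods.
model = joblib.load("model.joblib")

def predict(age: float, bmi: float, asa_class: int) -> str:
    proba = model.predict_proba([[age, bmi, asa_class]])[0, 1]
    return f"Predicted risk: {proba:.1%}"

demo = gr.Interface(
    fn=predict,
    inputs=[gr.Number(label="Age"), gr.Number(label="BMI"),
            gr.Number(label="ASA class")],
    outputs=gr.Textbox(label="Prediction"),
    title="ACDF outcome risk calculator (illustrative)")

if __name__ == "__main__":
    demo.launch()
```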

Descriptive statistics

For continuous variables with a normal distribution, we reported means (± standard deviations); for those with a non-normal distribution, we presented medians (interquartile ranges). Categorical variables were reported as patient counts with percentages.

Results

A total of 63,912 patients met the inclusion criteria. Exclusion criteria were applied sequentially, and 6,053 patients were excluded (Fig. 2). After outcome-specific exclusion criteria were applied, 57,760 patients were included in the analysis for prolonged LOS [n = 6,386 (11.1%) with prolonged LOS], 57,780 for non-home discharges [n = 1,913 (3.3%) with non-home discharges], 57,790 for 30-day readmissions [n = 1,694 (2.9%) with 30-day readmissions], and 57,800 for major complications [n = 794 (1.4%) with major complications]. Characteristics of the patient population (n = 57,859) before the outcome-specific exclusion criteria were applied are presented in Table 1.

Fig. 2 Patient selection flowchart

Table 1 Patient characteristics

Performance evaluation indicated that the top-performing models for each outcome were those built with the Random Forest algorithm. The Random Forest models yielded AUROCs of 0.776 [95% confidence interval (CI), 0.766–0.792], 0.846 (95% CI, 0.809–0.855), 0.775 (95% CI, 0.731–0.791), and 0.747 (95% CI, 0.702–0.779) in predicting prolonged LOS, non-home discharges, 30-day readmissions, and major complications, respectively. These results indicate good discrimination between patients who had non-home discharges and those who did not, and fair discriminatory ability in differentiating patients who experienced prolonged LOS, 30-day readmissions, and major complications [40]. Detailed information on these performance metrics is displayed in Table 2. Figure 3 presents radar plots, each corresponding to one of the four outcomes of interest. These charts serve as an instrument for multidimensional visualization, with each of the five axes standing for a separate performance indicator; the position on each axis signifies the model’s performance on that indicator. Consequently, these radar plots enable a comparative analysis of model performance across metrics.

Table 2 Performance metrics of the models
Fig. 3 Algorithms’ radar plots for the outcomes (A) prolonged length of stay, (B) non-home discharges, (C) 30-day readmissions, and (D) major complications

Figures 4 and 5, respectively, illustrate the ROC and PR curves for the four outcomes, while Fig. 6 presents the SHAP bar plots for each outcome’s top-performing model. SHAP bar plots for the other algorithms for each outcome are available in Supplementary Figs. 1 through 4. SHAP bar plots give a general overview of the significance of features in a model. Each bar represents the importance of a feature, with its length corresponding to the average absolute SHAP value across all instances. This measure of importance shows the average effect a feature has on the model’s prediction. The features are arranged according to their significance, with the most influential at the top.

Fig. 4 Algorithms’ receiver operating characteristics for the outcomes (A) prolonged length of stay, (B) non-home discharges, (C) 30-day readmissions, and (D) major complications

Fig. 5 Algorithms’ precision-recall curves for the outcomes (A) prolonged length of stay, (B) non-home discharges, (C) 30-day readmissions, and (D) major complications

Fig. 6 The 15 most important features and their mean SHAP values for the Random Forest models predicting the outcomes (A) prolonged length of stay, (B) non-home discharges, (C) 30-day readmissions, and (D) major complications

Moreover, to better understand how individual feature values influence the models’ predictions, we refer to Supplementary Figs. 5–8, which present the PDPs for the models built with the Random Forest algorithm, one for each of the four outcomes of interest. As an illustration, Supplementary Fig. 5 displays a non-linear curve for ‘Age’, indicative of a non-linear association between the feature ‘Age’ and the outcome prolonged LOS. This underscores the advantage of ML algorithms in capturing non-linear relationships between variables and outcomes, a strength that traditional regression algorithms may not possess.
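
Such PDPs can be generated directly from a fitted model with scikit-learn, assuming the test features are held in a pandas DataFrame containing an ‘Age’ column:

```python
from sklearn.inspection import PartialDependenceDisplay

# Assumes X_test is a pandas DataFrame with an 'Age' column; a curved
# partial-dependence profile indicates a nonlinear association with risk.
PartialDependenceDisplay.from_estimator(best, X_test, features=["Age"])
```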

Discussion

The goal of our study was to develop ML models capable of predicting short-term adverse postoperative outcomes following ACDF. Furthermore, to make our models more accessible, we developed a web application that allows healthcare professionals to input patient data and receive predicted risks for each outcome. This web application has the potential to serve as a valuable tool for clinicians by facilitating the estimation of a patient’s risk of adverse outcomes following ACDF. These models can aid clinicians in identifying patients at high risk of adverse outcomes following ACDF, thus enabling more informed patient counseling prior to the procedure.

When interpreting the metrics used to assess model performance, it is crucial to proceed with care and to understand the implications of imbalanced datasets for ML classification tasks. We used metrics such as balanced accuracy, weighted precision, weighted recall, and weighted AUPRC to evaluate our models’ classification performance. These metrics take the data’s class distribution into account, assigning more importance to the minority class [41,42,43]. This facilitates a fair evaluation of a model’s performance across both classes and a broader perspective on the model’s effectiveness. Conversely, the unweighted versions of these metrics may be unreliable on imbalanced datasets, as they overlook the class distribution and can present a misleading impression of good performance by neglecting the minority class. Furthermore, interpreting AUPRC can be more complex than the other area-under-the-curve metric, AUROC, due to its distinctive baseline. AUROC employs a baseline of 0.5, depicting a random classifier’s performance, whereas the baseline for AUPRC is the proportion of positive examples in the dataset [44]. This can result in significantly lower AUPRC values than AUROC, especially for datasets with a small fraction of positive examples, like many real-world medical datasets. As a consequence, although AUPRC may be more relevant for a particular problem, it is reported less frequently than AUROC because of its lower absolute values. For instance, in our study, the weighted AUPRC for the Random Forest model predicting prolonged LOS was 0.473 (95% CI, 0.464–0.482), while the prolonged LOS rate, representing the baseline, was 0.112. Lastly, we evaluated the models’ calibration using the Brier score, a measure of the average squared disparity between predicted and actual probabilities [45, 46]. A well-calibrated model will have a Brier score close to zero, implying that the predicted probabilities align closely with the actual event rates.
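
This baseline distinction is easy to verify numerically: for an uninformative classifier, AUROC hovers around 0.5 while average precision (the AUPRC estimate) collapses to the event prevalence, as in this small simulation:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.112).astype(int)  # 11.2% event rate
scores = rng.random(100_000)                   # uninformative classifier

print(roc_auc_score(y, scores))            # ~0.5, independent of prevalence
print(average_precision_score(y, scores))  # ~0.112, i.e., the event rate
```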

One interesting finding was that the Random Forest models outperformed more modern algorithms like XGBoost and LightGBM in predictive performance across all outcomes. Despite having been available for many years [35], tree-based ensemble methods like Random Forest remain robust and powerful approaches for prediction problems. The Random Forest algorithm creates numerous randomized decision trees and aggregates their predictions, allowing it to capture complex nonlinear relationships and high-order interactions between variables [47]. In addition, its ensemble nature makes it resistant to overfitting [48]. In contrast, more recent boosting methods like XGBoost [33] and LightGBM [34] also build ensembles of trees, but do so sequentially, focusing on misclassified examples in each iteration. While this can improve predictive accuracy, it may also increase overfitting risk compared to the Random Forest algorithm [49]. The superior performance of Random Forest in this study suggests that the additional hyperparameters and complexity of boosting methods did not provide an advantage over simpler Random Forest ensembles. The nonlinear effects and variable interactions present in our dataset appear well-suited to tree-based models, and the Random Forest algorithm effectively capitalized on these properties [50].

The performance metrics for the ML algorithms presented in this study align with recent research findings. The specific outcomes selected for this study have not previously been examined together within a single study using ML algorithms. However, several publications have explored the predictive performance of ML algorithms for postoperative outcomes following ACDF surgery using diverse data sources. For example, Gowd et al. employed ML models based on conventional comorbidity indices to compare predictive models for postoperative complications following ACDF surgery [21]. In their study, the logistic regression algorithm performed best for predicting any adverse event (AUROC = 0.73), transfusion (AUROC = 0.90), surgical site infection (AUROC = 0.63), and pneumonia (AUROC = 0.80), while gradient boosting trees performed best for predicting extended LOS (AUROC = 0.73). Notably, their study used ‘operative time’ as a predictor variable, and it was the most heavily weighted variable for the prediction of any adverse event, extended LOS, and transfusion. Our study deliberately excluded variables, such as total operative time, that would not be known prior to surgery [51]. It should also be kept in mind that the length of the procedure may be a mediator of undesirable outcomes rather than their cause [52]. Our study focuses on the preoperative prediction of adverse outcomes.

Rodrigues et al. queried the IBM MarketScan Commercial Claims and Encounters Database and Medicare Supplement from 2007 to 2016 to identify 176,816 patients who underwent ACDF [22]. Some of the variables incorporated in that study are not available in the NSQIP database, such as operative characteristics, including bone morphogenetic protein use, anterior cervical plating, and allograft or cage implants, and preoperative symptoms, including weakness, stiffness, or cervicalgia. Several of these variables, namely myelopathy, human immunodeficiency virus (HIV) status, weakness, and stiffness, were among those with the highest attention magnitudes for predicting 90-day readmissions, two-year reoperations, and 90-day complications. For the investigated outcomes, the deep neural network-based models in the study by Rodrigues et al. achieved AUROCs between 0.671 and 0.832. Similarly, Khazanchi et al. investigated the predictive utility of ML and deep learning algorithms for postoperative healthcare utilization, including 90-day readmissions, postoperative LOS, and non-home discharge, in patients undergoing ACDF [25]. They utilized data from a multisite academic center and included a robust set of patient features, such as demographic information, medical/surgical history, operative characteristics, and preoperative lab values. The highest-performing model in their study was the Balanced Random Forest algorithm, with an AUROC of 0.70 for 90-day readmissions, 0.84 for non-home discharge, and 0.74 for extended LOS. Despite the reported performance and the availability of granular data, the implications of these studies are largely exploratory due to the lack of an accessible tool for practical use.

Previously, Russo et al. proposed the novel ACDF Predictive Scoring System (APSS) using conventional statistics and ML to forecast LOS following one- or two-level ACDF surgery based on patient-specific preoperative characteristics and comorbidities [23]. The best-performing APSS model had an AUROC of 0.68. Although that study provides a form of tool for clinicians, it is limited by a small sample size of 1,506 patients and lower performance metrics. Arvind et al. also employed ML algorithms to predict complications following ACDF surgery using the NSQIP database [24]. In their study, patients were excluded from the analysis only due to incomplete data, and no other exclusion criteria were employed. Although case deletion is the most expedient method for handling missing data, it yields unbiased estimates only if the data are missing completely at random [53]. Likewise, not excluding emergency procedures, infections, tumor cases, trauma, and concomitant posterior approach surgeries increases the potential for preoperative confounding related to surgical indications. In contrast, our model focuses specifically on predicting outcomes for degenerative cervical disease cases. Unfortunately, none of the aforementioned studies provided the source code for data preprocessing and classification models, limiting the reproducibility of their results. Furthermore, none of these studies offered a publicly accessible tool. In contrast, our web application provides interpretable predictions for four different outcomes, bridging the gap between complex ML predictions and their evaluation by healthcare professionals.

Our ML models and the associated web application provide individualized, quantitative estimates of unfavorable postoperative outcomes after ACDF. This approach represents a significant advancement over generalized risks derived from studies averaging across diverse populations, as well as over the common practice of communicating risks qualitatively, with any individual quantitative evaluation based only on the clinician’s personal experience. Relying solely on personal experience, however, is constrained by inherently limited patient populations and potential subjective biases. The personalized predictions from our models can be used preoperatively to gauge prognosis during patient counseling, thus contributing to patient care. They allow healthcare professionals to identify patients at risk of certain adverse outcomes, prioritize their treatment, and plan for discharge requirements. Although the current web application provides a convenient interface for estimating the likelihood of adverse short-term postoperative outcomes following ACDF, it is intended as a research tool and should not currently guide clinical recommendations. Further validation in diverse patient cohorts across institutions is essential to confirm its predictive accuracy. We hope this calculator serves as a first step toward more comprehensive models that integrate additional factors, such as imaging findings and more granular clinical data, to further refine predictive accuracy and clinical relevance. As with any prediction tool, the estimates generated must be considered in the full context of each patient to personalize surgical counseling and planning.

Further limitations are similar to those described for other online prognostic models [52]. First, the patients in the ACS-NSQIP database may not be entirely representative of the general ACDF population. There may be biases related to the hospitals included in the database, as these hospitals may have above-average infrastructure and/or resources. Additionally, the patients in the database may differ in health status, age, or socioeconomic background from the general population. Although Huffman et al. demonstrated that the ACS-NSQIP database is a dependable data source for examining postsurgical outcomes, validating its usage, these limitations can affect the generalizability of our results [54]. Second, studies using a large clinical database are always influenced by coding errors and other inaccuracies. The NSQIP database is frequently used, but only a few studies have examined its coding accuracy. CPT codes for neurosurgical procedures contain numerous internal inconsistencies, according to Rolston et al. [55]. Furthermore, we did not compare our models’ performance to existing comorbidity indices or conduct external validation or user satisfaction analyses within the scope of the current study; these are important aspects to consider in future studies. Finally, we did not aim to identify causal relationships between patient characteristics and outcomes, and we do not suggest that our models can be used for causal inference or that they provide information about the mechanisms underlying the observed associations. We discourage causal interpretations based on the results of the current study.

In conclusion, our study advances the prediction of postoperative outcomes in patients undergoing ACDF surgery through the application of sophisticated ML methods. A key contribution of our work is the development of a user-friendly web application designed to demonstrate the developed models’ practical utility. Our findings suggest that ML algorithms can serve as a valuable auxiliary tool for patient risk stratification in ACDF surgery, with the potential to predict a variety of postoperative outcomes. This approach could play a critical role in counseling ACDF surgery patients, shifting the clinical approach towards a more patient-centric, data-driven model. Our study thus represents a substantial step forward in the field of precision medicine.