Background

Cervical spondylotic myelopathy (CSM) is a common degenerative neurological condition that occurs when the spinal cord is compressed due to degenerative changes or traumatic injury to the cervical spine [1, 2]. If left untreated, 20–60% of patients with CSM may experience neurological deterioration [3, 4]. Treatment options for CSM include anterior and/or posterior approaches to the spine [5, 6]. The anterior approach is typically used for patients with anterior spinal cord issues or cervical kyphosis, while the posterior approach is more commonly employed for treating multilevel stenosis and dorsal pathology [7,8,9].

There are two main posterior approaches for treating CSM: laminectomy with or without posterior spinal fusion (LPSF) and laminoplasty [10, 11]. The choice between these approaches depends on the individual case and on various clinical and radiological factors [7, 8]. LPSF is often recommended for patients experiencing axial neck pain, reduced cervical lordosis, and significant translational movement with flexion-extension [12, 13].

Some studies have examined predictors of clinical outcomes in CSM patients undergoing LPSF, and factors such as patient age, smoking, preoperative neurological status, symptom duration, and the presence of comorbidities have been suggested to influence outcomes [12, 14, 15].

Machine learning models (MLMs) have become a valuable tool for analyzing the impact of different variables [16,17,18]. These models can identify patterns and relationships in data sets and make predictions based on flexible data relationships without relying on specific equations [19, 20]. MLMs have increasingly been used in biostatistics and medicine to categorize and predict patient outcomes [18, 20,21,22]. However, there is limited research on using MLMs to predict factors influencing the clinical outcomes of CSM patients undergoing LPSF. This study seeks to evaluate the effectiveness of MLMs in predicting these factors.

Methods

Study population

We conducted a retrospective analysis of 329 consecutive patients diagnosed with cervical spondylotic myelopathy (CSM) based on radiological findings who underwent cervical laminectomy and fusion at our institution. The study included patients treated between July 2017 and July 2023, with approval from our institute’s ethics committee and written informed consent from all participants. Patients with a history of prior cervical surgery, cervical spine tumors, or congenital cervical deformities were excluded from the study.

Evaluation of clinical outcomes

We assessed patients’ neurological status using the modified Japanese Orthopaedic Association (mJOA) scale before surgery and at the last postoperative follow-up. The recovery rate of myelopathy was calculated using the formula: Recovery Rate = (Postoperative JOA score - Preoperative JOA score) / (17 - Preoperative JOA score) × 100%. Previous research has indicated that the minimum clinically important difference (MCID) for the JOA score recovery rate is 52.8% [14, 23]. Patients were categorized into two groups: the good group (those achieving the MCID) and the poor group (those not achieving the MCID). Age, gender, BMI, smoking status, diabetes, number of laminectomy levels, MRI signal changes, symptom duration, preoperative JOA scale, Pavlov ratio, cervical curvature, and range of motion (ROM) were compared between the two groups as potential predictors of poor clinical outcome.
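The recovery-rate formula and the MCID grouping described above can be illustrated with a short sketch. It is written in Python for illustration only and is not the authors' code. The function names and the example scores are hypothetical; the maximum score of 17 and the 52.8% threshold are taken from the text.

```python
MAX_SCORE = 17.0       # denominator constant used in the formula in the text
MCID_THRESHOLD = 52.8  # recovery rate (%) cited in the text as the MCID


def recovery_rate(pre_score: float, post_score: float) -> float:
    """Return the recovery rate in percent, following the formula in the text."""
    if pre_score >= MAX_SCORE:
        # The formula is undefined when there is no room for improvement.
        raise ValueError("preoperative score must be below the maximum score")
    return (post_score - pre_score) / (MAX_SCORE - pre_score) * 100.0


def outcome_group(pre_score: float, post_score: float) -> str:
    """Assign 'good' if the recovery rate reaches the MCID, otherwise 'poor'."""
    rate = recovery_rate(pre_score, post_score)
    return "good" if rate >= MCID_THRESHOLD else "poor"


# Hypothetical patient: score improves from 11 to 15.
example_rate = recovery_rate(11, 15)    # (15 - 11) / (17 - 11) * 100
example_group = outcome_group(11, 15)
```

In this hypothetical case the patient recovers four of the six points that were available, which places the patient in the good group.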

Radiological assessment

Radiological evaluations of the cervical spine, including plain radiographs, cervical computed tomography (CT) scans, and cervical MRI scans before and after surgery, were conducted. Cervical spinal curvature based on Cobb’s method and the canal-body ratio (Pavlov ratio) were measured preoperatively and postoperatively. Cervical ROM was calculated by summing the cervical angles in maximal flexion and extension on lateral radiographs. Rates of loss of cervical curvature and ROM were assessed using specific formulas [12, 14, 24].

Loss of lordosis (%) = (preoperative cervical curvature - final-visit cervical curvature) / (preoperative cervical curvature) × 100%.

Loss of cervical spine ROM (%) = (preoperative ROM - final-visit ROM) / (preoperative ROM) × 100%.
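
As a worked illustration of the two loss rates, the sketch below applies the same percentage-loss calculation to curvature and to ROM. It is a Python sketch, not the authors' code. It assumes each loss is normalized by its own preoperative value; the function name and the angles are hypothetical.

```python
def percent_loss(preoperative: float, final_visit: float) -> float:
    """Percentage loss relative to the preoperative measurement."""
    if preoperative == 0:
        raise ValueError("preoperative value must be non-zero")
    return (preoperative - final_visit) / preoperative * 100.0


# Hypothetical measurements in degrees (not taken from the study data).
loss_of_lordosis = percent_loss(preoperative=20.0, final_visit=15.0)
loss_of_rom = percent_loss(preoperative=40.0, final_visit=30.0)
```

A negative result would indicate a gain rather than a loss in curvature or ROM at the final visit.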

Signal changes within the spinal cord were identified using T1- and T2-weighted MRI images. All radiological measurements were performed by a senior author who was blinded to the clinical outcomes.

Statistical analysis

Data analysis was carried out using SPSS 23 software. Results were presented as mean ± standard deviation. Statistical tests including Mann-Whitney U test, independent t-test, and Pearson’s chi-square test were used to compare variables between the good and poor outcome groups. Additionally, binary logistic regression analysis was conducted to identify independent associations between prognostic factors and clinical outcomes. A significance level of p < 0.05 was considered statistically significant.

Model development

In this research, we employed four machine learning models, random forest (RF), logistic regression (LR), support vector machine (SVM), and k-nearest neighbors (k-NN), to predict poor clinical outcome in patients with CSM treated with LPSF. Additionally, the LR and RF models were used to analyze factors associated with poor clinical outcome. Each model was trained before assessment. The dataset was split into training and test sets at an 80:20 ratio; the training set was used to train the models, while the test set was used to evaluate model performance. Features that were significant in the univariate analysis served as inputs for the machine learning techniques.
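
A minimal sketch of an 80:20 random split is given below in Python, for illustration only. The text does not state the random seed, whether the split was stratified by outcome, or the exact group sizes, so the seed, the unstratified shuffle, and the rounding rule used here are all assumptions, and the function name is hypothetical.

```python
import random


def split_indices(n_samples: int, test_fraction: float = 0.2, seed: int = 0):
    """Shuffle sample indices and split them into training and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # seeded for reproducibility (assumed)
    n_test = round(n_samples * test_fraction)  # rounding rule is an assumption
    return indices[n_test:], indices[:n_test]


# 329 is the cohort size reported in the text.
train_idx, test_idx = split_indices(329)
```

Under these assumptions the cohort would yield 263 training and 66 test patients. With 62 patients in the poor outcome group, a stratified split could be preferable so that both sets keep a similar proportion of poor outcomes.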

Decision tree (DT) and random forest (RF) models

A decision tree (DT) is a tree-like structure that makes decisions based on input data, with the root node posing the initial question. Each node is connected to subsequent child nodes through branches, determining the best-split feature using a split criterion. The binary DT divides each parent node into two child nodes until all observations are classified, leading to a leaf node or outcome. Random forest (RF) is an ensemble of multiple DTs. Each tree independently predicts the outcome and votes for the most likely class. RF assigns the outcome based on the majority vote, leveraging multiple trees to make accurate predictions by capturing complex relationships. In this study, 500 DTs were utilized to construct the RF model, known for handling complex data and mitigating overfitting in classification and regression tasks [25, 26].
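
The split criterion and the voting step can be sketched as follows. This is an illustrative Python sketch rather than the randomForest implementation used in the study. It assumes Gini impurity as the split criterion, and the labels and helper names are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left_labels, right_labels):
    """Size-weighted impurity of the two child nodes produced by a binary split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini_impurity(left_labels) + (
        len(right_labels) / n
    ) * gini_impurity(right_labels)


def majority_vote(tree_predictions):
    """Forest prediction: the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical node with two good and two poor outcomes.
parent = ["good", "good", "poor", "poor"]
perfect_split = split_impurity(["good", "good"], ["poor", "poor"])
forest_class = majority_vote(["poor", "good", "poor"])
```

A tree prefers the candidate split with the lowest weighted child impurity; here the hypothetical split separates the classes completely, so its impurity drops from 0.5 in the parent node to 0.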

Logistic regression (LR)

Logistic regression (LR) is a widely used predictive model for clinical decision-making and binary outcome classification. The LR algorithm generates a sigmoid curve to depict the relationship between inputs and outcomes, mapping inputs to probabilities (ranging from 0 to 1) that indicate the likelihood of belonging to one of two classes. By employing the logistic regression model, calculating the probability of each data point belonging to a specific outcome is straightforward. Following the determination of probabilities for each individual’s class membership, individuals are assigned to the group with the highest probability.
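
The mapping from inputs to a probability and then to a class can be sketched as below. This is an illustrative Python sketch, not the fitted model from the study. The coefficient values, the single feature, and the 0.5 threshold are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function mapping any real value to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(features, weights, intercept):
    """Probability of the positive class for one observation."""
    linear_term = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(linear_term)


def predict_class(features, weights, intercept, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return 1 if predict_probability(features, weights, intercept) >= threshold else 0


# Hypothetical one-feature model: weight 2.0, intercept -1.0.
p_high = predict_probability([1.0], [2.0], -1.0)   # linear term = 1.0
p_low = predict_probability([0.0], [2.0], -1.0)    # linear term = -1.0
```

With two classes, assigning each individual to the class with the highest probability is equivalent to using a threshold of 0.5.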

Support vector machine (SVM)

Support vector machine (SVM) is a machine learning algorithm used for regression and classification tasks, finding applications in various fields such as chemometrics, bioinformatics, and biometrics. The core principle involves establishing an optimal decision boundary, represented as a line, to separate data points and minimize errors. In a two-dimensional plane, each dimension corresponds to an attribute or feature, with observations depicted as data points. The algorithm aims to create a hyperplane that effectively separates one group of points from another in a linear manner. When data is linearly separable, hyperplanes with maximum margins between points and the hyperplane are ideal for accurate predictions. In cases where data is not linearly separable, a kernel function is employed to map data to a higher-dimensional space, enabling linear separation without altering the original data. In this study, the radial basis function (RBF) kernel, known for its generalizability, was utilized [27, 28].
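
The role of the RBF kernel can be illustrated with a short sketch. It is written in Python for illustration and is not the e1071 implementation used in the study. The support vectors, coefficients, intercept, and gamma value are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma: float) -> float:
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def svm_decision(x, support_vectors, dual_coefs, intercept, gamma):
    """Kernel SVM decision value; its sign gives the predicted side of the boundary.

    Each dual coefficient is the product of a support vector's weight and its
    class label (+1 or -1).
    """
    return intercept + sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )


# Hypothetical model with one support vector per class.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
dual_coefs = [1.0, -1.0]
near_first = svm_decision([0.1, 0.1], support_vectors, dual_coefs, 0.0, gamma=0.5)
near_second = svm_decision([1.9, 1.9], support_vectors, dual_coefs, 0.0, gamma=0.5)
```

The kernel equals 1 for identical points and decays toward 0 as points move apart, so each support vector mainly influences predictions in its own neighborhood. This is what allows a boundary that is linear in the implicit feature space to appear nonlinear in the original one.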

K-nearest neighbors (k-NN)

The k-nearest neighbors (k-NN) algorithm is a straightforward supervised machine learning technique used for classification and regression. Its objective is to assign a data point to a class based on the nearest points in the training dataset. The predicted class is determined by the majority class among the nearest neighbors. For regression, the average value of the neighboring points is used. To classify new data, the algorithm determines the number of nearest neighbors (k), calculates the distances between the new data and the training data points, ranks these distances, and classifies the new data by majority vote among the k nearest points.
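
The classification steps listed above can be sketched as follows. This is an illustrative Python sketch, not the implementation used in the study. It assumes Euclidean distance and k = 3, neither of which is stated in the text, and the training points and labels are hypothetical.

```python
import math
from collections import Counter


def knn_classify(query, train_points, train_labels, k=3):
    """Classify a query point by majority vote among its k nearest training points."""
    # Compute the distance from the query to every training point,
    # then rank the (distance, label) pairs from nearest to farthest.
    ranked = sorted(
        (math.dist(query, point), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in ranked[:k]]
    # The predicted class is the most frequent label among the k neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-feature training data.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
train_labels = ["good", "good", "good", "poor", "poor"]
prediction_a = knn_classify((0.5, 0.5), train_points, train_labels)
prediction_b = knn_classify((5.5, 5.0), train_points, train_labels)
```

Because distances depend on feature scale, variables measured in different units (for example, age in years and an mJOA score) are commonly standardized before k-NN is applied. Using an odd k avoids tied votes when there are two classes.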

Performance evaluation

The performance of the predictive models was assessed using accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Additionally, the area under the curve (AUC) of the receiver operating characteristic (ROC) was employed to evaluate the models’ ability to predict poor clinical outcome [29, 30].
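
How these metrics follow from a confusion matrix can be sketched as below. This is an illustrative Python sketch, not the caret routine used in the study. The labels, predictions, and scores are hypothetical, and the positive class (1) is assumed to denote a poor outcome.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Safe division that returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, specificity, PPV, and NPV from binary predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


def roc_auc(y_true, scores, positive=1):
    """AUC as the probability that a positive case outscores a negative case."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute the AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical test set: 3 poor outcomes (1) and 5 good outcomes (0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.1, 0.2, 0.3, 0.35, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The first five metrics depend on the chosen probability threshold (0.5 in this sketch), whereas the AUC summarizes ranking quality across all thresholds.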

Software

For statistical analysis, SPSS version 23 was used to present descriptive and inferential statistics, as well as to conduct univariate and multivariate analyses. The randomForest package was employed for fitting the RF model, the e1071 package for SVM fitting, and the caret package for calculating performance metrics. These packages were used in R version 4.0.3.

Results

Table 1 summarizes the demographic characteristics of 329 patients who underwent posterior cervical laminectomy and fusion. The cohort comprised 185 males (56.2%) and 144 females (43.8%), with a mean age of 64.23 ± 7.21 years and an average follow-up duration of 17.86 ± 1.74 months. Among the patients, 137 individuals (41.6%) underwent ≤ 3 levels of cervical laminectomy and fusion, while 192 cases (58.4%) underwent > 3 levels (Tables 1 and 2). Table 3 presents various variables and clinical outcomes. The favorable outcome group included 267 patients with a JOA score recovery rate ≥ 52.8%, whereas the poor outcome group comprised 62 patients with a JOA score recovery rate < 52.8%. A statistically significant improvement in mJOA score was observed at the final follow-up (P < 0.05). Table 2 displays the mean and standard deviations of baseline and final values of radiological and clinical characteristics.

Table 1 Descriptive characteristics of the sample
Table 2 Mean and standard deviation of quantitative variables
Table 3 Relationship between qualitative variables and clinical outcomes

Predictors of poor clinical outcome based on univariate analysis

In our investigation, age, preoperative mJOA scale, symptom duration, and MRI signal changes were identified as predictors of poor clinical outcome in the univariate analysis (p < 0.05) (Tables 3 and 4). No association was found between clinical outcome and gender, number of laminectomy levels, smoking status, diabetes mellitus, BMI, preoperative Pavlov ratio, preoperative cervical curvature, and preoperative cervical spine range of motion (Tables 3 and 4).

Table 4 Relationship between clinical outcomes and quantitative variables

Predictors of poor clinical outcome based on multivariate analysis

Binary logistic regression analysis revealed that age (odds ratio [OR] 2.08; 95% confidence interval [95% CI] 1.47–2.54; P = 0.013), preoperative mJOA scale (OR 3.52; 95% CI 2.83–4.56; P < 0.001), and symptom duration (OR 1.37; 95% CI 1.01–2.01; P = 0.031) were independent predictors of poor clinical outcome (Table 5).

Table 5 Binary Logistic Regression Analysis

Each machine learning model utilized feature selection to assess the independent significance of risk factors. According to the mean decrease in the Gini index, the RF model identified preoperative mJOA scale, age, symptom duration, and MRI signal changes as the most important variables, in descending order. The study evaluated the predictive accuracy of the LR, RF, SVM, and k-NN models for poor clinical outcomes. RF demonstrated the highest accuracy at 0.922, followed by SVM at 0.901, k-NN at 0.887, and LR at 0.876. RF also showed superior sensitivity and specificity compared to the other models (sensitivity = 0.851, specificity = 0.944). LR, SVM, and k-NN predicted poor clinical outcomes with negative predictive values (NPVs) of 0.849, 0.803, and 0.794, respectively. The AUC values for RF, LR, SVM, and k-NN were 0.905, 0.827, 0.851, and 0.883, respectively (Table 6).

Table 6 Evaluation criteria for comparison performance of machine learning models (LR, RF, SVM and k-NN)

Discussion

Our results showed that age, preoperative mJOA scale, and duration of symptoms were predictors of poor clinical outcome. Several studies have evaluated the relationship between the preoperative severity or duration of myelopathy symptoms and the clinical outcome of patients with CSM [2, 14]. Most of the evidence indicates that the severity of preoperative myelopathy and the duration of myelopathy symptoms have significant predictive value for the clinical outcomes of these patients [8, 15]. However, some studies reported no significant relationship between the postoperative modified Japanese Orthopaedic Association (mJOA) or JOA recovery rate and the severity of baseline myelopathy symptoms [31]. In a retrospective study, Gao et al. assessed the clinical outcome of 145 consecutive patients undergoing surgery for CSM, with a mean follow-up of 5 years. Their results showed that subjects with a preoperative JOA of ≤ 9 were 4.84 times more likely to exhibit a “fair” outcome (< 50% recovery rate) than those with a JOA > 9 [32]. Furthermore, Pumberger et al. showed that cases with less severe myelopathy on the Nurick grading system (≤ 3) were more likely to achieve a grade of 0, 1, or 2 after surgery than those with a baseline grade of ≥ 4. Moreover, they reported that cases with a symptom duration of less than 1 year were 4.8 times more likely to improve and 14 times more likely to return to a Nurick grade of 0 after surgery than subjects with a symptom duration of more than 1 year [33].

The impact of age on the clinical outcome of patients with CSM remains controversial [5]. Our results showed that advanced age was associated with poor outcome. Although the majority of evidence demonstrated a significant predictive value for age in patients with CSM, some studies revealed no association between age and outcomes in terms of mJOA, Nurick, and SF-36 scores [5, 34]. Based on an analysis of the CSM-North America and CSM-International datasets, Tetreault et al. reported that patients with advanced age were less likely to achieve an mJOA score ≥ 16 at 12 months or to achieve the MCID after surgery for CSM [10]. One hypothesis that could explain this finding is that people who are younger and less severely affected have minor neuropathologic changes in the spinal cord [10, 14].

The objective of this study was to employ machine learning models to predict factors associated with poor clinical outcomes in patients with CSM who underwent LPSF. The results presented in Table 6 indicate that all machine learning models performed well, with random forest (RF) demonstrating superior performance across all criteria in predicting poor clinical outcome with the least amount of error. RF is an ensemble learning method that combines multiple decision trees to make predictions, and several characteristics contribute to its superior performance [25, 35]. First, the ensemble approach helps mitigate overfitting and enhances the model’s generalization ability by combining predictions from different subsets of the data. Second, RF provides a measure of variable importance, identifying the relative contribution of each input variable to the predictions. This feature aids in identifying influential factors associated with poor clinical outcome. Additionally, RF can capture complex nonlinear relationships, can handle outliers and missing data, and does not assume a specific data distribution, making it suitable for analyzing complex datasets without strict assumptions [35, 36]. All models demonstrated acceptable performance in terms of the area under the curve (AUC), yielding reliable predictions without sacrificing sensitivity and specificity. However, the performance of predictive models depends on the training dataset, and an unrepresentative training set can introduce bias. We used 80% of the data for training and 20% for testing; a larger dataset would help reduce bias. Missing data is an important limitation of such models, but in this study there was no missing data owing to meticulous physical examinations and clinical evaluations.

Limitations

Limitations of the study should be considered when interpreting the findings and their clinical implications. The retrospective design and reliance on existing medical records may lead to incomplete or missing data, potentially limiting the ability to account for all relevant variables and confounders. Additionally, the study was conducted at a single center, potentially limiting the generalizability of the findings. Although the study included 329 subjects, a larger sample size would enhance statistical power and generalizability. While the machine learning models demonstrated satisfactory predictive performance, their interpretability may be limited. Understanding the specific factors driving the predictions of these models can be challenging, potentially affecting their clinical utility and decision-making process. Prospective studies with standardized data collection protocols would provide more robust and comprehensive results.

Conclusions

This study demonstrated the efficacy of machine learning models in predicting the clinical outcomes of patients with CSM who underwent LPSF. The findings underscore the capacity of these models to anticipate clinical results in this particular patient cohort, offering invaluable prognostic information for guiding preoperative discussions and postoperative care.