Analysis of spinal surgeons’ recommendations
An independent panel of five fellowship-trained spinal surgeons, each with more than 5 years of experience in practice, was set up. The panel reviewed the 500 medical vignettes to determine the surgical recommendation probability for each vignette (probabilities ranging from 0 to 1). Figure 1A plots the univariate analyses of doctor recommendations. Overall, recommendation probabilities were spread between 0 and 1, whereas for doctor 3, recommendations were skewed towards high probabilities. Bivariate analyses conducted between doctors found that their recommendation probabilities were positively but only moderately correlated (Fig. 1B). The average pairwise correlation was 0.4957; the lowest correlation was 0.36, observed both between doctors 1 and 2 and between doctors 1 and 5, while the highest was 0.72, between doctors 3 and 4. Pairwise Cohen's kappa between doctors also revealed moderate agreement (Supplementary File 4A). The standard deviations of recommendations were moderate, indicating good consistency of individual doctor recommendations (Supplementary File 4B).
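As a concrete illustration, the pairwise agreement statistics above can be computed as in the following minimal Python sketch. The `ratings` array, the random placeholder data and the 0.5 binarization threshold used for Cohen's kappa are illustrative assumptions, not details taken from the study.

```python
# Minimal sketch of the inter-rater analysis, assuming `ratings` is a
# (500 vignettes x 5 doctors) array of recommendation probabilities.
# The placeholder data and the 0.5 binarization threshold for kappa
# are assumptions for illustration only.
import numpy as np
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
ratings = rng.uniform(0.0, 1.0, size=(500, 5))  # placeholder data

# Pairwise Pearson correlations between doctors (off-diagonal entries).
corr = np.corrcoef(ratings.T)
pairs = list(combinations(range(5), 2))
pairwise = [corr[i, j] for i, j in pairs]
print(f"average pairwise correlation: {np.mean(pairwise):.4f}")

# Pairwise Cohen's kappa on binarized recommendations.
binary = (ratings >= 0.5).astype(int)
kappas = [cohen_kappa_score(binary[:, i], binary[:, j]) for i, j in pairs]
print(f"average pairwise Cohen's kappa: {np.mean(kappas):.4f}")
```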
These results thus suggest that, although doctors' recommendations were positively correlated, the agreement between doctors was moderate and one doctor was biased towards high recommendation probabilities, reflecting a high level of heterogeneity between individual doctor recommendations.
Model predictions of surgical recommendation probabilities
We assessed the accuracy of our hybrid model in predicting surgical recommendations, in comparison to individual doctor recommendations. For this purpose, the ground truth probability of surgical recommendation for each vignette was calculated as the average of the five independent doctors' recommendation probabilities. We removed vignettes showing very high disagreement between doctors (top 10% highest variance). The model was used to compute the recommendation probability for the same vignettes. The vignettes were randomly split into 70% for training the random forest, 10% for hybrid model weight estimation and 20% for estimating prediction accuracy (note that model training was irrelevant for the Bayesian network, which was not trained on data). The root mean square error (RMSE) between model predictions and ground truth probabilities was 0.0964 (Fig. 2A). The Pearson correlation and the R2 were 0.9093 and 0.8268, respectively. When comparing the fitted linear regression y = ax + b (assuming a linear relation between model predictions and ground truth) with the identity line y = x (representing perfect agreement), we observed that the model tended to slightly overestimate low ground truth probabilities (when surgery should not be done), while slightly underestimating high ground truth probabilities (when surgery should be done). In the hybrid model, the relative weights of the random forest and the Bayesian network were 0.85 and 0.15, placing more weight on machine learning. The random forest slightly overestimated low ground truth probabilities, but globally performed better than the Bayesian network, explaining its higher weight (Supplementary File 5). The lower performance of the Bayesian network was expected, since it was developed without any training on data.
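The hybrid combination and its evaluation can be sketched as follows, assuming the hybrid output is a convex combination of the two component predictions with the weight selected on the 10% validation split. The grid search over weights, the placeholder data and all variable names are illustrative assumptions rather than the study's actual procedure.

```python
# Hedged sketch of the hybrid combination p = w * p_rf + (1 - w) * p_bn,
# with the weight w chosen on a validation split, and of the accuracy
# metrics reported above. Placeholder data for illustration only.
import numpy as np
from scipy.stats import pearsonr

def combine(p_rf, p_bn, w):
    """Weighted average of random-forest and Bayesian-network outputs."""
    return w * p_rf + (1.0 - w) * p_bn

def pick_weight(p_rf_val, p_bn_val, y_val):
    """Choose the weight minimizing RMSE on the validation split."""
    grid = np.linspace(0.0, 1.0, 101)
    rmses = [np.sqrt(np.mean((combine(p_rf_val, p_bn_val, w) - y_val) ** 2))
             for w in grid]
    return grid[int(np.argmin(rmses))]

def evaluate(pred, truth):
    """RMSE, Pearson correlation and R^2 against ground truth."""
    rmse = np.sqrt(np.mean((pred - truth) ** 2))
    r, _ = pearsonr(pred, truth)
    ss_res = np.sum((truth - pred) ** 2)
    ss_tot = np.sum((truth - truth.mean()) ** 2)
    return rmse, r, 1.0 - ss_res / ss_tot

# Illustrative usage with placeholder validation data.
rng = np.random.default_rng(0)
y_val = rng.uniform(size=50)
p_rf_val = np.clip(y_val + rng.normal(0, 0.05, 50), 0, 1)
p_bn_val = np.clip(y_val + rng.normal(0, 0.15, 50), 0, 1)
w = pick_weight(p_rf_val, p_bn_val, y_val)
print(f"estimated hybrid weight for the random forest: {w:.2f}")
```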
The average RMSE between individual doctor recommendations and ground truth was 0.1940 (Fig. 2B). The average Pearson correlation and average R2 were 0.7846 and 0.6155, respectively. When comparing the fitted linear regression y = ax + b with the identity line y = x, we observed that doctor 3 globally overestimated the ground truth probabilities.
When predicting surgical recommendation probabilities, our vignette-based validation thus revealed that the AI model performed comparably to individual doctor recommendations.
Variable importance
We next assessed which variables were the best predictors of surgical recommendation by computing variable importance from the random forest model. Variables related to radiologic findings ranked among the top predictors, including “Imaging showing stenosis”, “Imaging showing disc herniation” and “Imaging showing segmental instability”. Moreover, certain clinical symptoms, including “Motor deficit as reported by doctors”, “Back pain” and “Leg weakness as reported from patient”, were also very influential (Fig. 3).
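A minimal sketch of this computation is shown below, assuming a scikit-learn RandomForestRegressor and its impurity-based importances; the placeholder data and generic feature names are illustrative, not the study's actual variables.

```python
# Sketch of variable-importance extraction from a random forest,
# assuming a scikit-learn regressor fitted on the vignette features.
# X_train, y_train and feature_names are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(size=(350, 10))   # placeholder features
y_train = rng.uniform(size=350)         # placeholder probabilities
feature_names = [f"feature_{i}" for i in range(10)]

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)

# Impurity-based importances, sorted from most to least influential.
order = np.argsort(rf.feature_importances_)[::-1]
for i in order:
    print(f"{feature_names[i]}: {rf.feature_importances_[i]:.3f}")
```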
Model predictions of surgical recommendation as binary decision
Surgical recommendations were also analyzed as a dichotomous classification, discriminating between two classes, no or weak recommendation vs. strong recommendation, using a probability threshold of 0.66.
The AUROC between model and ground truth recommendations was 0.9266 (Fig. 4A), while the sensitivity and specificity were 0.8 and 0.8298, respectively, revealing good accuracy metrics. The Cohen's kappa for interrater agreement was 0.6298. In comparison, the average AUROC based on individual doctors' recommendations was 0.8412 (Fig. 4B), and the sensitivity and specificity were 0.7850 and 0.7830, respectively. The average Cohen's kappa was 0.5659, showing similar agreement.
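The dichotomous evaluation can be sketched as follows, assuming both predicted and ground truth probabilities are binarized at the 0.66 threshold and the AUROC is computed from the continuous model scores; the function and array names, along with the placeholder data, are illustrative assumptions.

```python
# Hedged sketch of the dichotomous evaluation: continuous probabilities
# are thresholded at 0.66 into weak/strong recommendation classes.
# Names and placeholder data are assumptions for illustration only.
import numpy as np
from sklearn.metrics import roc_auc_score, cohen_kappa_score, confusion_matrix

THRESHOLD = 0.66

def binary_metrics(pred_proba, truth_proba):
    """AUROC, sensitivity, specificity and Cohen's kappa at the threshold."""
    y_true = (truth_proba >= THRESHOLD).astype(int)
    y_pred = (pred_proba >= THRESHOLD).astype(int)
    auroc = roc_auc_score(y_true, pred_proba)  # AUROC uses continuous scores
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    kappa = cohen_kappa_score(y_true, y_pred)
    return auroc, sensitivity, specificity, kappa

# Illustrative usage with placeholder data.
rng = np.random.default_rng(0)
truth = rng.uniform(size=100)
pred = np.clip(truth + rng.normal(0, 0.1, 100), 0, 1)
print(binary_metrics(pred, truth))
```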
In a dichotomous classification setting, these results reveal that our model performed comparably to individual doctors.