Musculoskeletal radiologist-level performance by using deep learning for detection of scaphoid fractures on conventional multi-view radiographs of hand and wrist

Objectives To assess how an artificial intelligence (AI) algorithm performs against five experienced musculoskeletal radiologists in diagnosing scaphoid fractures and whether it aids their diagnosis on conventional multi-view radiographs.

Methods Four datasets of conventional hand, wrist, and scaphoid radiographs were retrospectively acquired at two hospitals (hospitals A and B). Dataset 1 (12,990 radiographs from 3353 patients, hospital A) and dataset 2 (1117 radiographs from 394 patients, hospital B) were used for training and testing a scaphoid localization and laterality classification component. Dataset 3 (4316 radiographs from 840 patients, hospital A) and dataset 4 (688 radiographs from 209 patients, hospital B) were used for training and testing the fracture detector. The algorithm was compared with the radiologists in an observer study. Evaluation metrics included sensitivity, specificity, positive predictive value (PPV), area under the receiver operating characteristic curve (AUC), Cohen’s kappa coefficient (κ), fracture localization precision, and reading time.

Results The algorithm detected scaphoid fractures with a sensitivity of 72%, specificity of 93%, PPV of 81%, and AUC of 0.88. The AUC of the algorithm did not differ from that of any of the radiologists (0.87 [radiologists’ mean], p ≥ .05). AI assistance improved five out of ten pairs of inter-observer Cohen’s κ agreements (p < .05) and reduced reading time for four radiologists (p < .001), but did not improve other metrics in the majority of radiologists (p ≥ .05).

Conclusions The AI algorithm detects scaphoid fractures on conventional multi-view radiographs at the level of five experienced musculoskeletal radiologists and could significantly shorten their reading time.

Key Points
• An artificial intelligence algorithm automatically detects scaphoid fractures on conventional multi-view radiographs at the level of five experienced musculoskeletal radiologists.
• There is preliminary evidence that automated scaphoid fracture detection can significantly shorten the reading time of musculoskeletal radiologists.

Supplementary Information The online version contains supplementary material available at 10.1007/s00330-022-09205-4.


Introduction
Scaphoid fractures, the most common type of carpal bone fracture (82-89%) [1], are typically diagnosed through clinical and conventional radiographic examination. However, it is well known that they can be difficult to detect on initial radiographs: reported percentages of missed scaphoid fractures vary from 7 to 50% [2][3][4]. If missed scaphoid fractures are left untreated and become displaced, the risk of developing non-union can be high (14-50%) [5]. Non-union fractures can have serious complications, such as progressive degeneration and collapse of the carpal bones [6]. To avoid these risks, more than half of the patients receive precautionary wrist immobilization that ultimately proves unnecessary [7,8]. Since both non-union fractures and overtreatment increase healthcare costs and lost productivity, it is important to investigate strategies that aid early and accurate scaphoid fracture diagnosis.
In general, two diagnostic strategies have been discussed in the literature. The first strategy involves complementing or replacing initial conventional radiography with follow-up conventional radiography (after 10-14 days) or advanced imaging modalities such as CT and MRI. The second strategy involves exploiting the diagnostic value of conventional radiography via artificial intelligence (AI) software. Karl and Swart [9] and Yin et al [10] showed that immediate CT and MRI scans were more cost-effective than follow-up radiographs except when lost productivity from immobilization is slight. However, if there is a limited number of scanners available or if the cost of using these scanners is too high, conventional radiography is the primary or only means for imaging scaphoid fractures.
There is a growing body of literature demonstrating that deep learning-based AI software can achieve a diagnostic performance comparable to clinicians in detecting fractures at imaging [11]. Recently, Hendrix et al [12] and Yoon et al [13] respectively demonstrated that AI software can achieve radiologist-level performance in diagnosing scaphoid fractures on conventional radiographs and that it can detect occult fractures with high accuracy, thereby showing that AI has the potential to aid radiologists in detecting scaphoid fractures. Large-scale retrospective studies conducted by Duron et al [14] and Guermazi et al [15] showed that AI software could indeed improve the sensitivity and specificity of radiologists and other physicians in detecting various skeletal fractures. However, performance measures for scaphoid fracture diagnosis were lacking, and only Duron et al included musculoskeletal (MSK) radiologists, who specialize in diagnosing skeletal fractures, in their observer study. Moreover, previous works [12,13,16,17] on automated scaphoid fracture diagnosis only used anterior-posterior (AP) and posterior-anterior (PA) radiographs, whereas in clinical practice multiple radiographic views such as oblique and lateral views are used. These limitations raise the question of whether previous findings hold when scaphoid fracture diagnosis is conducted with multi-view radiographs and whether AI software can improve the performance of radiologists in this setting, particularly that of MSK radiologists.
The purpose of this study was therefore to assess how an AI algorithm performs against experienced MSK radiologists in detecting scaphoid fractures on conventional multi-view radiographs and to assess whether it can aid MSK radiologists in clinical practice.

Datasets
This retrospective study was approved by the medical ethical review boards of the Radboud University Medical Center (Radboudumc) and the Jeroen Bosch Hospital (JBZ) in the Netherlands. Informed written consent was waived, and data collection and storage were performed in accordance with local guidelines. Dataset 1 (12,990 radiographs from 3353 patients) and dataset 2 (1117 radiographs from 394 patients) served for respectively training and testing a CNN-based scaphoid localization and laterality classification algorithm, and dataset 3 (4316 radiographs from 840 patients) and dataset 4 (688 radiographs from 209 patients) served for respectively training and testing a CNN-based fracture detection algorithm. The training and test datasets were gathered at different hospitals to assess the generalization performance of the algorithm. Only radiographs acquired at the initial hospital visit were included in dataset 4 as we focused on early fracture detection. Furthermore, the number of available radiographic views varied per study and patient. An overview of the characteristics of the datasets is provided in Table 1 (refer to Appendix E1 [online] for additional imaging parameters). This table also describes a patient overlap between these datasets and those from Hendrix et al [12]. Data from the previous study was added to the training datasets (datasets 1-3) to reduce the annotation effort, and ten patients (out of 209 [5%]) overlapped between the test dataset of the present study (dataset 4) and the previous study as a result of this data reuse.

Ground truth
Two MSK radiologists (K.v.D. and M.R., with 22 and 26 years of experience, respectively) determined the ground truth for dataset 4. For each patient they determined the fracture presence in six scaphoid regions as defined by Wong and Ho [18]. These regions included the following: scaphoid tubercle (A1), distal articular (A2), distal 1/3 (B1), middle 1/3 (B2), proximal 1/3 (B3), and proximal pole (C). All cases were independently reviewed and disagreements were resolved by consensus reading. Both radiologists had access to all available imaging information (conventional radiography, CT, and MRI studies) and clinical information (clinical questions, and patient demographics and history) in the PACS and electronic health record (EHR) system (refer to Appendix E4 [online] for an overview of the reference standards).

AI pipeline
The pipeline of the scaphoid fracture detection AI algorithm is summarized in Fig. 2. The algorithm was designed for processing a radiographic study with an arbitrary number of series and it was implemented on an NVIDIA RTX Titan graphics processing unit with the PyTorch machine learning framework [19]. First, the scaphoid localization and laterality classification CNN localized the scaphoid and determined its orientation (frontal view, including [ulnar-deviated] AP/PA and oblique, or lateral view) and laterality (left or right hand). The scaphoid was then extracted from the image and was analyzed by either a frontal view or lateral view fracture detection CNN. In this analysis, a fracture score was generated for each of the six scaphoid regions as defined by Wong and Ho [18]. All processing steps were repeated for every input image and finally the maximum fracture score per region and per hand was selected. The whole algorithm is freely available at https://grandchallenge.org/algorithms/multiview-scaphoid-fracturedetection/, where it can be run in a web browser. A detailed description of the processing steps and training procedure is provided in Appendix E5 and E6 (online).
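As a concrete illustration of the final aggregation step, the sketch below is a minimal reconstruction of the logic described above, not the published implementation; the class and function names are invented for illustration. It takes per-view region scores with laterality labels and keeps the maximum fracture score per region and per hand:

```python
from dataclasses import dataclass
from typing import Dict, List

REGIONS = ["A1", "A2", "B1", "B2", "B3", "C"]  # Wong & Ho scaphoid regions

@dataclass
class ViewPrediction:
    laterality: str                   # "left" or "right", from the localization CNN
    region_scores: Dict[str, float]   # fracture score per region for this view

def aggregate_study(predictions: List[ViewPrediction]) -> Dict[str, Dict[str, float]]:
    """Combine per-view region scores into one score set per hand by taking
    the maximum score observed for each region across all input images."""
    per_hand: Dict[str, Dict[str, float]] = {}
    for pred in predictions:
        hand = per_hand.setdefault(pred.laterality, {r: 0.0 for r in REGIONS})
        for region, score in pred.region_scores.items():
            hand[region] = max(hand[region], score)
    return per_hand
```

Because each view contributes independently, adding an extra view can only raise (never lower) a region's final score, which matches the maximum-selection rule described above.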

Observer study
To validate the performance of the fracture detection AI algorithm as well as its potential value as a computer-aided diagnosis tool, an observer study was conducted in which five MSK radiologists independently rated each of the six scaphoid regions defined by Wong and Ho [18] for the presence of a fracture. They indicated their confidence for each region on a continuous scale from 0 to 1.0, where 1.0 indicates absolute certainty of a fracture and 0.5 is the cut-off point for determining whether a fracture was present. In cases where radiographs of both hands were taken, the radiologists indicated to which hand(s) their ratings applied. After a 4-month washout period, the radiologists reassessed all cases using the same protocol while being provided with the predictions of the algorithm. To minimize potential recall bias after the washout period, the order of the cases was shuffled.

[Fig. 1 caption fragment: Case selection for the test data. The number of studies at each step is denoted with n. DBC, diagnosis-treatment combination; ICD-10, International Classification of Diseases Version 10; JBZ, Jeroen Bosch Hospital. (2) Studies with ICD-10 diagnosis code S62.00 were included. (3) There was no patient overlap between the samples from the non-fracture and fracture category. (4) Studies were excluded when the wrist was in a cast, the scaphoid was …]
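The rating protocol above can be reduced to a case-level decision rule. The sketch below is our own illustration, not the study software; it assumes (as one plausible reading of the protocol) that the case-level confidence is the maximum of the six region confidences, compared against the 0.5 cut-off:

```python
def case_decision(region_confidences: dict, cutoff: float = 0.5):
    """Reduce six per-region confidence ratings (0.0-1.0) to a single
    case-level confidence and a binary fracture decision for one hand.

    Assumption (ours): the case-level confidence is the maximum region
    confidence, which is then compared against the 0.5 cut-off."""
    confidence = max(region_confidences.values())
    return confidence, confidence >= cutoff

# Example: a high confidence in the middle third (B2) drives the decision.
conf, fractured = case_decision(
    {"A1": 0.1, "A2": 0.0, "B1": 0.2, "B2": 0.8, "B3": 0.1, "C": 0.0}
)
```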

Statistical analysis
The auxiliary scaphoid localization and laterality classification CNN was evaluated separately on datasets 1 and 2. The evaluation details are provided in Appendix E7 (online). The fracture detection AI algorithm was cross-validated on dataset 3 using 10 folds (no patient overlap) and was tested on dataset 4. The evaluation metrics included the following: sensitivity, specificity, PPV, Cohen's kappa coefficient (κ), area under the receiver operating characteristic curve (AUC), mean precision in localizing the fracture locations per scaphoid ("mean localization precision" [MLP]), and reading time (in seconds). The detection threshold that maximized the F1-score was chosen for the analysis. The fracture scores of the algorithm were based on automated image crops and laterality labels. The radiographic inputs from datasets 3 and 4 were grouped by study and patient, respectively, and the corresponding scores were grouped by hand. Cohen's κ was used for measuring the agreement between the algorithm and radiologists.
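The F1-maximizing operating point mentioned above can be read off the precision-recall curve. The sketch below is our own illustration (not the study code), using scikit-learn, which the authors also used for their metrics:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def f1_optimal_threshold(y_true, scores):
    """Return the detection threshold that maximizes the F1-score.

    precision and recall carry one more entry than thresholds (the final
    (precision=1, recall=0) point), so that last entry is dropped."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    f1 = (2 * precision[:-1] * recall[:-1]
          / np.clip(precision[:-1] + recall[:-1], 1e-12, None))
    return thresholds[int(np.argmax(f1))]

# Toy example: cutting at 0.35 gives precision 2/3 and recall 1 (F1 = 0.8),
# which beats the other candidate cut-offs.
t = f1_optimal_threshold([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Cases scoring at or above this threshold would be flagged as fractures; Cohen's κ between two readers' binary decisions can then be computed with `sklearn.metrics.cohen_kappa_score`.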
The evaluation metrics were calculated using the scikit-learn machine learning library (version 0.23.2, 2021) [20] for Python. Stratified bootstrapping with 1000 iterations was applied for estimating 95% confidence intervals (CIs). Significance testing was performed with two-sided paired permutation tests with 1000 iterations using the MLxtend library (version 0.19.0, 2021) [21] for Python. A p value smaller than .05 was considered significant.
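For readers who want to reproduce the uncertainty estimates, the sketch below is a from-scratch NumPy illustration of stratified bootstrapping and a two-sided paired permutation (sign-flip) test; it is our own reconstruction under stated assumptions, not the MLxtend calls the authors used:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def stratified_bootstrap_ci(y_true, scores, metric=roc_auc_score,
                            n_iter=1000, alpha=0.05):
    """Percentile bootstrap CI, resampling positives and negatives
    separately so every replicate keeps the original class balance."""
    y, s = np.asarray(y_true), np.asarray(scores)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    stats = []
    for _ in range(n_iter):
        idx = np.concatenate([rng.choice(pos, pos.size, replace=True),
                              rng.choice(neg, neg.size, replace=True)])
        stats.append(metric(y[idx], s[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

def paired_permutation_p(a, b, n_iter=1000):
    """Two-sided paired permutation test on the mean difference between
    per-case values a and b (e.g. correctness indicators of two readers)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = abs(np.mean(a - b))
    count = 0
    for _ in range(n_iter):
        flip = rng.random(a.size) < 0.5   # randomly swap paired values
        diff = np.where(flip, b - a, a - b)
        count += abs(diff.mean()) >= observed
    return (count + 1) / (n_iter + 1)
```

Stratifying the resampling by class keeps every bootstrap replicate at the original fracture prevalence, which stabilizes metrics such as AUC that are undefined when one class is absent from a replicate.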

Test dataset characteristics
From the initial sample of 292 studies selected for the test dataset (dataset 4), 20 studies were excluded due to radiographs unsuitable for fracture diagnosis (n = 7), inconclusive evidence (n = 7), and non-acute fractures (n = 6) (see Fig. 1). This resulted in a final selection of 272 studies from 209 patients (mean age, 39 years ± 23 [standard deviation]; 107 women). The studies were grouped by patient and hand into 219 cases, of which 65 cases contained a scaphoid fracture (see Table 1). There was at least one PA view in all cases and there was at least one ulnar-deviated PA, oblique, and lateral view in 55 cases (30 with fracture), 159 cases (52 with fracture), and 156 cases (63 with fracture), respectively.

Fracture detection by AI
A quantitative and qualitative analysis of the scaphoid localization and laterality classification results is included in Appendix E8 (online). The scaphoid fracture detection AI algorithm obtained an AUC of 0.89 (95% CI: 0.87, 0.91) on dataset 3. The corresponding ROC curve and additional evaluation metrics are included in Appendix E9 (online). Table 2 presents the sensitivity, specificity, PPV, MLP, and AUC with their 95% CIs of the scaphoid fracture detection AI algorithm on dataset 4; the corresponding ROC curve is shown in Fig. 3a. The ROC curve of each input configuration is shown in Fig. 3b.

Radiologist performance in scaphoid fracture detection with and without AI assistance

Table 3 presents the sensitivity, specificity, PPV, MLP, AUC, and median reading time with their 95% CIs and p values of the five MSK radiologists for scaphoid fracture detection with and without AI assistance. The corresponding ROC curves are shown in Fig. 4a and b. AI assistance significantly reduced the reading time of four radiologists, while one radiologist showed a lower sensitivity and AUC with assistance. In all other cases, AI assistance had no effect on the sensitivity, specificity, PPV, MLP, AUC, and reading time. Table 4 shows the Cohen's kappa coefficients with their 95% CIs and p values for the fracture detection agreements between the radiologists (with and without AI assistance) and the AI algorithm. Overall, the radiologists were in moderate to substantial agreement with each other (range without AI assistance: 0.50-0.71; range with AI assistance: 0.62-0.79). With AI assistance, five out of the ten pairs of radiologists had a higher agreement (Rad1-Rad3, 0.56 vs. 0.79, p = .002; Rad2-Rad3, 0.50 vs. 0.71, p = .006; Rad2-Rad4, 0.58 vs. 0.74, p = .02; Rad3-Rad4, 0.53 vs. 0.74, p = .003; Rad3-Rad5, 0.52 vs. 0.68, p = .03), whereas the agreement between all other pairs of radiologists remained unchanged. With AI assistance, two out of the five radiologists had a higher agreement with the algorithm (Rad3, 0.56 vs. 0.72, p = .02; Rad4, 0.62 vs. 0.80, p < .001), whereas all other radiologists agreed equally well with the algorithm.
The proportion of correctly and incorrectly changed fracture diagnoses by the radiologists with AI assistance and their correlation with the automated fracture scores are shown in Appendix E11 (online).

Fig. 3 a Receiver operating characteristic (ROC) curve (blue) with operating point of the automated scaphoid fracture detection results based on all available radiographic views from dataset 4 (65 fracture cases, 154 non-fracture cases; each case represents one hand from one patient). The corresponding mean localization precision curve (orange) is shown as well. The shaded bands represent 95% confidence intervals. The black line represents no ability to discriminate between fracture and non-fracture cases. b ROC curves of the automated scaphoid fracture detection results for multiple input configurations on the same dataset. All 219 cases included at least one posterior-anterior (PA) view (65 cases with fracture), 55 cases included at least one ulnar-deviated PA view (30 cases with fracture), 159 cases included at least one oblique view (52 cases with fracture), and 156 cases included at least one lateral view (63 with fracture).
A follow-up analysis of the decisions made by the algorithm and the MSK radiologists revealed that six out of the 29 mistakes of the algorithm (three fracture cases, three non-fracture cases) were not made by any of the radiologists. Conversely, 12 out of the 23 mistakes made by the majority of radiologists (three fracture cases, nine non-fracture cases) were not made by the algorithm. The failure cases of the algorithm and radiologists are shown in Figs. 5 and 6, respectively.

Discussion
Patients with a clinically suspected scaphoid fracture often receive unnecessary wrist immobilization, as acute scaphoid fractures can cause severe damage to the wrist if they remain undetected and untreated. This study assessed how a CNN-based AI algorithm performed against five MSK radiologists and whether it aided diagnosis of scaphoid fractures on conventional multi-view radiographs. The algorithm was shown to be able to detect scaphoid fractures as well as the MSK radiologists (AUC, 0.88 vs. 0.87 [average of radiologists]; p ≥ .05 for all radiologists). Moreover, it was able to indicate which regions in the scaphoid were fractured as precisely as the majority of radiologists (MLP, 87% vs. 92% [average of radiologists], p ≥ .05 for majority). Furthermore, AI assistance improved five out of ten pairs of inter-observer Cohen's κ agreement (average increase of 36.2%, p < .05 for all pairs) and reduced the reading time of four radiologists (average reduction of 49.4%, p < .001 for all four), but no improvements were found in sensitivity, specificity, PPV, and AUC for the majority of radiologists. The results showed that the scaphoid fracture detection performance of the AI algorithm improved when PA views were supplemented with ulnar-deviated and oblique PA views, whereas adding lateral views did not lead to a performance increase. These findings underline the conclusions of Cheung et al [22] that the PA, ulnar-deviated PA, and oblique views are most important for scaphoid fracture detection and indicate that this also applies to deep learning-based AI algorithms.
This implies that a multi-view approach to scaphoid fracture detection should be adopted in future AI applications.
A qualitative analysis of the failure cases revealed that the algorithm made six mistakes (three fracture cases, three non-fracture cases) that none of the MSK radiologists made. The false positive detections were likely caused by overprojection lines of the other carpal bones on the scaphoid in the lateral view and by a sclerotic line from an old healed fracture. The false negative cases included two scaphoids with an evident but non-sharply delineated scaphoid waist fracture. The latter finding is in line with the observations of Hendrix et al [12] and Langerhuizen et al [23] that deep learning-based AI algorithms may miss fractures that are evident to human observers.
There were 12 mistakes made by the majority of MSK radiologists (three fracture cases, nine non-fracture cases) that were not made by the algorithm. In most of the false positive cases (5/9), the scaphoid and its surrounding joints showed degenerative signs or slight deformities, which could suggest a fracture even when no hypodense line was visible. The remaining false positive detections were caused by very subtle or diffuse hypodense lines. The false negative detections were made in scaphoids with a displaced fracture causing a subtle protrusion of the cortical bone with no evident fracture line visible. Similar false negative detections made by radiologists but not by an AI algorithm can be observed in Hendrix et al [12], but qualitative analyses of false positive detections are lacking in previous studies. While these findings suggest that the algorithm may have merit in aiding the interpretation of degenerative or malformed scaphoids for fracture diagnosis, follow-up studies are required to confirm this.
The positive effects of AI assistance on the inter-observer agreement and reading time of the radiologists provide preliminary evidence that the algorithm could improve the diagnostic efficiency of MSK radiologists while maintaining the same diagnostic accuracy. However, one radiologist had a significantly lower sensitivity and AUC in the AI-assistance condition, although the incorrectly changed answers were only weakly correlated with the answers from the algorithm (0.34). The decreases in reading time are in line with the conclusions from Duron et al [14] and Guermazi et al [15], but we did not find any increases in sensitivity or consistent increases in specificity. This difference could be due to the few carpal fractures in their test datasets. Furthermore, even though it was not investigated in this study, general radiologists, radiology residents, and other physicians might benefit more from AI assistance.

Fig. 4 a Receiver operating characteristic (ROC) curves of the results of the scaphoid fracture detection algorithm and those of the musculoskeletal (MSK) radiologists without artificial intelligence (AI) assistance on dataset 4 (65 fracture cases, 154 non-fracture cases; each case represents one hand from one patient). b ROC curves of the results of the scaphoid fracture detection algorithm and those of the MSK radiologists with AI assistance on the same dataset. The corner of each plot is magnified for easier comparison of the curves. The black line represents no ability to discriminate between fracture and non-fracture cases.
The strengths of this study included the use of multi-center clinical data and external validation, the participation of five experienced MSK radiologists, and the evaluation of an automatic multi-view scaphoid fracture detection AI algorithm. However, this study also had some limitations. First, we aimed to minimize selection bias in our test set by using all available information in the PACS and EHR system rather than using only studies with a follow-up CT or MRI scan for testing. This means that occult scaphoid fractures might have been labelled as negative cases when the patient did not return to the hospital with persistent symptoms. This trade-off between reference standard quality and selection bias could be circumvented by conducting a prospective study in which patients immediately undergo a CT and MRI scan after an initial examination with conventional radiography, but this would be too time-intensive and costly for acquiring sufficient data.
Second, even though we investigated the contribution of each radiographic view to automated scaphoid fracture detection, we simplified the model architecture by processing AP/PA, ulnar-deviated AP/PA, and oblique view radiographs with a single CNN. The model performance might be further improved in future research by training separate CNNs for each view.
In conclusion, the findings presented in this study support the hypothesis that an AI algorithm can achieve MSK radiologist level performance in detecting scaphoid fractures on conventional multi-view radiographs. Moreover, there is preliminary evidence that AI assistance could improve the diagnostic efficiency of MSK radiologists, but not their diagnostic accuracy. Future research should evaluate the impact of AI assistance on diagnostic performance, clinical decision making, and patient outcomes in a randomized clinical trial involving both radiologists and non-radiologists.