Introduction

Scaphoid fractures, the most common type of carpal bone fracture (82–89%) [1], are typically diagnosed through clinical and conventional radiographic examination. However, they can be difficult to detect on initial radiographs: reported percentages of missed scaphoid fractures range from 7 to 50% [2,3,4]. If missed scaphoid fractures are left untreated and become displaced, the risk of developing non-union can be high (14–50%) [5]. Non-union fractures can have serious complications, such as progressive degeneration and collapse of the carpal bones [6]. To avoid these complications, more than half of the patients with a clinically suspected scaphoid fracture receive precautionary wrist immobilization that later proves unnecessary [7, 8]. Since both non-union fractures and overtreatment increase healthcare costs and lost productivity, it is important to investigate strategies that aid early and accurate scaphoid fracture diagnosis.

In general, two diagnostic strategies have been discussed in the literature. The first strategy complements or replaces initial conventional radiography with follow-up conventional radiography (after 10–14 days) or with advanced imaging modalities such as CT and MRI. The second strategy leverages the diagnostic value of conventional radiography by means of artificial intelligence (AI) software. Karl and Swart [9] and Yin et al [10] showed that immediate CT and MRI scans were more cost-effective than follow-up radiographs, except when the lost productivity caused by immobilization was minimal. However, if only a limited number of scanners is available or if the cost of using these scanners is too high, conventional radiography remains the primary or only means of imaging scaphoid fractures.

There is a growing body of literature demonstrating that deep learning–based AI software can achieve a diagnostic performance comparable to that of clinicians in detecting fractures at imaging [11]. Recently, Hendrix et al [12] demonstrated that AI software can achieve radiologist-level performance in diagnosing scaphoid fractures on conventional radiographs, and Yoon et al [13] demonstrated that it can detect occult fractures with high accuracy, showing that AI has the potential to aid radiologists in detecting scaphoid fractures. Large-scale retrospective studies by Duron et al [14] and Guermazi et al [15] showed that AI software could indeed improve the sensitivity and specificity of radiologists and other physicians in detecting various skeletal fractures. However, performance measures specific to scaphoid fracture diagnosis were lacking, and only Duron et al included musculoskeletal (MSK) radiologists, who are specialized in diagnosing skeletal fractures, in their observer study. Moreover, previous works [12, 13, 16, 17] on automated scaphoid fracture diagnosis only used anterior-posterior (AP) and posterior-anterior (PA) radiographs, whereas in clinical practice multiple radiographic views, such as oblique and lateral views, are used. These limitations raise the question of whether previous findings hold when scaphoid fracture diagnosis is conducted with multi-view radiographs and whether AI software can improve the performance of radiologists, particularly MSK radiologists, in this setting.

The purpose of this study was therefore to assess how an AI algorithm performs against experienced MSK radiologists in detecting scaphoid fractures on conventional multi-view radiographs and to assess whether it can aid MSK radiologists in clinical practice.

Materials and methods

Datasets

This retrospective study was approved by the medical ethical review boards of the Radboud University Medical Center (Radboudumc) and the Jeroen Bosch Hospital (JBZ) in the Netherlands. Informed written consent was waived, and data collection and storage were performed in accordance with local guidelines. Dataset 1 (12,990 radiographs [from 3353 patients] acquired during 2003–2019 at Radboudumc) and dataset 2 (1117 radiographs [from 394 patients] acquired during 2018–2019 at JBZ) served for training and testing, respectively, two auxiliary convolutional neural networks (CNNs) for scaphoid localization and laterality classification. Dataset 3 (4316 radiographs [from 840 patients] acquired during 2003–2019 at Radboudumc) and dataset 4 (688 radiographs [from 209 patients] acquired during 2011–2018 at JBZ) served for training and testing, respectively, a CNN-based fracture detection algorithm. The training and test datasets were gathered at different hospitals to assess the generalization performance of the algorithm. Only radiographs acquired at the initial hospital visit were included in dataset 4, as we focused on early fracture detection. Furthermore, the number of available radiographic views varied per study and patient. An overview of the characteristics of the datasets is provided in Table 1 (refer to Appendix E1 [online] for additional imaging parameters). This table also describes the patient overlap between these datasets and those from Hendrix et al [12]. Data from the previous study were added to the training datasets (datasets 1–3) to reduce the annotation effort, and ten patients (out of 209 [5%]) overlapped between the test dataset of the present study (dataset 4) and that of the previous study as a result of random sampling. The annotation protocol and the inclusion and exclusion criteria are provided in Appendices E2 and E3 (online). A flowchart of the study selection for dataset 4 is shown in Fig. 1.

Table 1 Details of the experimental datasets
Fig. 1
figure 1

Flowchart for the inclusion and exclusion of samples in dataset 4 (test data). The number of studies at each step is denoted with n. DBC, diagnosis-treatment combination; ICD-10, International Classification of Diseases Version 10; JBZ, Jeroen Bosch Hospital. ²Studies with ICD-10 diagnosis code S62.00 were included. ³There was no patient overlap between the samples from the non-fracture and fracture categories. ⁴Studies were excluded when the wrist was in a cast, the scaphoid was incompletely depicted, or there was severe scapholunate advanced collapse

Ground truth

Two MSK radiologists (K.v.D. and M.R., with 22 and 26 years of experience, respectively) determined the ground truth for dataset 4. For each patient, they determined the presence of a fracture in six scaphoid regions as defined by Wong and Ho [18]: scaphoid tubercle (A1), distal articular (A2), distal 1/3 (B1), middle 1/3 (B2), proximal 1/3 (B3), and proximal pole (C). All cases were reviewed independently, and disagreements were resolved by consensus reading. Both radiologists had access to all available imaging information (conventional radiography, CT, and MRI studies) and clinical information (clinical questions, patient demographics, and patient history) in the PACS and electronic health record (EHR) system (refer to Appendix E4 [online] for an overview of the reference standards).
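To make the annotation schema concrete, the following is a minimal Python sketch of how the per-hand, per-region ground-truth labels and the disagreement check preceding consensus reading could be represented; the names (`RegionLabels`, `find_disagreements`) and the structure are illustrative assumptions, not the study's actual annotation software.

```python
from dataclasses import dataclass, field
from typing import Dict

# Six scaphoid regions as defined by Wong and Ho [18].
REGIONS = ("A1", "A2", "B1", "B2", "B3", "C")

@dataclass
class RegionLabels:
    """Per-hand fracture presence for each scaphoid region (hypothetical schema)."""
    patient_id: str
    hand: str  # "left" or "right"
    fracture: Dict[str, bool] = field(default_factory=lambda: {r: False for r in REGIONS})

def find_disagreements(reader1: RegionLabels, reader2: RegionLabels) -> Dict[str, bool]:
    """Regions where the two readers disagree; these would be resolved by consensus reading."""
    return {r: reader1.fracture[r] != reader2.fracture[r] for r in REGIONS}
```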

AI pipeline

The pipeline of the scaphoid fracture detection AI algorithm is summarized in Fig. 2. The algorithm was designed to process a radiographic study with an arbitrary number of series, and it was implemented on an NVIDIA RTX Titan graphics processing unit with the PyTorch machine learning framework [19]. First, the scaphoid localization and laterality classification CNNs localized the scaphoid and determined its orientation (frontal view, including [ulnar-deviated] AP/PA and oblique views, or lateral view) and laterality (left or right hand). The scaphoid was then extracted from the image and analyzed by either the frontal view or the lateral view fracture detection CNN. In this analysis, a fracture score was generated for each of the six scaphoid regions as defined by Wong and Ho [18]. All processing steps were repeated for every input image, and finally the maximum fracture score per region and per hand was selected. The complete algorithm is freely available at https://grand-challenge.org/algorithms/multiview-scaphoid-fracture-detection/, where it can be run in a web browser. A detailed description of the processing steps and training procedure is provided in Appendices E5 and E6 (online).
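The sketch below outlines the per-study processing logic in PyTorch-style Python to make the order of operations concrete. The stand-in networks, the crop size, and all names are illustrative assumptions; they are not the trained CNNs, which are available via the link above.

```python
import torch
import torch.nn as nn

REGIONS = ("A1", "A2", "B1", "B2", "B3", "C")
CROP_SIZE = 128  # assumed crop resolution, for illustration only

# Stand-ins for the two trained fracture detection CNNs (architectures are assumptions).
frontal_net = nn.Sequential(nn.Flatten(), nn.Linear(CROP_SIZE * CROP_SIZE, len(REGIONS)))
lateral_net = nn.Sequential(nn.Flatten(), nn.Linear(CROP_SIZE * CROP_SIZE, len(REGIONS)))

def detect_fractures(radiographs):
    """radiographs: list of dicts with a scaphoid crop, view type, and laterality.

    In the real pipeline the crop, view, and laterality come from the auxiliary
    localization and laterality classification CNNs; here they are given directly.
    """
    # Running maximum fracture score per region, kept separately per hand.
    best = {hand: torch.zeros(len(REGIONS)) for hand in ("left", "right")}
    for r in radiographs:
        net = lateral_net if r["view"] == "lateral" else frontal_net
        with torch.no_grad():
            scores = torch.sigmoid(net(r["crop"].reshape(1, -1))).squeeze(0)
        best[r["laterality"]] = torch.maximum(best[r["laterality"]], scores)
    return best

# Example: one PA view and one lateral view of the right hand.
study = [
    {"crop": torch.rand(CROP_SIZE, CROP_SIZE), "view": "pa", "laterality": "right"},
    {"crop": torch.rand(CROP_SIZE, CROP_SIZE), "view": "lateral", "laterality": "right"},
]
print(detect_fractures(study)["right"])
```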

Fig. 2
figure 2

Overview of the scaphoid fracture detection artificial intelligence (AI) pipeline, which consisted of four convolutional neural networks (CNNs): a scaphoid localization CNN, a scaphoid laterality classification CNN, and two scaphoid fracture detection CNNs that separately process frontal view radiographs (including anterior-posterior/posterior-anterior [AP/PA], ulnar-deviated AP/PA, and oblique views) and lateral view radiographs

Observer study

To validate the performance of the fracture detection AI algorithm, as well as its potential value as a computer-aided diagnosis system, an observer study was conducted among five experienced MSK radiologists with 7, 5, 22, 24, and 26 years of experience (S.B., S.S., B.M., M.d.J., M.M.). For each patient in dataset 4, the radiologists independently assessed each of the six scaphoid regions as defined by Wong and Ho [18] for the presence of a fracture. They indicated their confidence for each region on a continuous scale from 0 to 1.0, where 1.0 indicates absolute certainty of a fracture and 0.5 is the cut-off point for determining whether a fracture was present. In cases where radiographs of both hands were taken, the radiologists indicated to which hand(s) their ratings applied. After a 4-month washout period, the radiologists reassessed all cases using the same protocol while being provided with the predictions of the algorithm. To minimize potential recall bias after the washout period, the order of the cases was shuffled.

Statistical analysis

The auxiliary scaphoid localization and laterality classification CNNs were evaluated separately on datasets 1 and 2. The evaluation details are provided in Appendix E7 (online). The fracture detection AI algorithm was cross-validated on dataset 3 using 10 folds (no patient overlap between folds) and was tested on dataset 4. The evaluation metrics were sensitivity, specificity, positive predictive value (PPV), Cohen’s kappa coefficient (κ), area under the receiver operating characteristic curve (AUC), mean precision in localizing the fracture locations per scaphoid (“mean localization precision” [MLP]), and reading time (in seconds). The detection threshold that maximized the F1-score was chosen for the analysis. The fracture scores of the algorithm were based on automatically generated image crops and laterality labels. The radiographic inputs from datasets 3 and 4 were grouped by study and by patient, respectively, and the corresponding scores were grouped by hand. Cohen’s κ was used to measure the agreement between the algorithm and the radiologists.
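As an illustration of how the F1-maximizing threshold and the case-level metrics mentioned above can be computed with scikit-learn, consider the following sketch; the toy labels and scores are placeholders, and the snippet is not the study's actual evaluation code.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score, confusion_matrix

# Toy per-case labels and maximum fracture scores per hand (placeholders).
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

# Choose the detection threshold that maximizes the F1-score.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
threshold = thresholds[np.argmax(f1)]

y_pred = (y_score >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
auc = roc_auc_score(y_true, y_score)
print(f"threshold={threshold:.2f} sens={sensitivity:.2f} spec={specificity:.2f} "
      f"ppv={ppv:.2f} auc={auc:.2f}")
```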

The evaluation metrics were calculated using the scikit-learn machine learning library (version 0.23.2, 2021) [20] for Python. Stratified bootstrapping with 1000 iterations was used to estimate 95% confidence intervals (CIs). Significance testing was performed with two-sided paired permutation tests (1000 iterations) using the MLxtend library (version 0.19.0, 2021) [21] for Python. A p value smaller than .05 was considered to indicate a significant difference.
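The study used MLxtend for the permutation tests; purely as an illustration of the logic of the two resampling procedures, the sketch below re-implements stratified bootstrapping of an AUC confidence interval and a two-sided paired permutation test directly in NumPy. Function names and parameters are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def bootstrap_auc_ci(y_true, y_score, n_iter=1000, alpha=0.05):
    """Stratified bootstrap CI for the AUC: resample fracture and
    non-fracture cases separately so the class balance is preserved."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    pos, neg = np.where(y_true == 1)[0], np.where(y_true == 0)[0]
    stats = []
    for _ in range(n_iter):
        idx = np.concatenate([rng.choice(pos, len(pos)), rng.choice(neg, len(neg))])
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

def paired_permutation_test(x, y, n_iter=1000):
    """Two-sided paired permutation test: randomly sign-flip the per-case
    differences and compare the resulting mean differences with the observed one."""
    d = np.asarray(x, float) - np.asarray(y, float)
    observed = abs(d.mean())
    signs = rng.choice([-1, 1], size=(n_iter, d.size))
    null = np.abs((signs * d).mean(axis=1))
    return (np.sum(null >= observed) + 1) / (n_iter + 1)
```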

Results

Test dataset characteristics

From the initial sample of 292 studies selected for the test dataset (dataset 4), 20 studies were excluded because of radiographs unsuitable for fracture diagnosis (n = 7), inconclusive evidence (n = 7), or non-acute fractures (n = 6) (see Fig. 1). This resulted in a final selection of 272 studies from 209 patients (mean age, 39 years ± 23 [standard deviation]; 107 women). The studies were grouped by patient and hand into 219 cases, of which 65 contained a scaphoid fracture (see Table 1). All cases included at least one PA view, and at least one ulnar-deviated PA, oblique, and lateral view was available in 55 cases (30 with fracture), 159 cases (52 with fracture), and 156 cases (63 with fracture), respectively.

Fracture detection by AI

Quantitative and qualitative analyses of the scaphoid localization and laterality classification results are included in Appendix E8 (online). The scaphoid fracture detection AI algorithm obtained an AUC of 0.89 (95% CI: 0.87, 0.91) on dataset 3. The corresponding ROC curve and additional evaluation metrics are included in Appendix E9 (online). Table 2 presents the sensitivity, specificity, PPV, MLP, and AUC with their 95% CIs of the scaphoid fracture detection AI algorithm for multiple input configurations on the test dataset (dataset 4). The results are presented for all views and for each combination of PA views with one of the following views: ulnar-deviated PA, oblique, and lateral. The ROC curve with operating points and the MLP curve (with 95% CI bands) for all views are shown in Fig. 3a. The ROC curve of each input configuration is shown in Fig. 3b. The fracture detection performance of the algorithm increased when PA views were supplemented with ulnar-deviated PA views (AUC, 0.79 vs. 0.84, p = .002), oblique views (AUC, 0.79 vs. 0.85, p = .02), or all available views (AUC, 0.79 vs. 0.88, p = .01), but not with lateral views (AUC, 0.79 vs. 0.83, p = .12). The median processing time per case (all inputs) and per radiograph was 0.97 s and 0.28 s, respectively.

Table 2 Scaphoid fracture detection results of the AI
Fig. 3
figure 3

a Receiver operating characteristic (ROC) curve (blue) with operating point of the automated scaphoid fracture detection results based on all available radiographic views from dataset 4 (65 fracture cases, 154 non-fracture cases; each case represents one hand from one patient). The corresponding mean localization precision curve (orange) is shown as well. The shaded bands represent 95% confidence intervals. The black line represents no ability to discriminate between fracture and non-fracture cases. b Receiver operating characteristic (ROC) curves of the automated scaphoid fracture detection results for multiple input configurations on the same dataset. All 219 cases included at least one posterior-anterior (PA) view (65 cases with fracture), 55 cases included at least one ulnar-deviated PA view (30 cases with fracture), 159 cases included at least one oblique view (52 cases with fracture), and 156 cases included at least one lateral view (63 with fracture)

Radiologist performance in scaphoid fracture detection with and without AI assistance

Table 3 presents the sensitivity, specificity, PPV, MLP, AUC, and median reading time with their 95% CIs and p values of the five MSK radiologists for scaphoid fracture detection with and without AI assistance. The corresponding ROC curves are shown in Fig. 4a and b. The ROC curves with operating points and the MLP per radiologist are shown in Appendix E10 (online). With AI assistance, three radiologists obtained a higher specificity (Rad2, 94% vs. 84%, p < .001; Rad3, 97% vs. 88%, p = .003; Rad5, 90% vs. 81%, p = .03), three radiologists obtained a higher PPV (Rad2, 83% vs. 66%, p < .001; Rad3, 91% vs. 74%, p = .006; Rad5, 77% vs. 65%, p = .04), one radiologist obtained a lower AUC (Rad3, 0.81 vs. 0.91, p = .002), and four radiologists had a shorter reading time (Rad2, 27 vs. 16 s, p < .001; Rad3, 21 vs. 11 s, p < .001; Rad4, 13 vs. 6 s, p < .001; Rad5, 35 vs. 14 s, p < .001). In all other cases, AI assistance had no significant effect on the sensitivity, specificity, PPV, MLP, AUC, or reading time.

Table 3 Scaphoid fracture detection results of the radiologists
Fig. 4
figure 4

a Receiver operating characteristic (ROC) curves of the results of the scaphoid fracture detection algorithm and those of the musculoskeletal (MSK) radiologists without artificial intelligence (AI) assistance on dataset 4 (65 fracture cases, 154 non-fracture cases; each case represents one hand from one patient). b ROC curves of the results of the scaphoid fracture detection algorithm and those of the MSK radiologists with AI assistance on the same dataset. The corner of each plot is magnified for easier comparison of the curves. The black line represents no ability to discriminate between fracture and non-fracture cases.

Table 4 shows the Cohen’s kappa coefficients with their 95% CIs and p values for the fracture detection agreement between the radiologists (with and without AI assistance) and between the radiologists and the AI algorithm. Overall, the radiologists showed moderate to substantial agreement with each other (range without AI assistance: 0.50–0.71; range with AI assistance: 0.62–0.79). With AI assistance, five out of the ten pairs of radiologists had a higher agreement (Rad1-Rad3, 0.56 vs. 0.79, p = .002; Rad2-Rad3, 0.50 vs. 0.71, p = .006; Rad2-Rad4, 0.58 vs. 0.74, p = .02; Rad3-Rad4, 0.53 vs. 0.74, p = .003; Rad3-Rad5, 0.52 vs. 0.68, p = .03), whereas the agreement between all other pairs of radiologists remained unchanged. With AI assistance, two out of the five radiologists had a higher agreement with the algorithm (Rad3, 0.56 vs. 0.72, p = .02; Rad4, 0.62 vs. 0.80, p < .001), whereas the agreement of the other radiologists with the algorithm remained unchanged. The proportions of correctly and incorrectly changed fracture diagnoses by the radiologists with AI assistance and their correlation with the automated fracture scores are shown in Appendix E11 (online).

Table 4 Scaphoid fracture detection agreement between radiologists and AI in terms of Cohen’s kappa

Comparison of AI performance with experienced MSK radiologists

The AI algorithm and the unassisted MSK radiologists had a similar performance in detecting scaphoid fractures (AUC, 0.88 vs. 0.87 [average of radiologists, range: 0.84–0.91], see Tables 2 and 3; Rad1, p = .89; Rad2, p = .32; Rad3, p = .32; Rad4, p = .90; Rad5, p = .46). Two radiologists had a higher MLP than the algorithm (Rad2, 97% vs. 87%, p = .01; Rad5, 94% vs. 87%, p = .048), whereas there was no difference in MLP between the other radiologists and the algorithm (92% [average of radiologists, range: 91–94%] vs. 87%, see Tables 2 and 3; Rad1, p = .11; Rad3, p = .21; Rad4, p = .24).

A follow-up analysis of the decisions made by the algorithm and the MSK radiologists revealed that six out of the 29 mistakes of the algorithm (three fracture cases, three non-fracture cases) were not made by any of the radiologists. Conversely, 12 out of the 23 mistakes made by the majority of radiologists (three fracture cases, nine non-fracture cases) were not made by the algorithm. The failure cases of the algorithm and the radiologists are shown in Figs. 5 and 6, respectively.

Fig. 5
figure 5

False positive (FP) and false negative (FN) detections made by the scaphoid fracture detection artificial intelligence (AI) algorithm that none of the five musculoskeletal radiologists made. The AI fracture score per scaphoid region (ranging from 0 [no fracture] to 1 [fracture], rounded to two decimals) is shown below each image. The yellow arrows indicate the fracture locations and are only shown for reference. False positive case descriptions (from top to bottom): 13-year-old male and 77-year-old female with an intact scaphoid, 81-year-old male with rheumatoid arthritis with an old healed waist scaphoid fracture. False negative case descriptions (from top to bottom): 67-year-old female with a slightly displaced transverse waist scaphoid fracture (transition middle one-third to distal one-third), 37-year-old male with a waist oblique scaphoid fracture (transition proximal one-third to middle one-third), 77-year-old female with a waist scaphoid fracture (middle one-third)

Fig. 6
figure 6

False positive (FP) and false negative (FN) detections made by the majority of the five musculoskeletal radiologists that the artificial intelligence (AI) algorithm did not make. The proportion of radiologists making the FP or FN detection is shown in the upper right corner of each image. The corresponding fracture scores per scaphoid region of the radiologists (mean score of responsible radiologists per region) and AI (ranging from 0 [no fracture] to 1 [fracture], rounded to two decimals) are shown below each image. The yellow arrows indicate the fracture locations and are only shown for reference. Case descriptions first row (left to right): 27-year-old female, 12-year-old male, and 50-year-old female with an intact scaphoid. Case descriptions second row (left to right): 74-year-old female and 59-year-old female with an intact scaphoid, 79-year-old female with calcium pyrophosphate deposition arthritis with calcifications surrounding the triangular fibrocartilage complex. Case descriptions third row (left to right): 69-year-old female with osteophyte and subchondral cyst formation, 45-year-old female with an intact scaphoid, 74-year-old male with radiocarpal and scapho-trapezium/trapezoid joint arthritis. Case descriptions fourth row (left to right): 23-year-old male with a waist scaphoid fracture (middle one-third), 60-year-old female with a waist scaphoid fracture (distal one-third), 12-year-old female with a waist scaphoid fracture (middle one-third)

Discussion

Patients with a clinically suspected scaphoid fracture often receive unnecessary wrist immobilization as a precaution, because acute scaphoid fractures can cause severe damage to the wrist if they remain undetected and untreated. This study assessed how a CNN-based AI algorithm performed against five MSK radiologists and whether it aided the diagnosis of scaphoid fractures on conventional multi-view radiographs. The algorithm detected scaphoid fractures as well as the MSK radiologists (AUC, 0.88 vs. 0.87 [average of radiologists]; p ≥ .05 for all radiologists). Moreover, it indicated which regions of the scaphoid were fractured as precisely as the majority of the radiologists (MLP, 87% vs. 92% [average of radiologists], p ≥ .05 for the majority). Furthermore, AI assistance improved the inter-observer agreement (Cohen’s κ) of five out of the ten radiologist pairs (average increase of 36.2%, p < .05 for each of these pairs) and reduced the reading time of four radiologists (average reduction of 49.4%, p < .001 for each of these radiologists), but no improvements were found in sensitivity, specificity, PPV, or AUC for the majority of the radiologists.

The results showed that the scaphoid fracture detection performance of the AI algorithm improved when PA views were supplemented with ulnar-deviated PA and oblique views, whereas adding lateral views did not lead to a performance increase. These findings underline the conclusions of Cheung et al [22] that the PA, ulnar-deviated PA, and oblique views are the most important for scaphoid fracture detection, and indicate that this also applies to deep learning–based AI algorithms. This implies that a multi-view approach to scaphoid fracture detection should be adopted in future AI applications.

A qualitative analysis of the failure cases revealed that the algorithm made six mistakes (three fracture cases, three non-fracture cases) that none of the MSK radiologists made. The false positive detections were likely caused by overprojection lines of the other carpal bones on the scaphoid on the lateral view and by a sclerotic line from an old healed fracture. The false negative cases included two scaphoids with an evident but not sharply delineated waist fracture. The latter finding is in line with the observations of Hendrix et al [12] and Langerhuizen et al [23] that deep learning–based AI algorithms may miss fractures that are evident to human observers.

There were 12 mistakes made by the majority of the MSK radiologists (three fracture cases, nine non-fracture cases) that were not made by the algorithm. In most of the false positive cases (5/9), the scaphoid and its surrounding joints showed degenerative signs or slight deformities, which could suggest a fracture even when no hypodense line was visible. The remaining false positive detections were caused by very subtle or diffuse hypodense lines. The false negative detections were made in scaphoids with a displaced fracture that caused a subtle protrusion of the cortical bone without an evident fracture line. Similar false negative detections made by radiologists but not by an AI algorithm were observed by Hendrix et al [12], but qualitative analyses of false positive detections are lacking in previous studies. While these findings suggest that the algorithm may have merit in aiding the interpretation of degenerative or malformed scaphoids for fracture diagnosis, follow-up studies are required to confirm this.

The observed positive effects of AI assistance on the inter-observer agreement and reading time of the radiologists provide preliminary evidence that the algorithm could improve the diagnostic efficiency of MSK radiologists while maintaining the same diagnostic accuracy. However, one radiologist had a significantly lower sensitivity and AUC in the AI assistance condition, although the incorrectly changed answers were only weakly correlated with the answers from the algorithm (0.34). The decreases in reading time are in line with the conclusions of Duron et al [14] and Guermazi et al [15], but we did not find any increases in sensitivity or consistent increases in specificity. This difference could be due to the small number of carpal fractures in their test datasets. Furthermore, even though it was not investigated in this study, general radiologists, radiology residents, and other physicians may be expected to benefit more from AI assistance.

The strengths of this study included the use of multi-center clinical data and external validation, the participation of five experienced MSK radiologists, and the evaluation of an automatic multi-view scaphoid fracture detection AI algorithm. However, this study also had some limitations. First, we aimed to minimize selection bias in our test set by using all available information in the PACS and EHR system rather than using only studies with a follow-up CT or MRI scan. This means that occult scaphoid fractures might have been labelled as negative cases when the patient did not return to the hospital with persistent symptoms. This trade-off between reference standard quality and selection bias could be circumvented by conducting a prospective study in which patients immediately undergo CT and MRI after an initial examination with conventional radiography, but acquiring sufficient data in this way would be too time intensive and costly.

Second, even though we investigated the contribution of each radiographic view to automated scaphoid fracture detection, we simplified the model architecture by processing AP/PA, ulnar-deviated AP/PA, and oblique view radiographs with a single CNN. The model performance might be further improved in future research by training a separate CNN for each view.

In conclusion, the findings presented in this study support the hypothesis that an AI algorithm can achieve MSK radiologist–level performance in detecting scaphoid fractures on conventional multi-view radiographs. Moreover, there is preliminary evidence that AI assistance could improve the diagnostic efficiency of MSK radiologists, but not their diagnostic accuracy. Future research should evaluate the impact of AI assistance on diagnostic performance, clinical decision-making, and patient outcomes in a randomized clinical trial involving both radiologists and non-radiologists.