Introduction

Scaphoid fractures are the most common carpal bone fractures, accounting for 82–89% of all carpal fractures [1]. Early detection and immobilization are crucial for the successful management of scaphoid fractures [2,3,4], since a neglected or untreated scaphoid fracture may lead to complications such as avascular necrosis, carpal instability, and early osteoarthritis [4,5,6,7,8].

X-rays are commonly used to identify scaphoid fractures because of their accessibility and cost-effectiveness. However, the sensitivity of wrist radiographs for diagnosing scaphoid fracture is relatively low (66–81%), and up to 50% of scaphoid fractures may be occult [9,10,11,12,13]. Interpreting wrist radiographs for scaphoid fracture is challenging and is a frequent cause of delayed or incorrect diagnosis in emergency departments. Radiographs are often evaluated by non-specialized physicians or junior orthopedic residents, potentially affecting accuracy and leading to missed fractures [14]. Misdiagnosis is more frequent during evening and overnight hours, possibly because of less experienced personnel or fatigue, compounded by the relatively constant shortage of emergency radiologists available to handle the workload [15].

While advanced imaging modalities such as CT and MRI are more sensitive than radiographs for detecting occult scaphoid fractures, they carry inherent disadvantages: increased radiation exposure with CT and high medical cost with MRI [2, 16, 17].

Therefore, improving the sensitivity and detection rate of scaphoid fractures on plain radiographs offers both clinical and economic advantages. Convolutional Neural Networks (CNNs) are prominent artificial intelligence (AI) deep learning algorithms for image analysis [18, 19]. CNNs are specifically designed to handle image data by exploiting repeating local patterns. They have been applied to a wide range of medical image-processing tasks, including radiology images, skin lesions, retinal scans, endoscopic images, and histopathologic specimens [19,20,21,22,23,24,25,26,27,28,29].

AI algorithms have proven to be effective in identifying various acute pathologies that are frequently encountered in emergency departments, for example, pulmonary embolism, intra-abdominal free gas, intra-cranial hemorrhage, and femoral neck fractures [15]. Indeed, AI methods have been explored for improving sensitivity in the detection of scaphoid fractures on wrist radiographs.

The aim of this systematic review and meta-analysis is to comprehensively evaluate the existing data on the use of AI for detecting scaphoid fractures on wrist radiographs. The review analyzes the usefulness and accuracy of current AI systems in this field and explores their potential future applications.

Materials and methods

The systematic review followed the guidelines of Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) and PRISMA-Diagnostic Test Accuracy [30, 31]. A literature search was conducted in the PubMed database for relevant studies published until July 2023, using search terms: (“Scaphoid Bone”[Mesh] or “Wrist”[Mesh] or “scaphoid” or “scaphoid fracture” or “Wrist Fractures”[Mesh]) and (“deep learning”[Mesh] or “artificial intelligence”[Mesh] or “convolutional neural network” or “CNN” or “Neural Networks, Computer”[Mesh]). The bibliographies of the included studies were also searched for additional relevant studies.

Inclusion criteria for studies were (1) use of AI methods for scaphoid fracture diagnosis on wrist radiographs, (2) reporting of the area under the receiver operating characteristic (ROC) curve and/or accuracy, (3) publication in English, and (4) original articles.

Data from all included studies were collected using a standardized data extraction sheet, including publication year, journal name and affiliation, study design, study period, number and views of wrist radiographs, number of patients and images, fracture type (visible or occult), reference standard for diagnosis, number of pipeline steps, AI name, CNN architecture, ROI labeling, data input proportion, diagnostic accuracy/AUC, sensitivity, specificity, and true/false positives/negatives. The Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool was used to assess bias and applicability [32].

Statistical analysis

The results of the systematic review and meta-analysis on AI for the detection of scaphoid fractures on wrist radiographs were analyzed on the basis of the 2 × 2 contingency table of true positives, false negatives, true negatives, and false positives constructed for each study.
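For reference, the standard performance measures derived from this 2 × 2 table are:

$$
\text{sensitivity} = \frac{TP}{TP+FN},\qquad
\text{specificity} = \frac{TN}{TN+FP},\qquad
\text{accuracy} = \frac{TP+TN}{TP+TN+FP+FN}
$$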

A bivariate diagnostic random-effects meta-analysis was conducted, and the findings were summarized with the Summary Receiver Operating Characteristic (SROC) curve [33]. The SROC curve combines the sensitivity and specificity values of each independent study while accounting for their different cut-off points. The area under the curve (AUC) was calculated for the fitted SROC. The AUC is an overall summary of diagnostic accuracy, commonly reported for AI classification models. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity); the higher the AUC, the better the model distinguishes between scaphoid fracture and non-fracture plain radiographs. AUC values between 0.9 and 1 are considered excellent, values between 0.8 and 0.9 good, and values between 0.7 and 0.8 fair; an AUC below 0.7 is considered poor. Accuracy is defined as the number of correct predictions divided by the total number of predictions, i.e., (true positives + true negatives) / total dataset. Across-study heterogeneity was assessed using the I² statistic. Analyses were performed with R 4.0.5 and the “mada” and “meta” packages.
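As a minimal sketch of this workflow, assuming a data frame of per-study 2 × 2 counts (the numbers below are hypothetical placeholders, not data from the included studies), the “mada” package can fit the bivariate model, plot the SROC curve, and draw the forest plots:

```r
library(mada)  # diagnostic meta-analysis: bivariate (Reitsma) model, SROC, forest plots

# Hypothetical per-study 2x2 counts (placeholders, not the included studies' data)
dat <- data.frame(
  TP = c(28, 41, 33),  # true positives
  FN = c( 7,  9,  8),  # false negatives
  FP = c( 5,  6,  4),  # false positives
  TN = c(30, 44, 35)   # true negatives
)

fit <- reitsma(dat)  # bivariate diagnostic random-effects model
summary(fit)         # pooled sensitivity/specificity and AUC of the fitted SROC

plot(fit, sroclwd = 2, main = "SROC curve")  # SROC with summary point and confidence region

forest(madad(dat), type = "sens")  # forest plot of per-study sensitivities
forest(madad(dat), type = "spec")  # forest plot of per-study specificities
```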

Results

Included studies and dataset sizes

The bibliographic search retrieved 303 studies, of which nine studies met the inclusion criteria and one more was retrieved by manual searching [2, 16, 34,35,36,37,38,39,40,41]. A flow diagram of the study selection process is presented in Fig. 1, and the characteristics of the included studies are summarized in Table 1. The studies were all retrospective and published between 2020 and 2023.

Fig. 1

Flow diagram of the search and inclusion process. The study was reported according to the Preferred Reporting Items for Systematic Reviews and Meta-analyses guidelines

Table 1 Included study characteristics

The ten included studies varied in sample size, with the number of images used ranging from 356 to 11,838. Table S1 (Supplementary Material) provides a summary of the full numerical data of the studies, including the reference standard (“ground truth”) for diagnosis.

Brief summary of the included studies

Langerhuizen et al [34] developed a single-step CNN model consisting only of the scaphoid fracture detection step; scaphoid segmentation was done manually. The model showed fair diagnostic accuracy, with an AUC of 0.77, and its performance was compared with that of five orthopedic surgeons. The clinicians had better specificity, but the CNN performed similarly in terms of accuracy and sensitivity. However, the algorithm struggled to identify some obvious fractures and produced some false positive suggestions.

Ozkaya et al [16] also developed a single-step fracture detection CNN, with scaphoid segmentation done manually. The CNN’s diagnostic performance was compared with that of an emergency department physician and two hand surgeons. The CNN had a good AUC of 0.84; its performance was comparable to that of the less experienced orthopedic specialist and outperformed the emergency department physician.

Yoon et al [35] developed a 3-step CNN model. First, the scaphoids were localized by a segmentation CNN. Two CNNs were then applied consecutively: one to detect scaphoid fractures, followed by another that re-examined the negative cases of the first in order to detect occult fractures. The fracture detection CNN achieved excellent diagnostic performance, with an AUC of 0.955, and the occult fracture detection CNN had a good AUC of 0.81. The full model correctly identified 90.9% of occult fractures.

Hendrix et al [36] developed a 2-step CNN model and compared its diagnostic performance with that of eleven radiologists of varying experience. The first CNN localized the scaphoids and automatically passed the region of interest to the fracture detection CNN. The model had a good AUC of 0.87 and performed at a level comparable to that of the radiologists.

Tung et al [37] developed a 2-step CNN model: a scaphoid segmentation CNN followed by a scaphoid fracture detection CNN. Ten different CNNs for fracture detection were used and compared, with AUCs ranging from 0.86 to 0.95.

Yang et al [38] developed a 2-step CNN model, comprising a scaphoid segmentation CNN and a scaphoid fracture detection CNN, which achieved excellent diagnostic performance with an AUC of 0.917.

Li et al [2] developed a 2-step CNN model (scaphoid segmentation and scaphoid fracture detection) and compared its diagnostic performance with that of four hand surgeons. The fracture detection CNN achieved excellent diagnostic performance, with an AUC of 0.92. The authors concluded that the CNN was comparable to the majority decision of the surgeons and has the potential to achieve expert-level performance.

Hendrix et al [39] developed a 2-step CNN model, comprising a combined scaphoid segmentation and laterality classification CNN followed by a scaphoid fracture detection CNN, and assessed the diagnostic performance of the AI system. They also conducted an observer study to clinically validate the system, comparing the diagnostic performance of five musculoskeletal-expert radiologists with and without AI assistance. The CNN had a good AUC of 0.88, comparable to the average performance of the five radiologists (AUC 0.87). AI assistance improved inter-radiologist agreement in 5/10 pairs and reduced reading time for 4/10 radiologists. However, the algorithm did not improve the diagnostic performance of the majority of radiologists.

Cohen et al [40] used a commercially available AI system and assessed its diagnostic performance in detecting different wrist fractures, including scaphoid fractures. They also conducted an observer study to clinically validate the system, using initial radiology reports (IRR) made by 41 radiologists of varying experience, none specialized in musculoskeletal imaging. The IRRs, the AI results alone, and the combination of the two were then compared. An IRR + AI observation was considered positive when a fracture was detected by either the AI or the IRR, regardless of the other’s result. The sensitivity for scaphoid fracture detection was similar for the AI (84%) and the IRR (80%). Analysis of the full cohort, i.e., all fracture locations, showed that the IRR + AI combination had greater sensitivity than either AI alone or IRR alone.

Lee et al [41] developed an AI model to detect three common types of wrist fracture: distal radius, ulnar styloid process, and scaphoid fractures. The model consisted of two CNNs, for scaphoid segmentation and scaphoid fracture detection, which operated simultaneously and were integrated into a final assessment. Two novice radiologists also diagnosed the fracture sites, with and without the assistance of the AI model. The AI model had a good AUC of 0.808 for scaphoid fracture detection. When the novice radiologists were assisted by the AI model, their AUCs for detecting scaphoid fractures increased significantly, from 0.75 to 0.85 and from 0.71 to 0.80.

Radiograph characteristics

Two studies [34, 39] used a full wrist series of four projections (anterior–posterior (AP), posterior–anterior (PA), lateral, and oblique; or AP/PA, ulnar-deviated AP/PA, lateral, and oblique). Two studies [2, 35] used two projections (PA or scaphoid view), one study [41] used three projections (AP, lateral, oblique), one study [36] used several hand, wrist, and scaphoid projections, one study [37] used two projections (AP and lateral), and one study [16] used one projection (AP). Two studies [38, 40] did not specify the projections used. Exclusion criteria were specified in seven studies [2, 34,35,36, 39,40,41] and included poor radiographic quality, old fractures, immobilization devices, hardware, chronic hand disorders, arthritis, and tumors. The other three studies [16, 37, 38] did not specify any exclusion criteria.

CNN pipeline structure and AI characteristics

One study [40] used commercially available software, while the other nine studies [2, 16, 34,35,36,37,38,39, 41] developed CNNs using radiographs for AI training and testing.

Data on the AI characteristics are shown in Table S2 (Supplementary Material). Nine different CNN architecture types were used across the developed AI models, along with various augmentation techniques. Eight studies [2, 16, 34,35,36, 38, 39, 41] each used a single CNN architecture, which differed from study to study. One study [37] compared 10 different CNN architecture types.

The most common CNN models among the studies were VGG16 [34, 37], ResNet-50 [16, 37], DenseNet-121 [36, 37], ResNet-152 [37, 38], and Inception-V3 [37, 39], each used in two studies. Three studies [2, 35, 36] used the Grad-CAM function to create a heatmap from the input image and highlight the fracture area. A different heatmap was used in one study [41], and a bounding box highlighting the “zone of interest” was integral to the commercially available AI system used in a single study [40].

Two studies [16, 34] used a 1-step pipeline model with a CNN solely for fracture detection. Scaphoid segmentation was done manually by cropping and resizing the scaphoid to fit a rectangular ROI, which then served as input to the fracture detection CNN.

Six studies [2, 36,37,38,39, 41] used a 2-step pipeline model: a scaphoid segmentation CNN followed by a fracture detection CNN.

An example of such a pipeline is illustrated in Fig. 2 [36].

Fig. 2

© Radiological Society of North America

Example of a scaphoid fracture detection pipeline, consisting of a segmentation Convolutional Neural Network (CNN) and a detection CNN. A class activation map is calculated and visualized as a heatmap for fracture localization. Reproduced with permission: Fig. 1, Hendrix N, Scholten E, Vernhout B, et al. Development and Validation of a Convolutional Neural Network for Automated Detection of Scaphoid Fractures on Conventional Radiographs. Radiology: Artificial Intelligence. Published online April 28, 2021. https://doi.org/10.1148/ryai.2021200260

One study [35] used a 3-step pipeline model: a scaphoid segmentation CNN followed by a fracture detection CNN, with the negative results of the first two steps re-evaluated by a third CNN designed to diagnose occult fractures missed earlier in the pipeline. The entire pipeline was then tested separately on new images. One study used a commercial system in which scaphoid segmentation and fracture detection were integral [40].
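To make the cascade logic concrete, a schematic sketch is shown below; the segmentation and prediction functions are hypothetical stand-ins for the trained CNNs (not the published models), and the thresholds are illustrative:

```r
# Hypothetical stand-ins for trained models (placeholders only):
segment_scaphoid <- function(image) image  # would crop the scaphoid ROI
p_fracture <- function(roi) runif(1)       # would return a fracture probability
p_occult   <- function(roi) runif(1)       # would return an occult-fracture probability

# Schematic 3-step cascade: segmentation -> fracture CNN -> occult-fracture CNN
classify_radiograph <- function(image, thr_fracture = 0.5, thr_occult = 0.5) {
  roi <- segment_scaphoid(image)           # step 1: localize the scaphoid
  if (p_fracture(roi) >= thr_fracture) {   # step 2: apparent fracture?
    return("fracture")
  }
  if (p_occult(roi) >= thr_occult) {       # step 3: re-examine step-2 negatives
    return("occult fracture")
  }
  "no fracture"
}
```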

AI performance for scaphoid fracture detection

Data on AI performance compared with human performance are shown in Table S3 (Supplementary Material). The AI performance for scaphoid fracture detection varied from an AUC of 0.77 to 0.96. Accuracy of fracture detection ranged from 72.0 to 90.3%. Sensitivity and specificity for each CNN are shown in forest plots (Figs. 3 and 4).

Fig. 3

Forest plot of the pooled sensitivity of studies reporting on deep learning detection of scaphoid fractures

Fig. 4

Forest plot of the pooled specificity of studies reporting on deep learning detection of scaphoid fractures

Seven studies reported deep learning algorithm performance, with a combined total of 3373 images. The pooled sensitivity and specificity were 0.80 (95% CI 0.75–0.84) and 0.89 (95% CI 0.82–0.94) respectively (Figs. 3 and 4).

Tung et al [37] compared 10 different CNNs applied to the same dataset and found AUCs ranging from 0.86 to 0.95. The other eight studies [2, 16, 34,35,36, 38, 39, 41] reported AUCs from 0.77 to 0.96.

The AUCs were fair for 1 CNN (VGG16 [34]), good for 8 CNNs (ResNet50 [16], DenseNet121 [36], VGG16 + VGG19 + ResNet152 + DenseNet169 [37], Inception-V3 [39], NasNet [41]), and excellent for 10 CNNs (EfficientNetB3 [35], ResNet50 + ResNet101 + DenseNet121 + DenseNet201 + InceptionV3 + EfficientNetB0 [37], ResNet152 [38], MobileNetV3 [2]).

Table 2 shows that although different authors reported similar results for the same CNN architecture, the results were not identical.

Table 2 Comparison of AUC results in identical CNN architectures according to different studies

Hendrix et al [39] assessed their algorithm’s performance using multiple input configurations of the various X-ray projections, including PA, ulnar-deviated PA, oblique, and lateral. The algorithm’s fracture detection performance improved when PA views were supplemented with ulnar-deviated PA views (AUC, 0.79 to 0.84), oblique views (AUC, 0.79 to 0.85), or all available views (AUC, 0.79 to 0.88). However, there was no significant improvement with the addition of lateral views.

The Summary Receiver Operating Characteristic (SROC) curve plot of deep learning models for the diagnosis of scaphoid fractures on wrist radiographs is presented in Fig. 5. The area under the SROC curve is 0.88, with a pooled sensitivity of 80% and a pooled specificity of 89%.

Fig. 5

Summary Receiver Operating Characteristic (SROC) curve plot of deep learning models for the diagnosis of scaphoid fractures on wrist radiographs. Individual studies are shown as open triangles. The summary point is shown as an open circle, representing the sensitivity and specificity estimates pooled using a bivariate random-effects regression model. The 95% confidence region (Conf. region) displays the 95% confidence interval of the pooled sensitivity and specificity. There was high heterogeneity among the studies, with I² ranging from 80 to 90%

Diagnosis of occult fractures

Langerhuizen et al [34] reported that 5 of 6 occult scaphoid fractures missed by human observers were detected by the AI. In contrast, Ozkaya et al [16] reported the same accuracy for the AI and human observers, with the same 7 occult fractures missed by both. Yoon et al [35] developed two AI models: the first detected both apparent and occult scaphoid fractures, and the second was applied to the fracture-negative images from the first in order to detect occult fractures. The occult fracture model showed an AUC of 0.81, and the entire pipeline correctly identified 90.9% of the occult fractures.

Quality assessment using the QUADAS-2 tool is summarized in Table 3.

Table 3 Methodological analysis of the included studies based on the QUADAS-2 tool

Discussion

The potential role of AI in the detection of radiographic scaphoid fractures is twofold: first, it can help reduce missed scaphoid fractures, and second, it can aid in diagnosing occult scaphoid fractures. The current review and meta-analysis show that the overall diagnostic performance of AI in detecting scaphoid fractures is very good, approaching excellent. AI performance in this field demonstrates promising results in detecting both obvious and occult fractures. While studies comparing AI performance with human controls are scarce, existing data suggest that current AI systems are comparable to the performance level of human experts.

AI features and diagnostic accuracy

The relatively limited number of studies evaluating the function of AI in the detection of scaphoid fracture on wrist radiographs and their recent publication (2020 or later) reflect the fact that this field is new, active, and evolving. CNN systems have evolved from the 1-step pipeline models, reliant on manual scaphoid segmentation, to 2-step pipeline models with integral CNN-based scaphoid segmentation [2, 16, 34,35,36,37,38,39,40,41].

The preliminary results show great promise: 10/19 and 8/19 CNNs achieved excellent and good AUC performance, respectively. That is, these models accurately detected scaphoid fractures on plain radiographs after training on relatively small datasets. The differing scores of the CNNs may stem from differences in layer architecture and augmentation techniques. Differences between the same CNN across studies may arise from heterogeneous input data, such as the number of images, the proportions of the training/validation/test subsets, and image quality. The radiographic projections used in the AI systems also vary among studies, leading to diverse training and testing cohorts and different results.

A recent review [42] of 14 studies evaluating the diagnostic performance of AI for detecting femoral neck fractures reported a mean AUC of 0.969. This is somewhat higher than the current pooled SROC estimate, but plausible given the lower prevalence (2–10%) of occult proximal femur fractures [43] compared with scaphoid fractures.

CNN performance compared to human performance

Results of the studies published so far have been promising, with AI systems showing fair to excellent performance as measured by AUC [2, 16, 34,35,36,37,38,39, 41].

Only three studies [16, 34, 36] compared the AI diagnostic performance with human controls, and their results varied. Langerhuizen et al [34] found no significant difference in diagnostic accuracy between the CNN and five orthopedic surgeons. Similarly, Hendrix et al [36] found no significant difference in AUC between the CNN and eleven radiologists. In contrast, Ozkaya et al [16] found that an experienced orthopedic specialist performed better, with a higher AUC than the CNN.

Overall, the pooled sensitivity and specificity of the AI systems were 0.80 and 0.89, respectively, suggesting promising diagnostic performance. However, the human comparison groups and the composition of the datasets in the various studies are highly heterogeneous, making it difficult to draw clear conclusions about the relative performance of AI and human observers.

Three studies [39,40,41] conducted clinical validation studies comparing human and AI diagnostic performance, shedding light on the potential benefits of utilizing AI in these contexts. Hendrix et al [39] found that while the AI system reduced reading time for certain radiologists, it did not improve diagnostic performance for the majority of them. Cohen et al [40], on the other hand, reported that implementing the AI system increased sensitivity in fracture detection. Furthermore, Lee et al [41] demonstrated that the AI system significantly enhanced the diagnostic performance of two novice radiologists. Collectively, these studies highlight the potential advantages of integrating AI into the diagnostic workflow: shorter reading times and higher sensitivity for some radiologists, and substantial gains in diagnostic capability for less experienced professionals.

The Grad-CAM function was designed to integrate the AI output with existing medical systems and allow clinicians to evaluate it directly by highlighting the area of suspected fracture as a heatmap. This function, other heatmaps, or any feature delineating a “zone of interest” may help direct clinical suspicion to an occult fracture, facilitating early diagnosis and treatment. Indeed, Grad-CAM or similar features were used in five studies to highlight the area of the scaphoid most likely to be fractured according to the CNN [2, 35, 36, 40, 41]. The same function was previously used in AI systems developed to diagnose femoral neck fractures [42]. Cheng et al [44] reported that integrating this technology into the clinical workflow is feasible and improves the diagnostic accuracy of physicians, especially novice clinicians. Sato et al [45] reported that using Grad-CAM improved the accuracy, sensitivity, and specificity of resident-level physicians in the diagnosis of hip fractures.
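For context, the published Grad-CAM formulation weights each feature map $A^k$ of the final convolutional layer by the global-average-pooled gradient of the class score $y^c$ and keeps only the positive contributions:

$$
\alpha_k^c = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial y^c}{\partial A_{ij}^k},
\qquad
L^c_{\text{Grad-CAM}} = \operatorname{ReLU}\Big(\sum_k \alpha_k^c A^k\Big)
$$

where $Z$ is the number of spatial locations in the feature map; the resulting map is upsampled to the input resolution and overlaid on the radiograph as the heatmap.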

There are several limitations to the current review. The literature search was restricted to the English language. The available datasets for training CNNs are relatively small and heterogeneous and do not fully represent the complexity of real-life clinical cases. Furthermore, the CNN architectures differed between the AI systems, which limits the ability to compare them. Additionally, none of the studies evaluated their algorithms in a clinical setting, where factors such as low-quality images or the presence of casts may affect diagnosis. Moreover, few studies compared the diagnostic performance of the AI systems with that of clinicians or radiologists, and only the most recent studies performed clinical validation of the AI systems.

Application of the QUADAS-2 tool revealed a high risk of bias and concerns about applicability in 9 of the 10 included studies.

These limitations highlight the need for further research and development in this field, including testing the algorithms in more diverse and representative image sets and evaluating their performance in real-life clinical settings.

In the current literature, there is a paucity of data on the use of AI with other modalities, such as CT or MRI, for detecting scaphoid fractures. This may be because these modalities are typically deployed as confirmatory tools for suspected occult fractures, rather than as the initial or most commonly used modality for diagnosing and managing scaphoid fractures. Hence, AI development and research have been more pronounced in the context of X-ray imaging.

In conclusion, the current diagnostic performance of AI for detecting scaphoid fractures on wrist radiographs is promising, with high pooled sensitivity and specificity and a high SROC estimate. AI systems cannot yet replace the human role in scaphoid fracture detection, but they may complement and augment the diagnostic performance of physicians. For novice healthcare practitioners, AI systems can improve the fracture detection rate, especially in cases of occult fractures. For more experienced clinicians, AI systems may serve as a powerful diagnostic aid, particularly when used in conjunction with heatmaps. Further research is required to compare AI and human diagnostic performance in the clinical setting.

Still, existing AI systems can already benefit non-expert clinicians in diagnosing both obvious and occult fractures, and can aid experts in guiding management when used in conjunction with heatmaps.