RETRACTED ARTICLE: Diagnostic power of ChatGPT 4 in distal radius fracture detection through wrist radiographs

Distal radius fractures rank among the most prevalent fractures in humans, necessitating accurate radiological imaging and interpretation for optimal diagnosis and treatment. In addition to human radiologists, artificial intelligence systems are increasingly employed for radiological assessments. Since 2023, ChatGPT 4 has offered image analysis capabilities, which can also be used for the analysis of wrist radiographs. This study evaluates the diagnostic power of ChatGPT 4 in identifying distal radius fractures, comparing it with a board-certified radiologist, a hand surgery resident, a medical student, and the well-established AI Gleamer BoneView™. Results demonstrate ChatGPT 4’s good diagnostic accuracy (sensitivity 0.88, specificity 0.98, diagnostic power (AUC) 0.93), significantly surpassing the medical student (sensitivity 0.98, specificity 0.72, diagnostic power (AUC) 0.85; p = 0.04). Nevertheless, the diagnostic power of ChatGPT 4 lags behind the hand surgery resident (sensitivity 0.99, specificity 0.98, diagnostic power (AUC) 0.985; p = 0.014) and Gleamer BoneView™ (sensitivity 1.00, specificity 0.98, diagnostic power (AUC) 0.99; p = 0.006). This study highlights the utility and potential applications of artificial intelligence in modern medicine, emphasizing ChatGPT 4 as a valuable tool for enhancing diagnostic capabilities in the field of medical imaging.
Supplementary Information: The online version contains supplementary material available at 10.1007/s00402-024-05298-2.


Introduction
In 2019, fractures of the distal radius ranked third in Germany with a total of 72,087 cases, surpassed only by femoral neck and femoral pertrochanteric fractures. Unlike femoral fractures, which are 20 times more common in populations over 70 years of age than in those under 70, radius fractures often also affect younger populations [1]. Distal radius injuries in older individuals are typically caused by low-energy trauma. In contrast, younger individuals tend to experience higher energy trauma [2]. In the clinical management of these conditions, treatment approaches may involve surgical intervention for complicated and displaced fractures or non-operative methods for simple and non-displaced fractures. The indications for

(LLM), GleamerAI cannot translate radiological image information into language or a precise classification system. ChatGPT, an LLM-based chatbot from OpenAI that was officially launched in November 2022, can produce language and has already demonstrated its ability to perform medical tasks, such as passing the United States Medical Licensing Examination (USMLE) [8]. A previous study has demonstrated that chatbots utilising ChatGPT 4 technology are capable of producing AO codes from radiological reports. The chatbots created the AO codes significantly faster, but much less accurately [9]. On 25 September 2023, the previously text-based language model ChatGPT 4 received an update for image input and processing. Visual capabilities based on Convolutional Neural Networks (CNNs) were achieved through a training process similar to that used for ChatGPT 4 text processing [10]. First, ChatGPT 4 was trained to predict the next words within a document using textual and visual data sets. Second, the model was refined with additional data, supported by Reinforcement Learning from Human Feedback (RLHF) [11].
This improvement indicates a promising use of ChatGPT 4 in clinical practice to diagnose and classify fractures and to support and supplement clinicians. To assess this, the accuracy and efficiency of ChatGPT 4, GleamerAI, a medical student, radiologists and a physician were compared in the detection of distal radius fractures presented to the Division of Hand, Plastic and Aesthetic Surgery within the LMU University Hospital Munich.

Methods
In the present study, we aimed to examine the diagnostic power of the AI chatbot ChatGPT 4 in the detection of distal radius fractures in wrist X-rays and compare it to the radiological report of a board-certified radiologist, a hand surgery resident, a medical student and Gleamer BoneView™ (Gleamer AI, France), a commercially available AI algorithm for fracture detection in radiographs. For this purpose, we included 150 wrist X-rays, 100 with and 50 without a distal radius fracture, from patients who had received radiographs due to a suspected fracture. The X-ray images were irreversibly anonymised, and a combined image was created from the ap and lateral views (Figs. 1 and 2). Afterwards, the order of the images was randomised for the following examination.
For the radiological evaluation with ChatGPT 4, the radiological images were uploaded one after the other, and the following standardised sequence of consecutive questions was used. If ChatGPT 4 did not answer one of the questions adequately, the question was paraphrased and asked again.
• The following image shows the ap and lateral view of a wrist x-ray of the same person. Can you detect a fracture on the image? Yes or No.
• If the answer was yes: Which bone is broken in the uploaded image?
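The question sequence above can be sketched as a small routine. This is only an illustration under stated assumptions: `ask` is a hypothetical callable standing in for one question to the chatbot about the currently uploaded image, the retry wording and the limit of three attempts are invented for the example, and the study does not describe any scripted procedure.

```python
from typing import Callable, Optional, Tuple

FRACTURE_QUESTION = (
    "The following image shows the ap and lateral view of a wrist x-ray "
    "of the same person. Can you detect a fracture on the image? Yes or No."
)
BONE_QUESTION = "Which bone is broken in the uploaded image?"


def parse_yes_no(answer: str) -> Optional[bool]:
    """Return True/False for a reply starting with yes/no, None otherwise."""
    words = answer.lower().replace(",", " ").replace(".", " ").split()
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None


def read_image(
    ask: Callable[[str], str], max_attempts: int = 3
) -> Tuple[Optional[bool], Optional[str]]:
    """Apply the two-step question sequence to one image.

    `ask` is a hypothetical stand-in for sending one question and
    receiving the text reply; `max_attempts` is an assumed limit.
    """
    fracture = None
    for attempt in range(max_attempts):
        question = FRACTURE_QUESTION
        if attempt > 0:
            # Assumed paraphrase used when the first reply was inadequate.
            question = "Please answer with Yes or No only. " + question
        fracture = parse_yes_no(ask(question))
        if fracture is not None:
            break
    # The second question is only asked after a positive first answer.
    bone = ask(BONE_QUESTION) if fracture else None
    return fracture, bone
```

Keeping the parsing separate from the questioning makes it easy to see which replies counted as inadequate and triggered a repeated question.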
The images were also examined in the same order by a hand surgery resident and a medical student in the clinical training phase regarding the above-mentioned questions. In addition, the images were analysed using the AI software BoneView™. As the software only marks fractures with a square, the marking of the distal radius in the presence of a fracture was evaluated as the correct detection of the fracture and localisation. The radiological reports of a board-certified radiologist were used as the reference.
For statistical analysis of distal radius fracture detection rate, sensitivity and specificity were calculated and receiver operating characteristic analysis was performed.

McNemar's test was performed to analyse the sensitivity and specificity of fracture detection. All data are given as means and standard error of the mean. A p-value < 0.05 was considered statistically significant.
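As an illustration of the paired comparison, the following minimal sketch computes a two-sided exact McNemar p-value from the discordant pairs of two readers who assessed the same images. The text does not state which variant of the test was used, so the exact binomial form is an assumption, and the counts in the usage line are hypothetical rather than study data.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: images reader A classified correctly and reader B incorrectly
    c: images reader A classified incorrectly and reader B correctly
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis the discordant pairs follow
    # Binomial(n, 0.5); sum the lower tail up to the smaller count.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
p_value = mcnemar_exact(1, 11)
print(f"p = {p_value:.4f}")
```

Only the discordant pairs enter the test; images on which both readers were right or both were wrong carry no information about a difference between them.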

Results
A total of 150 wrist radiographs from the Division of Hand, Plastic and Aesthetic Surgery within the LMU University Hospital Munich were included in this study. Among the 100 distal radius fractures, 20 fractures were classified as type A, 4 as type B, and 76 as type C according to the AO classification for distal radius fracture.
The diagnostic power of each group was assessed using a receiver operating characteristic curve of sensitivity and specificity (Fig. 4). The respective area under the curve (AUC) was calculated as 0.93 (0.023) for ChatGPT 4, 0.985 (0.013) for the hand surgery resident, 0.85 (0.040) for the medical student, and 0.99 (0.012) for Gleamer BoneView™. AUC analysis revealed that the hand surgery resident and Gleamer BoneView™ exhibited the highest diagnostic power, with no statistically significant difference between them (p = 0.741). Both demonstrated significantly higher diagnostic power than ChatGPT 4 (p = 0.014 and p = 0.006, respectively) and the medical student (both p < 0.001). The diagnostic power of ChatGPT 4 was significantly higher than that of the medical student (p = 0.04, Table 1).
In summary, ChatGPT 4 demonstrates good diagnostic power in detecting distal radius fractures in wrist radiographs.
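The reported AUC values equal the mean of each reader's sensitivity and specificity, which is the trapezoidal area under a ROC curve with a single operating point. The sketch below reproduces that arithmetic. The true/false positive and negative counts are back-calculated from the reported rates and the cohort sizes (100 fractures, 50 non-fractures); they are an assumption for illustration, not figures stated in the paper.

```python
def reader_metrics(tp: int, fn: int, tn: int, fp: int):
    """Sensitivity, specificity and single-point AUC for a yes/no reader."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # A binary reader yields one ROC point (1 - specificity, sensitivity).
    # Joining (0, 0), that point and (1, 1) with straight lines gives a
    # trapezoidal area equal to the mean of sensitivity and specificity.
    auc = (sensitivity + specificity) / 2
    return sensitivity, specificity, auc


# Counts inferred from the reported rates with 100 fractures and
# 50 non-fractures; an assumption, not data taken from the paper.
READERS = {
    "ChatGPT 4": (88, 12, 49, 1),
    "Hand surgery resident": (99, 1, 49, 1),
    "Medical student": (98, 2, 36, 14),
    "Gleamer BoneView": (100, 0, 49, 1),
}

for name, counts in READERS.items():
    sens, spec, auc = reader_metrics(*counts)
    print(f"{name}: sensitivity={sens:.2f} specificity={spec:.2f} AUC={auc:.3f}")
```

With one operating point per reader, the AUC carries no information beyond sensitivity and specificity; it summarises the two in a single number.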

Discussion
The diagnostic accuracy of ChatGPT 4 was compared with that of a hand surgery resident, a medical student, and the AI algorithm Gleamer BoneView™. The study shows that ChatGPT 4 has lower diagnostic sensitivity than the hand surgery resident and Gleamer BoneView™, but higher specificity than the medical student.
We performed receiver operating characteristic (ROC) curve analysis to quantify the diagnostic power of each observer. The area under the curve (AUC) for ChatGPT 4 was high at 0.93, reflecting good diagnostic capability, although it was lower than the AUC of the hand surgery resident and Gleamer BoneView™. In direct comparison, ChatGPT 4 exhibited significantly higher diagnostic power than the medical student, as demonstrated by their respective AUCs. Previous studies have investigated the use of artificial intelligence systems to improve and aid in diagnosing distal radius fractures by radiologists. Guermazi et al. showed that AI reduced the average reading time per examination by 6.3 s and increased the sensitivity [7]. Kunihiro et al. achieved a good fracture detection rate using a VGG16 model.
Recent studies have shown various applications of ChatGPT in medicine. Applications in radiology include, for example, translating medical reports into plain language to enhance patients' understanding [12][13][14]. It also has the potential to support radiological decision-making [15][16][17][18] and to generate AO codes from radiologists' reports [19]. To the best of our knowledge, there has been no study to date that has analysed

Conclusion
In the current study, we were able to analyse the diagnostic power of ChatGPT 4 and compare it to a hand surgery resident, a medical student and Gleamer BoneView™. ChatGPT 4 has good sensitivity (0.88), specificity (0.98), and diagnostic power assessed through AUC calculation (0.93). Although ChatGPT 4 had a significantly lower diagnostic power than the hand surgery resident and Gleamer BoneView™, it had a significantly higher diagnostic power than the medical student. It should always be considered that ChatGPT was not designed for fracture detection and the image function has only been available for a few months.
Our findings collectively suggest that while ChatGPT 4 presents a valuable tool for distal radius fracture detection, it currently lacks the diagnostic proficiency of hand surgery professionals and advanced imaging technology, such as Gleamer BoneView™. As technology continues to advance, future enhancements to ChatGPT models may further improve their diagnostic capabilities. Our study contributes valuable insights into the evolving landscape of artificial intelligence applications in medical imaging, emphasizing the importance of continued collaboration between technology developers and healthcare professionals to optimise diagnostic outcomes.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
[19]. In 2021, Tobler et al. utilised a deep convolutional neural network (DCNN) to detect and classify distal radius fractures [20]. This study demonstrated the effective use of DCNNs as adjunctive tools for second readings. This work provides a basis for using ChatGPT 4, a CNN-based model, in a similar task. However, models intended for fracture classification were not yet ready for clinical application. In line with previous findings, Zech et al. demonstrated high accuracy in detecting pediatric wrist fractures using an object-detection-based deep learning approach [21].
Our study had limitations. Firstly, the study was retrospective in nature and the radiographs did not include clinical information, which resulted in a lack of important parameters such as pain localisation [28]. Secondly, the training data for ChatGPT 4 is unknown to us. We cannot comment on the size of the dataset that the model was trained on. However, deep learning models perform worse when applied to new data sets and different patients [29]. Therefore, our setting was more difficult for ChatGPT 4, as the fracture images used here were not available for training. In the context of fracture diagnostics, our investigation incorporated 150 wrist radiographs from the Division of Hand, Plastic and Aesthetic Surgery within the LMU University Hospital Munich. The fracture cohort consisted of 100 distal radius fractures, stratified into 20 type A, 4 type B, and a predominant 76 type C fractures as per the AO classification criteria. Other trauma centres report fewer type C fractures and more type A and B fractures [30,31]. Therefore, our population favors higher diagnostic accuracy, as type C fractures are usually easier to detect.

Fig. 1 Combined image of wrist x-rays of a patient with distal radius fracture

Fig. 2 Combined image of wrist x-rays of a patient without distal radius fracture

Table 1 Comparison of the area under the ROC curve (AUC) of ChatGPT 4, hand surgery resident, medical student, and Gleamer BoneV-