Introduction

The emergence and subsequent advancement of large language models (LLMs) such as the GPT series have recently dominated global technology discourse [1]. These models represent a new frontier in artificial intelligence, using machine learning techniques to process and generate language with a complexity and nuance that rival human performance. The rapid evolution and widespread impact of LLMs have become a global phenomenon, prompting discussions on their potential applications and implications [2,3,4,5]. Moreover, chatbots such as Chat Generative Pre-trained Transformer (ChatGPT), which are built on these models, have made LLMs easy to use in a conversational format.

Within the realm of LLMs, the GPT series in particular has gained significant attention, and many applications have been explored within the field of radiology [6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]. Among these, the potential of GPT to assist in diagnosis from image findings is noteworthy [18,19,20], because such capabilities could complement essential aspects of daily clinical practice and education. Two studies show the potential of GPT-4 to generate differential diagnoses in the field of neuroradiology [19, 20]: one uses the “Case of the Week” from the American Journal of Neuroradiology [19], and the other uses “Freiburg Neuropathology Case Conference” cases from the journal Clinical Neuroradiology [20]. Additionally, large language models like GPT-4 have shown differential diagnostic potential in subspecialties beyond neuroradiology [6].

Although these pioneering investigations suggest that GPT-4 could play an important role in radiological diagnosis, no study has evaluated its performance using real-world radiology reports. Unlike quizzes [19, 20], which tend to present carefully curated, typical cases and are created by individuals who already know the correct diagnosis, real-world radiology reports may contain less structured and more diverse information. This difference might lead to biased evaluations that do not reflect the complex nature of clinical radiology [22, 23].

To address this gap, our study examines the diagnostic abilities of GPT-4 using only real-world clinical radiology reports. Because large language models like GPT-4 are most often accessed through chatbots such as ChatGPT, we specifically evaluated the diagnostic capabilities of GPT-4-based ChatGPT to see how effectively it can diagnose medical conditions in real-world clinical settings. We focused on MRI reports of brain tumors because radiological reports play a pivotal role in determining treatment routes such as surgery, medication, or monitoring, and because pathological outcomes offer a definitive ground truth for brain tumors [24]. We compared the performance of GPT-4 with that of neuroradiologists and general radiologists to provide a more comprehensive view. Through this investigation, we aim to uncover the capabilities and potential limitations of GPT-4 as a diagnostic tool in a real-world clinical setting. In daily clinical practice, thinking through differential and final diagnoses can be challenging and time-consuming; if GPT-4 excels at this diagnostic process, it would indicate potential value in clinical scenarios.

Methods

Study design

In this retrospective study, GPT-4-based ChatGPT was presented with imaging findings from our real reports and asked to suggest differential and final diagnoses. For a fair comparison, we also presented the same image findings in text form to radiologists and requested differential diagnoses and a final diagnosis. The protocol of this study was reviewed and approved (approval no. 2023-015) by the Ethical Committee of Osaka Metropolitan University Graduate School of Medicine. This study was conducted in accordance with the Declaration of Helsinki. The requirement for informed consent was waived because the radiology reports had been acquired during daily clinical practice. The design of this study is based on the Standards for Reporting of Diagnostic Accuracy Studies (STARD) guideline [25].

Radiology experts

In this study, three neuroradiologists and four general radiologists were selected. Neuroradiologists were radiologists certified by the Japanese Society of Radiology as specialists in diagnostic imaging, specializing in the central nervous system. General radiologists were defined as radiology residents or radiologists who specialize in areas other than imaging diagnosis. One neuroradiologist and one general radiologist reviewed the collected findings, while the other two neuroradiologists and three general radiologists conducted the reading test.

Data collection

In this study, we consecutively collected brain MRI findings of preoperative brain tumors from radiological reports generated at Osaka Metropolitan University Hospital (Institution A) from January 2021 to December 2021 and at National Hospital Organization Osaka Minami Medical Center (Institution B) from January 2017 to December 2021. These imaging findings were subsequently verified by a neuroradiologist (a board-certified radiologist with 8 years of experience) and a general radiologist (a radiology resident with 4 years of experience). When a diagnosis was described in the imaging findings, it was removed to avoid data leakage. Any descriptions related to previous imaging findings, as well as unrelated image descriptors (such as ‘Image 1’), were also deleted. The report writer (neuroradiologist or general radiologist) was noted.

In- and output procedure for GPT-4-based ChatGPT

All MRI reports were originally written in Japanese and translated into English by a general radiologist (a radiology resident with 4 years of experience). A neuroradiologist (a board-certified radiologist with 8 years of experience) verified that no information was lost in translation. Both radiologists use English in their daily practice. The same prompt was entered before each MRI report, and the reports were processed consecutively in a single conversation. The prompt follows a closed-ended, zero-shot prompting approach. Based on the prompts from previous studies [18, 26], our prompt was modified to specify that the input findings are from head MRI and to request three differential diagnoses ranked in order of likelihood. A neuroradiologist (a board-certified radiologist with 8 years of experience) and a general radiologist (a radiology resident with 4 years of experience) verified that ChatGPT, when given this prompt, returns three differential diagnoses ranked by likelihood. We input the following premise into ChatGPT based on the GPT-4 architecture (May 24 version; OpenAI, California, USA; https://chat.openai.com/): “List three possible differential diagnoses in order of likelihood from the following head MRI findings.” We then input the imaging findings created during clinical practice and received three differential diagnoses from GPT-4. The diagnosis listed highest among the three differential diagnoses was determined to be the final diagnosis. An example of the input to ChatGPT and the output of ChatGPT is shown in Fig. 1. Information about the report writers was not provided to GPT-4.

Fig. 1

Example of the interaction with ChatGPT, showing the input text provided to ChatGPT and the output text generated by ChatGPT. The diagnosis listed highest among the three differential diagnoses was determined to be the final diagnosis
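For readers who wish to reproduce a comparable workflow programmatically rather than through the web interface, the sketch below shows how the same zero-shot prompt could be submitted to a GPT-4 model via the OpenAI Python SDK. This is a minimal illustration under stated assumptions, not the procedure used in the study (which relied on the ChatGPT web interface, May 24 version); the model identifier, the helper function, and the example findings text are placeholders.

```python
# Minimal sketch (not the authors' implementation): submitting the study's
# zero-shot prompt to a GPT-4 model through the OpenAI Python SDK (v1.x).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "List three possible differential diagnoses in order of likelihood "
    "from the following head MRI findings."
)

def rank_differentials(findings_text: str) -> str:
    """Return the model's ranked differential diagnoses for one report."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; the study used ChatGPT (May 24 version)
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{findings_text}"}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    example_findings = "A well-circumscribed extra-axial mass ..."  # placeholder text
    print(rank_differentials(example_findings))
```

As in the study, the top-ranked diagnosis in the returned list would be treated as the final diagnosis.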

Radiologist reading test

We provided the same image findings that were input into GPT-4 to two neuroradiologists (A: a board-certified radiologist with 13 years of experience; B: a board-certified radiologist with 8 years of experience) and three general radiologists (C: a radiology resident with 4 years of experience; D: a radiology resident with 3 years of experience; E: a radiology resident with 2 years of experience). The readers’ years of experience and specialty certifications are shown in Table 1. These two neuroradiologists and three general radiologists were different from the radiologists who verified the image findings during data collection. They read only these text findings and provided three differential diagnoses, including one final diagnosis. All five readers were blinded to the information about the report writers.

Table 1 Profile of human raters

Output evaluation

We used the pathological diagnosis of the tumor excised during neurosurgery as the ground truth. A neuroradiologist (a board-certified radiologist with 8 years of experience) and a general radiologist (a radiology resident with 4 years of experience) confirmed whether the differential diagnoses and final diagnoses suggested by the LLM and by the neuroradiologist and general radiologist readers aligned with the pathological diagnosis.

Statistical analysis

We computed the accuracy of both the differential and final diagnoses made by GPT-4 and by the two neuroradiologists and three general radiologists. To compare the diagnostic accuracy of the differential and final diagnoses between GPT-4 and each radiologist, we conducted McNemar’s test. Additionally, we calculated these accuracies separately for reports written by neuroradiologists and for reports written by general radiologists, to examine how the quality of the input (image findings) affects the diagnoses of both GPT-4 and the radiologists. Moreover, Fisher’s exact test was performed to compare, for GPT-4 and for each of the five radiologists, the diagnostic accuracy obtained from reports written by neuroradiologists versus general radiologists. p-values less than 0.05 were considered significant. p-values were not corrected for multiple comparisons. These statistical tests were performed using R (version 4.3.1, 2023; R Foundation for Statistical Computing; https://R-project.org). We measured word counts before and after simplifying the MRI findings, by reporter and by institution. The mean byte count of the MRI reports was also assessed for each reporter and institution. We calculated Cohen’s kappa coefficient between GPT-4 and each radiologist and between each pair of radiologists. Finally, we grouped cases according to the number of radiologists (from 0 to 5) who correctly diagnosed each case and analyzed the accuracy of ChatGPT within each of the six resulting groups.
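The analyses described above were carried out in R (version 4.3.1). Purely for illustration, the Python sketch below applies the same family of tests (accuracy with an exact binomial confidence interval, McNemar’s test on paired correct/incorrect outcomes, Fisher’s exact test comparing accuracy by reporter type, and Cohen’s kappa) to hypothetical 0/1 correctness vectors; all counts, data, and variable names are assumptions rather than the study’s data.

```python
# Illustrative sketch only: the study used R (v4.3.1). This shows equivalent
# tests in Python on hypothetical 0/1 "correct diagnosis" vectors.
import numpy as np
from scipy.stats import fisher_exact
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.proportion import proportion_confint

rng = np.random.default_rng(0)
n_cases = 150
gpt4_correct = rng.integers(0, 2, n_cases)         # hypothetical outcomes
radiologist_correct = rng.integers(0, 2, n_cases)  # hypothetical outcomes

# Accuracy with an exact (Clopper-Pearson) 95% confidence interval
accuracy = gpt4_correct.mean()
ci_low, ci_high = proportion_confint(gpt4_correct.sum(), n_cases, method="beta")

# McNemar's test on the paired correct/incorrect table (GPT-4 vs. one reader)
paired_table = np.array([
    [np.sum((gpt4_correct == 1) & (radiologist_correct == 1)),
     np.sum((gpt4_correct == 1) & (radiologist_correct == 0))],
    [np.sum((gpt4_correct == 0) & (radiologist_correct == 1)),
     np.sum((gpt4_correct == 0) & (radiologist_correct == 0))],
])
mcnemar_result = mcnemar(paired_table, exact=True)

# Fisher's exact test: correct/incorrect counts for reports written by
# neuroradiologists vs. general radiologists (placeholder counts)
reporter_table = [[72, 18],   # neuroradiologist-written reports
                  [36, 24]]   # general radiologist-written reports
_, fisher_p = fisher_exact(reporter_table)

# Cohen's kappa between GPT-4 and one radiologist
kappa = cohen_kappa_score(gpt4_correct, radiologist_correct)

print(f"accuracy={accuracy:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f}), "
      f"McNemar p={mcnemar_result.pvalue:.3f}, Fisher p={fisher_p:.3f}, kappa={kappa:.2f}")
```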

Results

A total of 150 radiological reports were included in this research after excluding 96 reports according to the exclusion criteria. A data collection flowchart is shown in Fig. 2. Demographics of brain MRI cases are shown in Table 2. The word count of MRI findings by reporter and institution is shown in Supplementary Appendix Table 1. The average byte count of MRI reports by reporter and institution is shown in Supplementary Appendix Table 2.

Fig. 2

Flowchart of data collection

Table 2 Demographics of brain MRI cases

The accuracies of the final and differential diagnoses by GPT-4, Neuroradiologists A and B, and General radiologists C, D, and E are shown in Table 3 and Fig. 3. For the final diagnosis, GPT-4 demonstrated diagnostic accuracy comparable to that of the neuroradiologists and general radiologists. The accuracy rates were as follows: GPT-4: 73% (95% CI: 65, 79%), Neuroradiologist A: 65% (95% CI: 57, 72%), Neuroradiologist B: 79% (95% CI: 72, 85%), General radiologist C: 65% (95% CI: 57, 72%), General radiologist D: 73% (95% CI: 66, 80%), and General radiologist E: 65% (95% CI: 57, 72%). For the differential diagnoses, GPT-4 showed diagnostic accuracy that surpassed that of both the neuroradiologists and the general radiologists. The accuracy rates were as follows: GPT-4: 94% (95% CI: 89, 97%), Neuroradiologist A: 87% (95% CI: 80, 91%), Neuroradiologist B: 89% (95% CI: 83, 93%), General radiologist C: 76% (95% CI: 69, 82%), General radiologist D: 83% (95% CI: 77, 88%), and General radiologist E: 73% (95% CI: 66, 80%).

Table 3 Accuracy of GPT-4 and radiologists’ diagnoses
Fig. 3

Accuracy of GPT-4 and radiologists. The point plots with 95% confidence intervals represent the accuracy of GPT-4 and the radiologists for the final and differential diagnoses, respectively. The blue, orange, and green plots indicate accuracy over all reports, over reports written by neuroradiologists, and over reports written by general radiologists, respectively

The accuracy per reporter for the final and differential diagnoses by GPT-4, the two neuroradiologists, and the three general radiologists is shown in Table 4 and Fig. 3. For the final diagnosis, GPT-4 showed a statistically significant difference in diagnostic accuracy between reports created by neuroradiologists and those created by general radiologists. The accuracy rates were as follows: neuroradiologists’ reports: 80% (95% CI: 71, 87%), general radiologists’ reports: 60% (95% CI: 47, 72%), p-value: 0.013. The Cohen’s kappa scores presented in Table 5 indicate varying levels of agreement among the radiologists. Neuroradiologists A and B showed higher agreement with each other (0.46) than with the general radiologists, with whom their agreement ranged from 0.36 to 0.43. Among the general radiologists (C, D, E), the highest agreement was seen between radiologists C and D (0.49) and between C and E (0.46), indicating that general radiologists also tend to agree more with each other than with neuroradiologists. The accuracy of GPT-4, stratified by the number of correct responses from the five radiologists, is shown in Supplementary Appendix Table 3.

Table 4 Intra-rater comparison of the diagnostic accuracy between reports written by neuroradiologists and general radiologists
Table 5 Inter-rater reliability among GPT-4 and radiologists

Discussion

GPT-4 and five radiologists were presented with preoperative brain MRI findings from 150 cases and asked to list differential and final diagnoses. For the final diagnoses, GPT-4’s accuracy was 73% (95% CI: 65, 79%). In comparison, Neuroradiologists A and B and General radiologists C, D, and E had accuracies of 65% (95% CI: 57, 72%), 79% (95% CI: 72, 85%), 65% (95% CI: 57, 72%), 73% (95% CI: 66, 80%), and 65% (95% CI: 57, 72%), respectively. For the differential diagnoses, GPT-4 achieved 94% (95% CI: 89, 97%) accuracy, while the radiologists’ accuracies ranged from 73% (95% CI: 66, 80%) to 89% (95% CI: 83, 93%). In the final diagnoses, GPT-4 showed an accuracy of 80% (95% CI: 71, 87%) with reports written by neuroradiologists, compared with 60% (95% CI: 47, 72%) with those written by general radiologists, a statistically significant difference (p-value: 0.013). In contrast, GPT-4’s differential diagnostic accuracy was 95% (95% CI: 88, 98%) with reports from neuroradiologists and 93% (95% CI: 83, 97%) with reports from general radiologists, which was not a statistically significant difference (p-value: 0.73). Cohen’s kappa scores indicated overall fair to moderate agreement, suggesting that even among neuroradiologists there may have been many cases prone to diagnostic disagreement. Agreement was also slightly higher among physicians of the same specialty: neuroradiologists agreed more with each other than with general radiologists, and general radiologists agreed more with each other than with neuroradiologists.

This study is the first attempt to evaluate GPT-4’s ability to interpret actual clinical radiology reports, rather than material from settings such as image diagnosis quizzes. The majority of previous research [6,7,8,9,10,11,12, 17,18,19,20,21] suggested the utility of GPT-4 in diagnostics, but these studies relied heavily on hypothetical environments such as quizzes from academic journals or examination questions [27]. This approach can introduce cognitive bias, since the individuals formulating the imaging findings or examination questions already know the answers. In these simulated scenarios, there is also a propensity to leave out minor findings. Such minor findings, while often deemed insignificant in an experimental setup, are frequently encountered in real-world clinical practice and can have implications for diagnosis. In contrast, our study departs from this previous methodology by using actual clinical findings, generated in a state of diagnostic uncertainty. This approach facilitates a more robust and practical evaluation of GPT-4’s accuracy, keeping in mind its potential applications in real-world clinical settings.

The diagnostic accuracy of GPT-4 varied depending on whether the input report was written by a neuroradiologist or a general radiologist. Specifically, for the final diagnosis, using reports from neuroradiologists yielded higher accuracy than using those from general radiologists. For the differential diagnoses, however, accuracy did not differ regardless of whether the report was written by a neuroradiologist or a general radiologist. Neuroradiologists, owing to their experience and specialized knowledge, are more likely to include the comprehensive, detailed information necessary for a final diagnosis in their reports [28,29,30]. Such high-quality reports likely enhanced GPT-4’s accuracy for final diagnoses. Conversely, GPT-4 was able to provide accurate differential diagnoses even with the general radiologists’ reports, because these reports still capture certain information crucial for a diagnosis. From these findings, a beneficial application of GPT-4 in clinical and educational settings is for neuroradiologists to use it as a second opinion to assist with final and differential diagnoses. For general radiologists, GPT-4 can be particularly useful for understanding diagnostic cues and learning about differential diagnoses, which can sometimes be time-consuming. When general radiologists encounter complex or unfamiliar cases, consulting GPT-4 could guide their diagnostic direction. Of course, any advice or suggestions from GPT-4 should be considered only one of many references, and general radiologists should prioritize consultation with experts when determining the final diagnosis. In this paper, we compared the diagnostic capabilities of GPT-4 and radiologists working from radiologist-written report texts and found that generic LLMs have significant potential as diagnostic decision support systems in radiology. If this potential were incorporated into a standard workflow, missed findings could be reduced by consulting the ChatGPT output. This is a valuable future research opportunity.

There are several limitations. This study used only the wording of actual clinical radiology reports and did not evaluate the effect of including other information such as the patient history or the images themselves, meaning the radiologists’ performance might not match their real-world diagnostic abilities. Furthermore, recent advancements in large language models have enabled the input of not only text but also images; evaluating the performance of large language models that combine radiology report texts and images could provide deeper insights into their potential usefulness in radiology diagnostics. Of the two institutions where MRI reports were collected, Institution A was also the institution of the five radiologist readers (Neuroradiologists A and B and General radiologists C, D, and E), which could introduce bias due to familiarity with the report style and writing. We evaluated the diagnostic performance of GPT-4 in only a single language, and evaluation of reports written in multiple languages is warranted. Finally, we did not assess MRI reports for diseases other than brain tumors.

GPT-4 demonstrated strong diagnostic ability, performing comparably to neuroradiologists in the task of diagnosing brain tumors from MRI reports. The implications of these findings are far-reaching, suggesting potential real-world utility, particularly in the generation of differential diagnoses for general radiologists in a clinical setting. The encouraging results of this study invite further evaluation of LLM accuracy across a wide range of medical fields and imaging modalities. The end goal of such exploration is to pave the way for the development of more versatile, reliable, and powerful tools for healthcare.