Introduction

Artificial Intelligence (AI) is transforming medicine, offering significant advances, especially in data-centric fields like radiology. Its capacity to refine diagnostic processes and improve patient outcomes marks a fundamental shift in medical workflows.

Radiology, heavily reliant on visual data, is a prime field for AI integration [1]. AI’s ability to analyze complex images offers significant diagnostic support, potentially easing radiologist workloads by automating routine tasks and efficiently identifying key pathologies [2]. The increasing use of publicly available AI tools in clinical radiology has integrated these technologies into the operational core of radiology departments [3,4,5].

Among AI’s diverse applications, large language models (LLMs) have gained prominence, particularly GPT-4 from OpenAI, noted for its advanced language understanding and generation [6,7,8,9,10,11,12,13,14,15]. A notable recent advancement of GPT-4 is its multimodal ability to analyze images alongside textual data (GPT-4V) [16]. The potential applications of this capability are substantial, particularly in radiology, where the integration of imaging findings with clinical textual data is key to accurate diagnosis. Thus, the purpose of this study was to evaluate the performance of GPT-4V in the analysis of radiological images across various imaging modalities and pathologies.

Methods

This study was approved by the Sheba Medical Center Institutional Review Board (IRB), which waived the requirement for informed consent.

Dataset selection

In this retrospective study, we conducted a systematic review of all imaging examinations recorded in our hospital’s Radiology Information System during the first week of October 2023. The study specifically focused on cases presenting to the emergency room (ER).

Our inclusion criteria addressed complexity level, diagnostic clarity, and case source. Regarding complexity, we selected ‘resident-level’ cases, defined as those typically diagnosed by a first-year radiology resident: cases in which the expected radiological signs are direct and the diagnoses unambiguous. Regarding diagnostic clarity, we included ‘clear-cut’ cases with a definitive radiologic sign and a diagnosis stated in the original radiology report with a high degree of confidence by the attending radiologist. These cases involved pathologies with characteristic imaging features that are well-documented and widely recognized in clinical practice; examples of included diagnoses are pleural effusion, pneumothorax, brain hemorrhage, hydronephrosis, uncomplicated diverticulitis, uncomplicated appendicitis, and bowel obstruction. Regarding case source, only cases originating from the ER were considered, as these typically provide a wide range of pathologies, and the urgent nature of the setting often requires prompt and clear diagnostic decisions.

We deliberately excluded any cases where the radiology report indicated uncertainty. This removed ambiguous or borderline findings that could introduce confounding variables into the evaluation of the AI’s interpretive capabilities. Examples of excluded cases include limited-quality supine chest X-rays, subtle brain atrophy, and equivocal small bowel obstruction, where the radiologic findings may not be definitive. The aim was to curate a dataset allowing a focused assessment of the AI’s performance in interpreting imaging examinations under clear, clinically relevant conditions, without the potential bias of complex or uncertain cases.

An attending body imaging radiologist and a second-year radiology resident screened the cases against the predefined inclusion criteria and selected the study cases by consensus.

A total of 230 images were selected, which represented a balanced cross-section of modalities including computed tomography (CT), ultrasound (US), and X-ray (Table 1). These images spanned various anatomical regions and pathologies, chosen to reflect a spectrum of common and critical findings appropriate for resident-level interpretation.

Table 1 Aggregated data of anatomical regions and pathologies by imaging modality

To address ethical and privacy considerations, each image was anonymized prior to analysis to maintain patient confidentiality. All identifying information was removed manually, with meticulous review of each image to ensure complete de-identification, so that the subsequent analysis focused solely on the clinical content of the images.

AI interpretation with GPT-4 multimodal

Using OpenAI’s API, GPT-4V was prompted to analyze each image. We asked for identification of the modality, anatomical region, and pathology in JSON format, to allow efficient analysis of the results. The specific prompt used was: “We are conducting a study to evaluate GPT-4 image recognition abilities in healthcare. Identify the condition and describe key findings in the image. Please return the answer in a JSON format, and specify the modality, anatomical region, and pathology. {"modality": <type of imaging modality>, "anatomical_region": <anatomical region of the image>, "pathology": <pathology in the image, or normal if none is shown>}”. The attending radiologist and the resident reviewed the AI interpretations in consensus and compared them to the imaging findings.
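For illustration, the snippet below is a minimal sketch of how each anonymized image could be submitted with this prompt through the OpenAI Python SDK. The model identifier, file handling, and helper names are assumptions for illustration and are not details reported in the study protocol.

```python
import base64
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "We are conducting a study to evaluate GPT-4 image recognition abilities "
    "in healthcare. Identify the condition and describe key findings in the "
    "image. Please return the answer in a JSON format, and specify the "
    "modality, anatomical region, and pathology. "
    '{"modality": <type of imaging modality>, '
    '"anatomical_region": <anatomical region of the image>, '
    '"pathology": <pathology in the image, or normal if none is shown>}'
)

def interpret_image(path: str) -> dict:
    """Send one anonymized image to GPT-4V and parse its JSON answer."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed vision-capable model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    # May raise ValueError if the model wraps the JSON in extra text
    return json.loads(response.choices[0].message.content)
```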

To evaluate GPT-4V’s performance, we checked for accurate recognition of the modality type, anatomical location, and pathology. Errors were classified as either omissions or hallucinations.

Data analysis

The analysis was performed using Python version 3.10. Statistical significance was determined using a p-value threshold of less than 0.05.

The primary metrics were the model’s accuracy in identifying modality, anatomical region, and pathology. These metrics were calculated per modality, as correct answers out of all answers provided by GPT-4V. Overall pathology diagnostic accuracy was calculated as the sum of correctly identified pathologies and correctly identified normal cases, out of all cases answered. A qualitative analysis of GPT-4V’s answers was also performed.
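As a sketch of this computation, the following assumes the reviewed results were tabulated with one row per case; the file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical table: one row per case, with the modality, whether GPT-4V
# answered, and reviewer-assigned correctness flags (True/False)
df = pd.read_csv("gpt4v_results.csv")

# Keep only cases where GPT-4V provided a usable pathology answer
answered = df[df["pathology_answered"]]

# Accuracy per modality: correct answers out of all answered cases
per_modality = answered.groupby("modality")["pathology_correct"].mean()

# Overall pathology diagnostic accuracy: correctly identified pathological
# plus correctly identified normal cases, out of all answered cases
overall = answered["pathology_correct"].mean()

print(per_modality.mul(100).round(1))
print(f"Overall: {overall:.1%}")
```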

Chi-square tests were employed to assess differences in GPT-4V’s ability to identify the modality, anatomical location, and pathology across the three imaging modalities.
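As a worked illustration of this test, the snippet below applies scipy’s chi-square test of independence to the per-modality pathology counts reported in the Results (correct vs. incorrect diagnoses); it is a reconstruction for illustration, not the study’s actual analysis code.

```python
from scipy.stats import chi2_contingency

# Correct vs. incorrect pathology diagnoses per modality, using the counts
# reported in the Results: X-ray 34/51, CT 36/99, US 6/66
table = [
    [34, 51 - 34],  # X-ray
    [36, 99 - 36],  # CT
    [6,  66 - 6],   # US
]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2g}")  # p < 0.001
```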

Results

Distribution of imaging modalities

The dataset consisted of 230 diagnostic images categorized by modality (CT, X-ray, US), anatomical region, and pathology. The results are summarized in Table 1. Overall, 119 images (51.7%) were pathological and 111 (48.3%) were normal.

Excluded cases

During our analysis, there were instances where GPT-4V failed to provide a response regarding the image modality, anatomical region, or pathology. In such cases the output was either “unable to provide diagnoses or interpret medical images” or simply “unknown”, applying either to the entire question or to specific fields of the JSON structure (modality, anatomy, or pathology). These instances were omitted from the final analysis: 9/230 cases for modality, 13/230 for anatomical region, and 14/230 for pathology.
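A minimal sketch of how such non-answers could be flagged before analysis is shown below; the refusal strings match those observed in the outputs, but the helper itself is hypothetical.

```python
# Hypothetical filter for GPT-4V non-answers (requires Python 3.10+)
REFUSALS = {
    "unknown",
    "unable to provide diagnoses or interpret medical images",
}

def usable(field_value: str | None) -> bool:
    """True if a JSON field (modality, anatomy, or pathology) is answered."""
    return (
        field_value is not None
        and field_value.strip().lower() not in REFUSALS
    )

# Example: exclude cases lacking a usable pathology answer
# cases = [c for c in cases if usable(c.get("pathology"))]
```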

GPT-4V performance in imaging modality and anatomical region identification

GPT-4V provided an answer regarding the imaging modality in 221/230 cases. GPT-4V demonstrated a 100% (221/221) accuracy rate for identification of the imaging modalities across CT, US, and X-ray images (Table 2).

Table 2 GPT-4 modality and anatomy identification accuracy—identified/total (%)

When asked about the anatomical region, GPT-4V provided an answer in 217/230 cases. The model correctly identified 100% (52/52) of X-ray, 97.0% (98/101) of CT, and 60.9% (39/64) of US images (p < 0.001), with an overall accuracy of 87.1% (189/217) (Table 2).

GPT-4V performance in overall pathology diagnostic accuracy

GPT-4V answered 216/230 cases when asked about the presence of pathology; of these, 111 were pathological and 105 were normal. GPT-4V demonstrated an overall accuracy of 35.2% (76/216) in pathology diagnosis, which differed notably across imaging modalities (Table 3): 66.7% (34/51) for X-ray, 36.4% (36/99) for CT, and 9.1% (6/66) for US images (p < 0.001). Examples of cases from the GPT-4V image analysis are presented in Figs. 1–6.

Table 3 GPT-4 pathology identification accuracy—identified/total (%)
Fig. 1

CT scan of the abdomen showing cholelithiasis. GPT-4V correctly identifies the image modality (“CT scan”) and the anatomical location (“Abdomen”). The pathology identified by GPT-4V is “Splenic laceration with subcapsular hematoma and hemoperitoneum,” meaning it missed the correct pathology of cholelithiasis and hallucinated the splenic pathology

Fig. 2

CT image of the lower abdomen with dilated small bowel loops consistent with obstruction. GPT-4V correctly identified the modality and the anatomical region but misclassified the image as normal

Fig. 3

Normal CT image of the head. GPT-4V correctly identified the modality and the anatomical region but hallucinated a pathology of intracranial hemorrhage

Fig. 4

CT image of the head with a right temporal parenchymal hemorrhage. GPT-4V correctly identified the image modality, anatomical region, and pathology of “Intracranial hemorrhage”

Fig. 5

US image of the right kidney demonstrating hydronephrosis. GPT-4V identified the modality but misidentified the anatomical region as “Pelvis” and the pathology as “Cholelithiasis”

Fig. 6

Chest X-ray with bilateral lung opacities, more prominent on the left. GPT-4V correctly identified the modality and anatomical region. Regarding the presence of a pathology, GPT-4V answered: “The presence of a heterogenous opacification in the right hemithorax with a mediastinal shift towards the left, and volume loss in the right lung, could be indicative of a large pleural effusion or a lung consolidation such as pneumonia, or a mass. Further clinical correlation and additional diagnostic imaging, such as CT, may be required for a definitive diagnosis.” In this answer, the model correctly identified the presence of opacification and offered a plausible differential diagnosis, but with inaccuracies regarding the laterality of the pathology (right vs. bilateral) and a hallucinated mediastinal shift, which is not seen in this image

Among the correctly diagnosed cases, ten X-ray and two CT interpretations identified the pathology correctly but described it inaccurately, with errors related to the meaning or location of the pathological finding. An example of this is shown in Fig. 6.

Of the incorrectly diagnosed pathological cases, 25.7% (18/70) were due to omission of the pathology, misclassifying the image as normal (Fig. 2), and 57.1% (40/70) were due to hallucination of an incorrect pathology (Fig. 3). The remainder (17.1%, 12/70) were due to incorrect identification of the anatomical region (Fig. 5).

Error analysis

Error analysis across imaging modalities is detailed in Table 4.

Table 4 GPT-4 mistake types across different modalities—identified/total (%)

Hallucinated pathologies were noted in 101/216 cases (46.8%), at rates that varied among modalities: the highest rate was in US at 60.6% (40/66), followed by CT at 51.5% (51/99), while X-ray showed the lowest rate at 19.6% (10/51).

A recurrent error in US imaging was misidentification of testicular anatomy, which was correctly identified in only 1 of 15 testicular US images. Pathology diagnosis accuracy was also lowest in US, particularly in testicular and renal US, at 7.7% and 4.7%, respectively.

Discussion

This study offers a detailed evaluation of multimodal GPT-4 performance in radiological image analysis. GPT-4V correctly identified all imaging modalities. The model was inconsistent in identifying anatomical regions and pathologies, exhibiting the lowest performance in US images. The overall pathology diagnostic accuracy was only 35.2%, with a high rate of 46.8% hallucinations. Consequently, GPT-4V, as it currently stands, cannot be relied upon for radiological interpretation.

The high rate of diagnostic hallucinations observed in GPT-4V’s performance is a significant concern. These hallucinations, where the model generates incorrect or fabricated information, expose a critical limitation of its current capability and indicate that GPT-4V is not yet suitable for use as a standalone diagnostic tool. Such errors could lead to misdiagnosis and patient harm if the model were used without proper oversight. Therefore, it is essential to keep radiologists involved in any task where these models are employed; radiologists can provide the clinical judgment and contextual understanding that AI models currently lack, ensuring patient safety and diagnostic accuracy.

However, the instances where GPT-4V accurately identified pathologies show promise, suggesting substantial potential with further refinement. Its ability to integrate textual and visual data is novel and has broad potential applications in healthcare, and in radiology in particular. Radiologists interpreting imaging examinations rely on imaging findings alongside the clinical context of each patient, and it has been established that clinical information and context can improve the accuracy and quality of radiology reports [17]. Similarly, the ability of LLMs to integrate clinical correlation with visual data marks a major step forward. This integration not only mirrors the decision-making process of physicians but may ultimately surpass current image analysis algorithms, which are mainly based on convolutional neural networks (CNNs) [18, 19].

GPT-4V represents a new technological paradigm in radiology, characterized by its ability to understand context, learn from minimal data (zero-shot or few-shot learning), reason, and provide explanatory insights. These features mark a significant advance over traditional AI applications in the field. Furthermore, its ability to textually describe and explain images is striking and, as the algorithm improves, may eventually enhance medical education.

A preceding study assessed GPT-4V’s performance across multiple medical imaging modalities, including CT, X-ray, and MRI, using a dataset of 56 images of varying complexity sourced from public repositories [20]. In contrast, our study not only increased the sample size to 230 radiological images but also broadened the scope by incorporating US images, a modality widely used in ER diagnostics.

We did not incorporate MRI due to its less frequent use in emergency diagnostics within our institution. Our methodology was tailored to the ER setting by consistently employing open-ended questions, aligning with the actual decision-making process in clinical practice. This approach reinforces and extends the findings of the previous study, corroborating their conclusion that the present iteration of GPT-4V falls short of the reliability required for diagnostic use, and underscoring the need for cautious application of AI in clinical diagnostics.

This study has several limitations. First, this was a retrospective analysis of patient cases, and the results should be interpreted accordingly. Second, there is potential for selection bias due to subjective case selection by the authors. Finally, we did not evaluate the performance of GPT-4V in image analysis when textual clinical context was provided, as this was outside the scope of this study.

To conclude, despite its vast potential, multimodal GPT-4 is not yet a reliable tool for clinical radiological image interpretation. Our study provides a baseline for future improvements in multimodal LLMs and highlights the importance of continued development to achieve clinical reliability in radiology.