Abstract
Objective
Radiology reporting is an essential component of clinical diagnosis and decision-making. With the advent of advanced artificial intelligence (AI) models like GPT-4 (Generative Pre-trained Transformer 4), there is growing interest in evaluating their potential for optimizing or generating radiology reports. This study aimed to compare the quality and content of radiologist-generated and GPT-4 AI-generated radiology reports.
Methods
A comparative study design was employed: a total of 100 anonymized radiology reports were randomly selected and analyzed. Each report was processed by GPT-4, resulting in the generation of a corresponding AI-generated report. Quantitative and qualitative analysis techniques were used to assess similarities and differences between the two sets of reports.
Results
The AI-generated reports showed comparable quality to radiologist-generated reports in most categories. Significant differences were observed in clarity (p = 0.027), ease of understanding (p = 0.023), and structure (p = 0.050), favoring the AI-generated reports. AI-generated reports were more concise, with 34.53 fewer words and 174.22 fewer characters on average, but had greater variability in sentence length. Content similarity was high, with an average Cosine Similarity of 0.85, Sequence Matcher Similarity of 0.52, BLEU Score of 0.5008, and BERTScore F1 of 0.8775.
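Two of the reported metrics, cosine similarity and Sequence Matcher similarity, can be illustrated with a short standard-library sketch. This is not the study's actual pipeline: the example texts are hypothetical, and the cosine measure below uses raw word counts rather than the TF-IDF weighting the authors likely applied; BLEU and BERTScore require external packages (e.g., nltk and bert-score) and are omitted.

```python
from collections import Counter
from difflib import SequenceMatcher
import math

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def sequence_sim(a: str, b: str) -> float:
    """Character-level similarity ratio from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical report excerpts (not taken from the study's data).
rad = "Mild cardiomegaly. No focal consolidation or pleural effusion."
ai = "The heart is mildly enlarged. No consolidation or effusion is seen."

print(f"cosine:   {cosine_sim(rad, ai):.2f}")
print(f"sequence: {sequence_sim(rad, ai):.2f}")
```

Both functions return values in [0, 1], where 1 indicates identical texts; the study's averages of 0.85 (cosine) and 0.52 (Sequence Matcher) reflect that cosine similarity ignores word order while SequenceMatcher penalizes rephrasing.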
Conclusion
The results of this proof-of-concept study suggest that GPT-4 can be a reliable tool for generating standardized radiology reports, offering potential benefits such as improved efficiency, better communication, and simplified data extraction and analysis. However, limitations and ethical implications must be addressed to ensure the safe and effective implementation of this technology in clinical practice.
Clinical relevance statement
The findings of this study suggest that GPT-4 (Generative Pre-trained Transformer 4), an advanced AI model, has the potential to significantly contribute to the standardization and optimization of radiology reporting, offering improved efficiency and communication in clinical practice.
Key Points
• Large language model–generated radiology reports exhibited high content similarity and moderate structural resemblance to radiologist-generated reports.
• Performance metrics highlighted the strong matching of word selection and order, as well as high semantic similarity between AI and radiologist-generated reports.
• The large language model demonstrated potential for generating standardized radiology reports, improving efficiency and communication in clinical settings.
Abbreviations
- AI: Artificial intelligence
- BERT: Bidirectional Encoder Representations from Transformers
- BLEU: Bilingual Evaluation Understudy
- CT: Computed tomography
- GPT-4: Generative Pre-trained Transformer 4
- LLMs: Large language models
- MRI: Magnetic resonance imaging
- NLG: Natural language generation
- NLP: Natural language processing
- PACS: Picture Archiving and Communication Systems
- SDV: Standard deviation
- TF-IDF: Term Frequency–Inverse Document Frequency
Funding
This study received funding from the National Institutes of Health.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Guarantor
The scientific guarantor of this publication is Ashkan Malayeri.
Conflict of interest
The authors of this manuscript declare no relationships with any companies whose products or services may be related to the subject matter of the article.
Statistics and biometry
No complex statistical methods were necessary for this paper.
Informed consent
Written informed consent was not required for this study because radiology reports were anonymized.
Ethical approval
The National Institutes of Health’s Institutional Review Board (IRB) approved the protocol as a retrospective study.
Study subjects or cohorts overlap
N/A.
Methodology
• Comparative quantitative and qualitative analysis
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Hasani, A.M., Singh, S., Zahergivar, A. et al. Evaluating the performance of Generative Pre-trained Transformer-4 (GPT-4) in standardizing radiology reports. Eur Radiol (2023). https://doi.org/10.1007/s00330-023-10384-x