Introduction

Recent diagnostic and therapeutic advances in rheumatology are still counterbalanced by a shortage of specialists [1], resulting in significant diagnostic delays [2]. Early and correct diagnosis is, however, essential to prevent persistent joint damage.

In this context, artificial intelligence applications, including patient-facing symptom checkers, represent a field of interest and could facilitate patient triage and accelerate diagnosis [3, 4]. In 2022, we showed that the symptom checker Ada had a significantly higher diagnostic accuracy than physicians in the evaluation of rheumatological case vignettes [5].

The recent introduction of large language models (LLMs) such as ChatGPT has raised expectations for their use in medicine [6]. The impact of ChatGPT arises from its ability to engage in conversation and from performance that is close to or on par with human capabilities in various cognitive tasks [7]. For instance, ChatGPT has achieved satisfactory scores on the United States Medical Licensing Examination [8], and some authors suggest that LLM applications might be suitable for clinical, educational, or research environments [9, 10].

Interestingly, pre-clinical studies suggest that this technology could also be used in the diagnostic process [11, 12], for example to distinguish inflammatory rheumatic diseases from other conditions.

We therefore aimed to assess the diagnostic accuracy of ChatGPT-4 regarding rheumatic and musculoskeletal diseases (RMDs) in comparison to a previous analysis including physicians and symptom checkers.

Methods

For this analysis, the data set of Gräf et al. [5] was used, with minor updates to the disease classification regarding the grouping of diagnoses. The case information from the symptom-checker assessments was analyzed using ChatGPT-4 and compared to the previous assessment results of Ada and the diagnostic rankings of the blinded rheumatologists. ChatGPT-4 was instructed to name the top five differential diagnoses based on the information available from the Ada assessment (see Supplement 1).
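The exact prompt wording is given in Supplement 1. Purely as an illustrative sketch, a query of this kind could also be issued programmatically via the OpenAI Python client; the model identifier, prompt text, and case vignette below are assumptions for illustration, not the wording or data used in the study:

```python
# Illustrative sketch only: the study's actual prompt is provided in Supplement 1,
# and the vignette text below is a hypothetical placeholder, not study data.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

vignette = (
    "54-year-old woman, symmetric swelling and morning stiffness of the MCP "
    "and PIP joints for 8 weeks, elevated CRP."  # hypothetical example case
)

response = client.chat.completions.create(
    model="gpt-4",  # assumed model identifier corresponding to ChatGPT-4
    messages=[
        {"role": "system",
         "content": "You are assisting with differential diagnosis in rheumatology."},
        {"role": "user",
         "content": "Based on the following case information, name the five most "
                    "likely differential diagnoses, ranked from most to least "
                    f"likely:\n{vignette}"},
    ],
)

print(response.choices[0].message.content)
```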

All diagnostic suggestions were manually reviewed. In addition to the correct diagnosis, classification of inflammatory rheumatic disease (IRD) status was assessed: if an IRD was among the top three (D3) or top five (ChatGPT-4, D5) suggestions, the respective D3 or D5 assessment was classified as IRD-positive (even if non-IRD diagnoses were also among the suggestions). Proportions of correctly classified patients were compared between the groups using McNemar’s test.
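For illustration, such a paired comparison of correct versus incorrect classifications can be computed with McNemar’s test; the sketch below uses the statsmodels implementation with made-up counts, not the study data:

```python
# Minimal sketch of McNemar's test for paired classification results,
# using hypothetical counts (not the study data).
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table of paired outcomes per patient:
# rows = ChatGPT-4 (correct / incorrect), columns = physicians (correct / incorrect)
table = np.array([
    [20, 7],   # ChatGPT-4 correct:   physicians correct / incorrect
    [4, 19],   # ChatGPT-4 incorrect: physicians correct / incorrect
])

# The exact binomial version is appropriate for small discordant counts
result = mcnemar(table, exact=True)
print(f"McNemar statistic = {result.statistic:.0f}, p = {result.pvalue:.3f}")
```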

Results

ChatGPT-4 listed the correct diagnosis as the top diagnosis comparably often to physicians (35% vs 39%; p = 0.30), as well as among the top 3 diagnoses (60% vs 55%; p = 0.38). In IRD-positive cases, ChatGPT-4 provided the correct top diagnosis in 71% vs 62% in the physician analysis, and the correct diagnosis was among the top 3 in 86% (ChatGPT-4) vs 74% (physicians). In non-IRD cases, ChatGPT-4 provided the correct top diagnosis in 15% vs 27% in the physician analysis; the correct diagnosis was among the top 3 in 46% (ChatGPT-4) vs 45% (physicians) (Fig. 1).

Fig. 1 Percentage of correctly classified diagnoses by rank

If only the first diagnostic suggestion was considered, ChatGPT-4 correctly classified the IRD status in 58% of cases compared to 56% for the rheumatologists (p = 0.52). If the top 3 diagnoses were considered, ChatGPT-4 classified 36% of cases correctly vs 52% for the rheumatologists (p = 0.01) (see Fig. 1). ChatGPT-4 included at least one inflammatory diagnosis among its suggestions for every non-IRD case.

Discussion

ChatGPT-4 showed a slightly higher accuracy (60% vs 55%) for the top 3 overall diagnoses compared to the rheumatologists’ assessments. It had a higher sensitivity for determining the correct IRD status than the rheumatologists but a considerably worse specificity, suggesting that ChatGPT-4 may be particularly useful for detecting IRD patients, in whom timely diagnosis and treatment initiation are critical. It could therefore potentially be used as a triage tool for digital pre-screening and facilitate quicker referral of patients with suspected IRDs.

Our results are in line with those of Kanjee et al. [12], who demonstrated an accuracy of 64% for ChatGPT-4 in evaluating the top 5 differential diagnoses of the New England Journal of Medicine clinicopathological conferences.

Interestingly, in the cross-sectional study of Ayers et al. [13], the authors found that chatbot responses to medical questions posted on a public social media forum were preferred over physician responses and were rated significantly higher for both quality and empathy, highlighting the potential of this technology as a first point of contact and source of information for patients. In summary, ChatGPT-4 was able to provide the correct differential diagnosis in a relevant number of cases and achieved a better sensitivity for detecting IRDs than the rheumatologists, at the cost of lower specificity.

Although this analysis has some shortcomings, namely the small sample size and the limited information available (only the Ada assessments without further clinical data), it highlights the potential of this new technology as a triage tool that could support or even speed up the diagnosis of RMDs.

As digital self-assessment and remote care options are difficult for some patients because of limited digital health competencies [14], up-to-date studies should examine how accurately patients can express their symptoms and complaints using AI and symptom-checker applications, so that these technologies can be used more effectively.

Until satisfactory results are obtained, the use of artificial intelligence by general practitioners to support effective referral, rather than for diagnosis itself, can be expanded, and larger prospective studies are recommended to further evaluate the technology. Furthermore, issues such as ethics, patient consent, and data privacy in the context of artificial intelligence in medical decision-making are crucial, and guidelines for the application of LLM technologies such as ChatGPT are needed [15].