Repeatability, reproducibility, and diagnostic accuracy of a commercial large language model (ChatGPT) to perform emergency department triage using the Canadian triage and acuity scale

  • Original Research
  • Published: Canadian Journal of Emergency Medicine

Abstract

Purpose

The release of the ChatGPT prototype to the public in November 2022 drastically lowered the barrier to using artificial intelligence by providing easy access to a large language model through a simple web interface. One situation where ChatGPT could be useful is triaging patients arriving at the emergency department. This study aimed to address the research question: can emergency physicians use ChatGPT to accurately triage patients using the Canadian Triage and Acuity Scale (CTAS)?

Methods

Six unique prompts were developed independently by five emergency physicians. An automated script was used to query ChatGPT with each of the six prompts combined with each of 61 validated, previously published patient vignettes. Thirty repetitions of each prompt-vignette combination were performed, for a total of 10,980 simulated triages.
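
For illustration, a minimal sketch of such an automated query script is shown below, written in Python against the OpenAI chat-completions client (openai >= 1.0). The prompt texts, vignette contents, model name, and output file are hypothetical placeholders; the abstract does not specify the authors' exact implementation.

import csv
import itertools

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical placeholders: the study used 6 physician-written prompts
# and 61 validated patient vignettes, none of which are reproduced here.
prompts = ["<prompt text 1>", "<prompt text 2>"]
vignettes = ["<vignette text 1>", "<vignette text 2>"]
REPETITIONS = 30

with open("triage_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt_id", "vignette_id", "repetition", "response"])
    for (p, prompt), (v, vignette), rep in itertools.product(
        enumerate(prompts), enumerate(vignettes), range(REPETITIONS)
    ):
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",  # assumption: the abstract does not name the model version
            messages=[{"role": "user", "content": f"{prompt}\n\n{vignette}"}],
        )
        # Store the raw reply; parsing out the CTAS level (and counting
        # replies containing no score) happens in a later analysis step.
        writer.writerow([p, v, rep, completion.choices[0].message.content])

Logging every raw response, rather than only a parsed score, makes it possible to count the queries in which no CTAS score was returned at all.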

Results

A CTAS score was returned in 99.6% of the 10,980 queries. However, there was considerable variation in the results. Repeatability (repeated use of the same prompt) accounted for 21.0% of the overall variation, and reproducibility (use of different prompts) accounted for 4.0%. The overall accuracy of ChatGPT in triaging the simulated patients was 47.5%, with a 13.7% under-triage rate and a 38.7% over-triage rate. More extensively detailed prompt text was associated with greater reproducibility but only a minimal increase in accuracy.
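
As a rough illustration of how such figures can be derived from the raw query results, the pandas sketch below computes accuracy and triage-error rates against the vignettes' reference CTAS levels, plus a simplified repeatability/reproducibility variance partition. The file and column names and the variance formulas are assumptions for illustration, not the study's formal analysis.

import pandas as pd

# Assumed columns: prompt_id, vignette_id, ctas (parsed model score),
# truth (the vignette's reference CTAS level). CTAS 1 = most acute.
df = pd.read_csv("triage_results_parsed.csv")

accuracy = (df["ctas"] == df["truth"]).mean()
under_triage = (df["ctas"] > df["truth"]).mean()  # less acute than reference
over_triage = (df["ctas"] < df["truth"]).mean()   # more acute than reference

# Repeatability: variance among the 30 repetitions within each
# prompt-vignette cell, averaged over cells.
repeat_var = df.groupby(["prompt_id", "vignette_id"])["ctas"].var().mean()

# Reproducibility: variance among prompt-level mean scores within each
# vignette, averaged over vignettes.
cell_means = df.groupby(["prompt_id", "vignette_id"], as_index=False)["ctas"].mean()
repro_var = cell_means.groupby("vignette_id")["ctas"].var().mean()

total_var = df["ctas"].var()
print(f"accuracy={accuracy:.1%}  under-triage={under_triage:.1%}  over-triage={over_triage:.1%}")
print(f"repeatability share={repeat_var / total_var:.1%}  "
      f"reproducibility share={repro_var / total_var:.1%}")

Note that under-triage corresponds to a numerically higher CTAS level than the reference, since CTAS level 1 denotes the most urgent category and level 5 the least.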

Conclusions

This study suggests that the current ChatGPT large language model is not sufficient for emergency physicians to triage simulated patients using the Canadian Triage and Acuity Scale due to poor repeatability and accuracy. Medical practitioners should be aware that while ChatGPT can be a valuable tool, it may lack consistency and may frequently provide false information.



Data availability

The data are not published online but may be made available on request for non-commercial use by contacting the corresponding author.


Acknowledgements

The authors would like to thank Sandra Franchuk for her help in editing the manuscript.

Author information


Corresponding author

Correspondence to Jeffrey Michael Franc.

Ethics declarations

Conflict of interest

JMF is the CEO and Founder of STAT59. All the other authors declare that they have no conflicts of interest.

Additional information

Communicated by Ian Drennan.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (DOCX 6 KB)

Supplementary file 2 (DOCX 6 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Franc, J.M., Cheng, L., Hart, A. et al. Repeatability, reproducibility, and diagnostic accuracy of a commercial large language model (ChatGPT) to perform emergency department triage using the Canadian triage and acuity scale. Can J Emerg Med 26, 40–46 (2024). https://doi.org/10.1007/s43678-023-00616-w

