Abstract
Purpose
The release of the ChatGPT prototype to the public in November 2022 drastically reduced the barrier to using artificial intelligence by allowing easy access to a large language model through a simple web interface. One situation where ChatGPT could be useful is in triaging patients arriving at the emergency department. This study aimed to address the research question: “Can emergency physicians use ChatGPT to accurately triage patients using the Canadian Triage and Acuity Scale (CTAS)?”
Methods
Six unique prompts were developed independently by five emergency physicians. An automated script was used to query ChatGPT with each of the six prompts combined with 61 validated and previously published patient vignettes. Thirty repetitions of each combination were performed, for a total of 10,980 simulated triages.
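The querying protocol described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual script: the function name, the score-parsing rule, and the pluggable `ask` callable (which would wrap a ChatGPT API call in practice) are all assumptions.

```python
import itertools
import re
from typing import Callable

def run_triage_study(prompts: list[str], vignettes: list[str],
                     repetitions: int, ask: Callable[[str], str]) -> list[dict]:
    """Query the model once for every (prompt, vignette, repetition)
    combination and parse a CTAS level (1-5) from each reply.

    Illustrative sketch only; `ask` stands in for a ChatGPT API call.
    """
    results = []
    for prompt, vignette in itertools.product(prompts, vignettes):
        for rep in range(repetitions):
            reply = ask(f"{prompt}\n\nPatient vignette:\n{vignette}")
            # Take the first standalone digit 1-5 as the CTAS score;
            # None records a query where no score was returned.
            match = re.search(r"\b[1-5]\b", reply)
            results.append({
                "prompt": prompt,
                "vignette": vignette,
                "repetition": rep,
                "ctas": int(match.group()) if match else None,
            })
    return results
```

With six prompts, 61 vignettes, and 30 repetitions, this design yields the study's 6 × 61 × 30 = 10,980 queries.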
Results
In 99.6% of the 10,980 queries, a CTAS score was returned. However, there was considerable variation in the results. Repeatability (repeated use of the same prompt) accounted for 21.0% of the overall variation, and reproducibility (use of different prompts) for 4.0%. The overall accuracy of ChatGPT in triaging the simulated patients was 47.5%, with an under-triage rate of 13.7% and an over-triage rate of 38.7%. More extensively detailed prompt text was associated with greater reproducibility but only a minimal increase in accuracy.
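The repeatability/reproducibility split reported above follows the logic of a gage R&R analysis. The sketch below is a simplified illustration of the two variance components, not the formal crossed gage R&R method a study like this would use; the data layout and function name are assumptions.

```python
import statistics
from collections import defaultdict

def rr_components(scores: dict) -> tuple[float, float, float]:
    """Simplified repeatability/reproducibility variance estimates.

    scores maps (prompt, vignette) -> list of repeated CTAS scores.
    Illustration only; a formal gage R&R uses a crossed ANOVA model.
    """
    # Repeatability: average variance of repeated runs within each
    # (prompt, vignette) cell, i.e. same prompt asked again and again.
    repeatability = statistics.mean(
        statistics.pvariance(v) for v in scores.values()
    )
    # Reproducibility: variance of the per-prompt mean scores,
    # i.e. how much changing the prompt shifts the results.
    by_prompt = defaultdict(list)
    for (prompt, _vignette), vals in scores.items():
        by_prompt[prompt].extend(vals)
    reproducibility = statistics.pvariance(
        [statistics.mean(v) for v in by_prompt.values()]
    )
    # Total variation across every individual score.
    total = statistics.pvariance(
        [x for v in scores.values() for x in v]
    )
    return repeatability, reproducibility, total
```

Each component can then be expressed as a share of the total variation, which is how figures like the 21.0% (repeatability) and 4.0% (reproducibility) shares above are reported.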
Conclusions
This study suggests that the current ChatGPT large language model is not sufficient for emergency physicians to triage simulated patients using the Canadian Triage and Acuity Scale due to poor repeatability and accuracy. Medical practitioners should be aware that while ChatGPT can be a valuable tool, it may lack consistency and may frequently provide false information.
Data availability
The data are not published online, but may be made available on request to non-commercial users by contacting the corresponding author.
Acknowledgements
The authors would like to thank Sandra Franchuk for her help in editing the manuscript.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
JMF is the CEO and Founder of STAT59. All the other authors declare that they have no conflicts of interest.
Additional information
Communicated by Ian Drennan.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Franc, J.M., Cheng, L., Hart, A. et al. Repeatability, reproducibility, and diagnostic accuracy of a commercial large language model (ChatGPT) to perform emergency department triage using the Canadian triage and acuity scale. Can J Emerg Med 26, 40–46 (2024). https://doi.org/10.1007/s43678-023-00616-w
Keywords
- Emergency medicine
- Triage
- Artificial intelligence
- Large language models
- Canadian triage and acuity scale