Repeatability, reproducibility, and diagnostic accuracy of a commercial large language model (ChatGPT) to perform emergency department triage using the Canadian triage and acuity scale

  • Original Research
  • Published: Canadian Journal of Emergency Medicine

Abstract

Purpose

The release of the ChatGPT prototype to the public in November 2022 drastically lowered the barrier to using artificial intelligence by providing easy access to a large language model through a simple web interface. One situation where ChatGPT could be useful is triaging patients arriving at the emergency department. This study aimed to address the research question: can emergency physicians use ChatGPT to accurately triage patients using the Canadian Triage and Acuity Scale (CTAS)?

Methods

Six unique prompts were developed independently by five emergency physicians. An automated script was used to query ChatGPT with each of the six prompts combined with each of 61 validated, previously published patient vignettes. Thirty repetitions of each prompt-vignette combination were performed, for a total of 10,980 simulated triages.
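
For illustration, a minimal sketch of such an automated query script is shown below, written in Python against the OpenAI chat-completions client (openai >= 1.0). The prompt texts, vignette contents, model name, and output file are hypothetical placeholders; the abstract does not specify the authors' exact implementation.

import csv
import itertools

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical placeholders: the study used 6 physician-written prompts
# and 61 validated patient vignettes, none of which are reproduced here.
prompts = ["<prompt text 1>", "<prompt text 2>"]
vignettes = ["<vignette text 1>", "<vignette text 2>"]
REPETITIONS = 30

with open("triage_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt_id", "vignette_id", "repetition", "response"])
    for (p, prompt), (v, vignette), rep in itertools.product(
        enumerate(prompts), enumerate(vignettes), range(REPETITIONS)
    ):
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",  # assumption: the abstract does not name the model version
            messages=[{"role": "user", "content": f"{prompt}\n\n{vignette}"}],
        )
        # Store the raw reply; parsing out the CTAS level (and counting
        # replies containing no score) happens in a later analysis step.
        writer.writerow([p, v, rep, completion.choices[0].message.content])

Logging every raw response, rather than only a parsed score, makes it possible to count the queries in which no CTAS score was returned at all.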

Results

A CTAS score was returned in 99.6% of the 10,980 queries. However, there was considerable variation in the results. Repeatability (repeated use of the same prompt) accounted for 21.0% of the overall variation, and reproducibility (use of different prompts) accounted for 4.0%. The overall accuracy of ChatGPT in triaging the simulated patients was 47.5%, with a 13.7% under-triage rate and a 38.7% over-triage rate. More extensively detailed prompt text was associated with greater reproducibility but only a minimal increase in accuracy.
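
As a rough illustration of how such figures can be derived from the raw query results, the pandas sketch below computes accuracy and triage-error rates against the vignettes' reference CTAS levels, plus a simplified repeatability/reproducibility variance partition. The file and column names and the variance formulas are assumptions for illustration, not the study's formal analysis.

import pandas as pd

# Assumed columns: prompt_id, vignette_id, ctas (parsed model score),
# truth (the vignette's reference CTAS level). CTAS 1 = most acute.
df = pd.read_csv("triage_results_parsed.csv")

accuracy = (df["ctas"] == df["truth"]).mean()
under_triage = (df["ctas"] > df["truth"]).mean()  # less acute than reference
over_triage = (df["ctas"] < df["truth"]).mean()   # more acute than reference

# Repeatability: variance among the 30 repetitions within each
# prompt-vignette cell, averaged over cells.
repeat_var = df.groupby(["prompt_id", "vignette_id"])["ctas"].var().mean()

# Reproducibility: variance among prompt-level mean scores within each
# vignette, averaged over vignettes.
cell_means = df.groupby(["prompt_id", "vignette_id"], as_index=False)["ctas"].mean()
repro_var = cell_means.groupby("vignette_id")["ctas"].var().mean()

total_var = df["ctas"].var()
print(f"accuracy={accuracy:.1%}  under-triage={under_triage:.1%}  over-triage={over_triage:.1%}")
print(f"repeatability share={repeat_var / total_var:.1%}  "
      f"reproducibility share={repro_var / total_var:.1%}")

Note that under-triage corresponds to a numerically higher CTAS level than the reference, since CTAS level 1 denotes the most urgent category and level 5 the least.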

Conclusions

This study suggests that the current ChatGPT large language model is not sufficient for emergency physicians to triage simulated patients using the Canadian Triage and Acuity Scale due to poor repeatability and accuracy. Medical practitioners should be aware that while ChatGPT can be a valuable tool, it may lack consistency and may frequently provide false information.



Data availability

The data are not published online but may be made available on request for non-commercial use by contacting the corresponding author.


Acknowledgements

The authors would like to thank Sandra Franchuk for her help in editing the manuscript.

Author information


Corresponding author

Correspondence to Jeffrey Michael Franc.

Ethics declarations

Conflict of interest

JMF is the CEO and Founder of STAT59. All the other authors declare that they have no conflicts of interest.

Additional information

Communicated by Ian Drennan.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (DOCX 6 KB)

Supplementary file 2 (DOCX 6 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Franc, J.M., Cheng, L., Hart, A. et al. Repeatability, reproducibility, and diagnostic accuracy of a commercial large language model (ChatGPT) to perform emergency department triage using the Canadian triage and acuity scale. Can J Emerg Med 26, 40–46 (2024). https://doi.org/10.1007/s43678-023-00616-w

