Abstract
Background
Large language model (LLM)-linked chatbots may be an efficient source of clinical recommendations for healthcare providers and patients. This study evaluated the performance of LLM-linked chatbots in providing recommendations for the surgical management of gastroesophageal reflux disease (GERD).
Methods
Nine patient cases were created based on key questions (KQs) addressed by the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) guidelines for the surgical treatment of GERD. ChatGPT-3.5, ChatGPT-4, Copilot, Google Bard, and Perplexity AI were queried on November 16, 2023, for recommendations regarding the surgical management of GERD. Chatbot accuracy was measured as the number of responses that aligned with SAGES guideline recommendations. Outcomes were reported as counts and percentages.
Results
For an adult patient, surgeons were given recommendations for the surgical management of GERD that were accurate according to the SAGES guidelines for 5/7 (71.4%) KQs by ChatGPT-4, 3/7 (42.9%) KQs by Copilot, 6/7 (85.7%) KQs by Google Bard, and 3/7 (42.9%) KQs by Perplexity. Patients were given accurate recommendations for 3/5 (60.0%) KQs by ChatGPT-4, 2/5 (40.0%) KQs by Copilot, 4/5 (80.0%) KQs by Google Bard, and 1/5 (20.0%) KQs by Perplexity. For a pediatric patient, surgeons were given accurate recommendations for 2/3 (66.7%) KQs by ChatGPT-4, 3/3 (100.0%) KQs by Copilot, 3/3 (100.0%) KQs by Google Bard, and 2/3 (66.7%) KQs by Perplexity. Patients were given appropriate guidance for 2/2 (100.0%) KQs by ChatGPT-4, 2/2 (100.0%) KQs by Copilot, 1/2 (50.0%) KQs by Google Bard, and 1/2 (50.0%) KQs by Perplexity.
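The counts-and-percentages reporting can be reproduced with a short calculation. The following is a minimal sketch, not the authors' analysis code: the counts are transcribed from the Results paragraph, while the names `REPORTED`, `percent`, and `summarize` are hypothetical and introduced only for illustration.

```python
# Counts transcribed from the Results paragraph, stored as
# (accurate responses, key questions asked).
# The dictionary layout and all names here are illustrative assumptions.
REPORTED = {
    ("adult", "surgeon"): {
        "ChatGPT-4": (5, 7), "Copilot": (3, 7),
        "Google Bard": (6, 7), "Perplexity": (3, 7),
    },
    ("adult", "patient"): {
        "ChatGPT-4": (3, 5), "Copilot": (2, 5),
        "Google Bard": (4, 5), "Perplexity": (1, 5),
    },
    ("pediatric", "surgeon"): {
        "ChatGPT-4": (2, 3), "Copilot": (3, 3),
        "Google Bard": (3, 3), "Perplexity": (2, 3),
    },
    ("pediatric", "patient"): {
        "ChatGPT-4": (2, 2), "Copilot": (2, 2),
        "Google Bard": (1, 2), "Perplexity": (1, 2),
    },
}


def percent(accurate, total):
    """Return the share of accurate responses as a percentage, to one decimal."""
    return round(100 * accurate / total, 1)


def summarize(reported):
    """Flatten the nested counts into (population, audience, chatbot, n, N, %) rows."""
    rows = []
    for (population, audience), by_bot in reported.items():
        for bot, (accurate, total) in by_bot.items():
            rows.append(
                (population, audience, bot, accurate, total, percent(accurate, total))
            )
    return rows


if __name__ == "__main__":
    for population, audience, bot, accurate, total, pct in summarize(REPORTED):
        print(f"{population:<9} {audience:<7} {bot:<11} {accurate}/{total} ({pct:.1f}%)")
```

Each row pairs a raw count with its percentage, which keeps the small denominators (two to seven KQs per group) visible alongside the rates.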
Conclusions
Gastrointestinal surgeons, gastroenterologists, and patients should recognize both the promise and pitfalls of LLMs when used for advice on the surgical management of GERD. Additional training of LLMs using evidence-based health information is needed.
Acknowledgements
The authors would like to thank the SAGES Guideline Committee for their expert guidance in the development of this manuscript.
Funding
This study received no funding.
Ethics declarations
Disclosures
Walsh is Co-Chair of the Guidelines Committee for the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES). Walsh is a Member of the American College of Surgeons Health Information Technology Committee and the Board of Governors. Slater is a consultant for Cook Medical and Hologic. Slater is the Chair of the SAGES Guidelines Committee. Sylla is a consultant for Safeheal, Ethicon, Stryker and Tissium. Sylla is the president of SAGES. Huo, Calabrese, Kumar, Ignacio, Oviedo, Hassan, Kaiser, and Vosburg have no conflicts of interest to disclose.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Huo, B., Calabrese, E., Sylla, P. et al. The performance of artificial intelligence large language model-linked chatbots in surgical decision-making for gastroesophageal reflux disease. Surg Endosc (2024). https://doi.org/10.1007/s00464-024-10807-w
DOI: https://doi.org/10.1007/s00464-024-10807-w