The performance of artificial intelligence large language model-linked chatbots in surgical decision-making for gastroesophageal reflux disease

Huo, Bright; Calabrese, Elisa; Sylla, Patricia; Kumar, Sunjay; Ignacio, Romeo C.; Oviedo, Rodolfo; Hassan, Imran; Slater, Bethany J.; Kaiser, Andreas; Walsh, Danielle S.; Vosburg, Wesley

doi:10.1007/s00464-024-10807-w

SAGES/EAES Official Publication
Published: 17 April 2024

Surgical Endoscopy (2024)

Bright Huo¹,
Elisa Calabrese²,
Patricia Sylla³,
Sunjay Kumar⁴,
Romeo C. Ignacio⁵,
Rodolfo Oviedo⁶,⁷,⁸,
Imran Hassan⁹,
Bethany J. Slater¹⁰,
Andreas Kaiser¹¹,
Danielle S. Walsh¹² &
Wesley Vosburg¹³ (ORCID: orcid.org/0000-0003-3136-7879)

Abstract

Background

Large language model (LLM)-linked chatbots may be an efficient source of clinical recommendations for healthcare providers and patients. This study evaluated the performance of LLM-linked chatbots in providing recommendations for the surgical management of gastroesophageal reflux disease (GERD).

Methods

Nine patient cases were created based on key questions (KQs) addressed by the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) guidelines for the surgical treatment of GERD. ChatGPT-3.5, ChatGPT-4, Copilot, Google Bard, and Perplexity AI were queried on November 16, 2023, for recommendations regarding the surgical management of GERD. Accurate chatbot performance was defined as the number of responses aligning with SAGES guideline recommendations. Outcomes were reported with counts and percentages.

Results

Measured against the SAGES guidelines, surgeons were given accurate recommendations for the surgical management of GERD in an adult patient for 5/7 (71.4%) KQs by ChatGPT-4, 3/7 (42.9%) KQs by Copilot, 6/7 (85.7%) KQs by Google Bard, and 3/7 (42.9%) KQs by Perplexity. For the same adult case, patients were given accurate recommendations for 3/5 (60.0%) KQs by ChatGPT-4, 2/5 (40.0%) KQs by Copilot, 4/5 (80.0%) KQs by Google Bard, and 1/5 (20.0%) KQs by Perplexity. In a pediatric patient, surgeons were given accurate recommendations for 2/3 (66.7%) KQs by ChatGPT-4, 3/3 (100.0%) KQs by Copilot, 3/3 (100.0%) KQs by Google Bard, and 2/3 (66.7%) KQs by Perplexity. For the pediatric case, patients were given appropriate guidance for 2/2 (100.0%) KQs by ChatGPT-4, 2/2 (100.0%) KQs by Copilot, 1/2 (50.0%) KQs by Google Bard, and 1/2 (50.0%) KQs by Perplexity.
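As an illustration only, the counts-and-percentages reporting used above can be reproduced with a short Python sketch. The counts are copied from the Results paragraph; the dictionary layout and the names `REPORTED` and `format_result` are assumptions made for this example and are not taken from the authors' analysis.

```python
# Aligned responses and number of key questions (KQs) asked, copied from
# the Results paragraph. The data layout is a hypothetical choice for
# illustration, not the authors' own data structure.
REPORTED = {
    ("adult", "surgeon"): {
        "ChatGPT-4": (5, 7), "Copilot": (3, 7),
        "Google Bard": (6, 7), "Perplexity": (3, 7),
    },
    ("adult", "patient"): {
        "ChatGPT-4": (3, 5), "Copilot": (2, 5),
        "Google Bard": (4, 5), "Perplexity": (1, 5),
    },
    ("pediatric", "surgeon"): {
        "ChatGPT-4": (2, 3), "Copilot": (3, 3),
        "Google Bard": (3, 3), "Perplexity": (2, 3),
    },
    ("pediatric", "patient"): {
        "ChatGPT-4": (2, 2), "Copilot": (2, 2),
        "Google Bard": (1, 2), "Perplexity": (1, 2),
    },
}


def format_result(aligned: int, total: int) -> str:
    """Format a count as 'aligned/total (percent%)' with one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{aligned}/{total} ({100 * aligned / total:.1f}%)"


if __name__ == "__main__":
    # Print one line per case, audience, and chatbot.
    for (case, audience), by_chatbot in REPORTED.items():
        for chatbot, (aligned, total) in by_chatbot.items():
            print(f"{case:9s} {audience:8s} {chatbot:12s} "
                  f"{format_result(aligned, total)}")
```

Because each cell rests on between two and seven key questions, a single response changes a percentage by 14 to 50 points, so the figures are best read alongside their raw counts.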

Conclusions

Gastrointestinal surgeons, gastroenterologists, and patients should recognize both the promise and pitfalls of LLMs when utilized for advice on surgical management of GERD. Additional training of LLMs using evidence-based health information is needed.

Acknowledgements

The authors would like to thank the SAGES Guideline Committee for their expert guidance in the development of this manuscript.

Funding

This study received no funding.

Author information

Authors and Affiliations

Division of General Surgery, Department of Surgery, McMaster University, Hamilton, ON, Canada
Bright Huo
University of California San Francisco, East Bay, Oakland, CA, USA
Elisa Calabrese
Division of Colon and Rectal Surgery, Department of Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Patricia Sylla
Department of General Surgery, Thomas Jefferson University Hospital, Philadelphia, PA, USA
Sunjay Kumar
Division of Pediatric Surgery, Department of Surgery, University of California San Diego School of Medicine, San Diego, CA, USA
Romeo C. Ignacio
Nacogdoches Center for Metabolic and Weight Loss Surgery, Nacogdoches, TX, USA
Rodolfo Oviedo
University of Houston Tilman J. Fertitta Family College of Medicine, Houston, TX, USA
Rodolfo Oviedo
Sam Houston State University College of Osteopathic Medicine, Conroe, TX, USA
Rodolfo Oviedo
University of Iowa, Iowa City, IA, USA
Imran Hassan
Department of Surgery, University of Chicago, Chicago, IL, USA
Bethany J. Slater
Division of Colorectal Surgery, Department of Surgery, City of Hope National Medical Center, Duarte, CA, USA
Andreas Kaiser
Department of Surgery, University of Kentucky, Lexington, KY, USA
Danielle S. Walsh
Department of Surgery, Harvard Medical School, Mount Auburn Hospital, Cambridge, MA, USA
Wesley Vosburg

Corresponding author

Correspondence to Wesley Vosburg.

Ethics declarations

Disclosures

Walsh is Co-Chair of the Guidelines Committee for the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES). Walsh is a Member of the American College of Surgeons Health Information Technology Committee and the Board of Governors. Slater is a consultant for Cook Medical and Hologic. Slater is the Chair of the Guidelines Committee for SAGES. Sylla is a consultant for Safeheal, Ethicon, Stryker and Tissium. Sylla is the president of SAGES. Huo, Calabrese, Kumar, Ignacio, Oviedo, Hassan, Kaiser, and Vosburg have no conflicts of interest to disclose.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 7760 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Huo, B., Calabrese, E., Sylla, P. et al. The performance of artificial intelligence large language model-linked chatbots in surgical decision-making for gastroesophageal reflux disease. Surg Endosc (2024). https://doi.org/10.1007/s00464-024-10807-w

Received: 12 March 2024
Accepted: 21 March 2024
Published: 17 April 2024
DOI: https://doi.org/10.1007/s00464-024-10807-w
