Avoid common mistakes on your manuscript.
Introduction
Bariatric surgery is an effective and safe treatment for severe obesity [1]. Accurate and comprehensive perioperative education is integral to patients’ surgical journeys and outcomes. Large language models (LLMs), like ChatGPT, have the potential to revolutionize patient education by leveraging vast quantities of data to respond to user prompts in an easy-to-understand and conversational manner. Released by OpenAI in November of 2022, GPT-3.5 acquired 1 million users within 5 days of its release, outpacing applications such as Facebook, Twitter, and Instagram [2]. By January of 2023, its user base reached 100 million monthly active users, making it the fastest growing consumer application in history [3]. Our recent study demonstrated the impressive ability of GPT-3.5 in answering questions related to bariatric surgery, showing high accuracy, comprehensiveness, and reproducibility of responses [4]. GPT-3.5’s successor, GPT-4, was released in March of 2023 with improvements in performance across multiple domains [5,6,7,8]. The current study builds on our previous analysis by comparing the accuracy and comprehensiveness of GPT-4 compared to GPT-3.5, in answering questions related to bariatric surgery.
Methods
A total of 151 questions related to bariatric surgery sourced from healthcare institutions, and Facebook support groups were included. The methodology for question curation is described in our previous study [4]. To better characterize ChatGPT’s performance, questions were organized into 5 categories: (1) “eligibility, efficacy, and procedure options”, (2) “preoperative preparation”, (3) “recovery, risks, and complications”, (4) “lifestyle changes”, and (5) “other”.
Response Generation and Grading
Each question was entered independently into both GPT-3.5 and GPT-4 in July 2023 using the “New Chat” function on the OpenAI platform. Differences in accuracy and comprehensiveness of responses between GPT-3.5 and GPT-4 were graded by a board-certified, fellowship-trained, bariatric surgeon practicing in a tertiary and quaternary referral center with over 10 years of experience. The scale used for independent grading of accuracy and comprehensiveness was as follows: Compared to the response from GPT-3.5, the response from GPT-4 is:
-
1)
Less accurate/comprehensiveness
-
2)
Similar accuracy/ comprehensiveness
-
3)
More accurate/comprehensive
Statistical analysis consisted of descriptive analysis summarizing proportions and percentages of responses earning each grade. All statistical analyses were performed in Microsoft Excel (Version 16.69.1).
Results
A total of 151 questions were included in our analysis (Supplementary Table 1). The majority of responses were graded as similar in accuracy between the two models. Of the total 151 responses from GPT-4, 3 (3.3%) were graded as less accurate, 133 (88.1%) as similar in accuracy, and 13 (8.6%) as more accurate compared to GPT-3.5 (Table 1, Fig. 1). A more notable difference in responses was observed when examining the comprehensiveness between the two models. A total of 15/151 (9.9%) of GPT-4’s responses were graded as less comprehensive, 81/151 (53.6%) as similar comprehensiveness, and 55/151 (36.4%) as more comprehensive compared to GPT-3.5 (Table 1, Fig. 1).
Conclusion
We present a follow up analysis comparing the accuracy and comprehensiveness of responses from GPT-3.5 and GPT-4 to questions related to bariatric surgery. In terms of accuracy, our results show a largely uniform performance between the two models with a significant majority (88.1%) of responses graded as having similar accuracy. These findings may suggest a degree of stability and reliability among the core algorithms when it comes to the generation of accurate responses. It is important to note that both models have been undergoing continuous refinement and updating, which may explain this comparability in performance. A more striking differentiation was observed when examining the comprehensiveness of responses. While over half of the responses (53.6%) had similar levels of comprehensiveness between the two models, a considerable number (36.4%) of GPT-4’s responses were found to be more comprehensive compared to GPT-3.5. This could be attributed to the enhanced training methodologies and an expanded data set in GPT-4, allowing for more context-rich and detailed answers [5]. For example, in "Preoperative Preparation," GPT-4 provided an extensive list of pre-surgical dietary guidelines as well as psychosocial considerations that were absent in GPT-3.5’s response. It’s notable that GPT-4 provided less comprehensive and accurate responses compared to GPT-3.5 for some questions. This discrepancy in performance for a minority of questions may be due to multiple reasons including model training, training data, and the nature of LLMs which generate text based on probabilities, leading to variation in performance on some occasions. Our study design was pragmatic in that question input mirrored how a user with no technological background may use an LLM. Therefore, advanced prompting strategies may minimize the variation in performance of LLMs and improve overall performance, a topic that would benefit from investigation in future studies.
Limitations and Future Directions
Our study is not without its limitations. First, the grading of responses was carried out by a single reviewer, which is subjective in nature despite the reviewer’s extensive experience. The list of questions used in our study is not comprehensive of all possible patient questions related to bariatric surgery and therefore may not be generalizable to ChatGPTs responses to all possible information regarding bariatric surgery.
In conclusion, GPT-3.5 and GPT-4 demonstrated relatively similar ability to generate accurate responses to bariatric surgery-related questions. However, GPT-4 provided more comprehensive responses to 36.4% of questions, demonstrating a significant improvement in model performance with iterations of the ChatGPT model. It’s important to note that both models provided inaccurate information, and therefore we advocate for their potential future role as adjunct sources of information to medical advice provided by licensed healthcare professionals. Our analysis suggests a steady increase in the robustness of large language models in providing accurate and comprehensive medical information. These improvements may be significant in future iterations and warrant further studies to examine their impact on clinical outcomes in bariatric surgery.
References
Arterburn DE, Telem DA, Kushner RF, Courcoulas AP. Benefits and risks of bariatric surgery in adults: a review. JAMA. 2020;324(9):879. https://doi.org/10.1001/jama.2020.12567.
Mollman S. ChatGPT gained 1 million users in under a week. Here’s why the AI chatbot is primed to disrupt search as we know it. Yahoo News. https://www.yahoo.com/video/chatgpt-gained-1-million-followers-224523258.html?guccounter=1&guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_sig=AQAAAKFfXrTV5yin411k7gddGbnGomYatqe2K486I97HLX57I0LxfcM1bWIV6uJJMOteBHmPz5tlfex1ENAY00dVAP_DxnATz8V_nb19AV4x1-TZKq0ZMn_t4A3aDGWKw86Iyto06MbWlQBEV9mDTtpbuZhGI1crpD3TFtqL2_rwKZaB. Published December 9, 2022. Accessed 26 July 2023.
Krystal H. ChatGPT sets record for fastest-growing user base - analyst note. Reuters. https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/. Published February 2, 2023. Accessed 26 July 2023.
Samaan JS, Yeo YH, Rajeev N, et al. Assessing the accuracy of responses by the language model ChatGPT to questions regarding bariatric surgery. Obes Surg. Published online April 27, 2023. https://doi.org/10.1007/s11695-023-06603-5.
OpenAI. GPT-4 technical report. Published online 2023. https://doi.org/10.48550/ARXIV.2303.08774.
Yeo YH, Samaan JS, Ng WH, et al. GPT-4 outperforms ChatGPT in answering non-English questions related to cirrhosis. Gastroenterology. 2023. https://doi.org/10.1101/2023.05.04.23289482.
Kaneda Y, Takahashi R, Kaneda U, et al. Assessing the performance of GPT-3.5 and GPT-4 on the 2023 Japanese nursing examination. Cureus. 2023;15(8):e42924. https://doi.org/10.7759/cureus.42924.
Currie G, Robbie S, Tually P. ChatGPT and patient information in nuclear medicine: GPT-3.5 versus GPT-4. J Nucl Med Technol. Published online September 12, 2023. https://doi.org/10.2967/jnmt.123.266151.
Funding
Open access funding provided by SCELC, Statewide California Electronic Library Consortium
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Key Points
• GPT-4 provided accurate and comprehensive responses to questions related to bariatric surgery
• GPT-4 and GPT-3.5 provided responses that are relatively comparable in accuracy
• GPT-4 provided more comprehensive responses compared to GPT-3.5
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Samaan, J.S., Rajeev, N., Ng, W.H. et al. ChatGPT as a Source of Information for Bariatric Surgery Patients: a Comparative Analysis of Accuracy and Comprehensiveness Between GPT-4 and GPT-3.5. OBES SURG 34, 1987–1989 (2024). https://doi.org/10.1007/s11695-024-07212-6
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11695-024-07212-6