Introduction

Bariatric surgery is an effective and safe treatment for severe obesity [1]. Accurate and comprehensive perioperative education is integral to patients’ surgical journeys and outcomes. Large language models (LLMs), such as ChatGPT, have the potential to revolutionize patient education by leveraging vast quantities of data to respond to user prompts in a conversational, easy-to-understand manner. Released by OpenAI in November 2022, ChatGPT (then powered by GPT-3.5) acquired 1 million users within 5 days of its release, outpacing applications such as Facebook, Twitter, and Instagram [2]. By January 2023, its user base had reached 100 million monthly active users, making it the fastest-growing consumer application in history [3]. Our recent study demonstrated the impressive ability of GPT-3.5 in answering questions related to bariatric surgery, showing high accuracy, comprehensiveness, and reproducibility of responses [4]. GPT-3.5’s successor, GPT-4, was released in March 2023 with improvements in performance across multiple domains [5,6,7,8]. The current study builds on our previous analysis by comparing the accuracy and comprehensiveness of responses from GPT-4 with those from GPT-3.5 to questions related to bariatric surgery.

Methods

A total of 151 questions related to bariatric surgery, sourced from healthcare institutions and Facebook support groups, were included. The methodology for question curation is described in our previous study [4]. To better characterize ChatGPT’s performance, questions were organized into 5 categories: (1) “eligibility, efficacy, and procedure options”, (2) “preoperative preparation”, (3) “recovery, risks, and complications”, (4) “lifestyle changes”, and (5) “other”.

Response Generation and Grading

Each question was entered independently into both GPT-3.5 and GPT-4 in July 2023 using the “New Chat” function on the OpenAI platform. Differences in the accuracy and comprehensiveness of responses between GPT-3.5 and GPT-4 were graded by a board-certified, fellowship-trained bariatric surgeon with over 10 years of experience practicing at a tertiary and quaternary referral center. The scale used for independent grading of accuracy and comprehensiveness was as follows: compared to the response from GPT-3.5, the response from GPT-4 is:

  1) Less accurate/comprehensive

  2) Of similar accuracy/comprehensiveness

  3) More accurate/comprehensive
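
The questions in this study were entered manually through the ChatGPT web interface, as described above. For readers who wish to script a similar two-model comparison, the sketch below uses the OpenAI Python client; the model identifiers and client usage shown are assumptions for illustration and were not part of the study protocol.

```python
# Illustrative sketch only: the study used the ChatGPT web interface ("New Chat"),
# not the API. Model names and client usage are assumptions, not the study protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def get_responses(question: str) -> dict:
    """Send one question to two models, each in a fresh conversation."""
    responses = {}
    for model in ("gpt-3.5-turbo", "gpt-4"):  # hypothetical stand-ins for GPT-3.5 / GPT-4
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        responses[model] = completion.choices[0].message.content
    return responses


if __name__ == "__main__":
    print(get_responses("What are the dietary requirements before bariatric surgery?"))
```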

Statistical analysis consisted of descriptive statistics summarizing the proportion and percentage of responses earning each grade. All statistical analyses were performed in Microsoft Excel (Version 16.69.1).
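
As an illustration of this descriptive analysis (which the study performed in Microsoft Excel), the following sketch tabulates grade counts and percentages; the example grades are placeholders, not the study data.

```python
# Minimal sketch of the descriptive analysis; the study itself used Microsoft Excel.
# The grades list below is illustrative placeholder data, not the study's results.
from collections import Counter

grades = ["similar", "more", "similar", "less", "similar"]  # one grade per GPT-4 response

counts = Counter(grades)
total = len(grades)
for grade in ("less", "similar", "more"):
    n = counts.get(grade, 0)
    print(f"{grade:>7}: {n} ({100 * n / total:.1f}%)")
```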

Results

A total of 151 questions were included in our analysis (Supplementary Table 1). The majority of responses were graded as similar in accuracy between the two models. Of the 151 responses from GPT-4, 5 (3.3%) were graded as less accurate, 133 (88.1%) as similar in accuracy, and 13 (8.6%) as more accurate compared to GPT-3.5 (Table 1, Fig. 1). A more notable difference was observed when examining comprehensiveness: 15 (9.9%) of GPT-4’s responses were graded as less comprehensive, 81 (53.6%) as similar in comprehensiveness, and 55 (36.4%) as more comprehensive compared to GPT-3.5 (Table 1, Fig. 1).

Table 1 Accuracy and comprehensiveness of responses generated by GPT-4 compared to GPT-3.5 to questions related to bariatric surgery, stratified by question category
Fig. 1 Accuracy and comprehensiveness of responses generated by GPT-4 compared to GPT-3.5 to questions related to bariatric surgery

Conclusion

We present a follow-up analysis comparing the accuracy and comprehensiveness of responses from GPT-3.5 and GPT-4 to questions related to bariatric surgery. In terms of accuracy, our results show largely uniform performance between the two models, with a large majority (88.1%) of responses graded as having similar accuracy. These findings may suggest a degree of stability and reliability in the underlying models when it comes to generating accurate responses. It is important to note that both models have been undergoing continuous refinement and updating, which may explain this comparability in performance.

A more striking differentiation was observed when examining the comprehensiveness of responses. While over half of the responses (53.6%) had similar levels of comprehensiveness between the two models, a considerable proportion (36.4%) of GPT-4’s responses were found to be more comprehensive than those of GPT-3.5. This could be attributed to enhanced training methodologies and an expanded dataset in GPT-4, allowing for more context-rich and detailed answers [5]. For example, for a question in the “preoperative preparation” category, GPT-4 provided an extensive list of pre-surgical dietary guidelines as well as psychosocial considerations that were absent from GPT-3.5’s response. Notably, GPT-4 provided less comprehensive or less accurate responses than GPT-3.5 for a small number of questions. This discrepancy may be due to multiple factors, including differences in model training and training data, as well as the probabilistic nature of text generation in LLMs, which can lead to occasional variation in performance.

Our study design was pragmatic in that question input mirrored how a user with no technological background might use an LLM. Advanced prompting strategies may therefore minimize variation in LLM performance and improve overall response quality, a topic that would benefit from investigation in future studies.

Limitations and Future Directions

Our study is not without limitations. First, the grading of responses was carried out by a single reviewer, which introduces subjectivity despite the reviewer’s extensive experience. Second, the list of questions used in our study does not encompass all possible patient questions related to bariatric surgery, and our findings may therefore not generalize to ChatGPT’s responses to all bariatric surgery-related queries.

In conclusion, GPT-3.5 and GPT-4 demonstrated a relatively similar ability to generate accurate responses to bariatric surgery-related questions. However, GPT-4 provided more comprehensive responses to 36.4% of questions, demonstrating a meaningful improvement in performance across model iterations. It is important to note that both models provided some inaccurate information, and we therefore advocate for their potential future role as adjunct sources of information alongside medical advice provided by licensed healthcare professionals. Our analysis suggests a steady increase in the robustness of large language models in providing accurate and comprehensive medical information. These improvements may become more pronounced in future iterations and warrant further studies examining their impact on clinical outcomes in bariatric surgery.