Bariatric surgery is an effective long-term treatment for severe obesity and has been shown to significantly lower the risk of cardiovascular disease, malignancy, and other obesity-related comorbidities [1,2,3,4]. Despite its proven efficacy and safety, bariatric surgery is underutilized, with less than 1% of eligible patients undergoing the procedure [5]. Several factors, including socioeconomic barriers, limited access to care, general perceptions of bariatric surgery, and notably low health literacy, may contribute to this underutilization [5, 6]. Furthermore, health literacy has been shown to significantly affect both the utilization and outcomes of bariatric surgery [7,8,9].

The internet has become an essential medium for individuals seeking health information, as evidenced by a 2009 Pew Center survey revealing that 61% of U.S. adults searched for medical information online [10]. Specifically concerning bariatric surgery, studies show that approximately 50% [11] to 75% [12] of individuals considering weight loss surgery consult online resources. This information may impact patients’ decision to undergo surgery, with one study showing that 25% of patients decided on surgery mainly based on online information [12]. Furthermore, a significant number of patients continue to utilize the internet postoperatively for information [12]. Therefore, access to high-quality, easy-to-understand online patient education materials (PEMs) may be a promising intervention for improving utilization rates of bariatric surgery, as well as optimizing surgical outcomes.

Easy-to-understand PEMs are critical to ensuring comprehension among patients of all educational levels. The American Medical Association (AMA) has notably recommended that PEMs be written at the 6th grade reading level or lower [13]. However, previous studies have shown low readability among available online PEMs across multiple specialties, including bariatric surgery [14,15,16,17,18,19]. Consequently, the lack of readable PEMs may act as a barrier to patients who seek high-quality information from reliable sources, especially those with low health literacy.

The advent and widespread adoption of large language models (LLMs) have the potential to revolutionize patient education and increase access to information across all fields of medicine. ChatGPT and Bard are two widely used LLMs that have gained unprecedented public adoption [20, 21]. These models were trained on large datasets that enable them to respond to queries in a comprehensible and conversational manner. There is a growing body of literature demonstrating the impressive ability of these models to answer clinical questions across many fields of medicine, including bariatric surgery [22,23,24]. A recent study highlighted ChatGPT’s ability to answer bariatric surgery-related questions, where the model provided comprehensive responses to 86.8% (131/151) of questions with 90.7% reproducibility [24].

While the knowledge base of ChatGPT in bariatric surgery may be impressive, there are currently no studies examining the readability of content produced by LLMs related to bariatric surgery compared to available online PEMs. Thus, we examined the readability of PEMs produced by top-ranked medical institutions in the United States compared to PEMs produced by LLMs. Furthermore, we assessed the ability of LLMs to simplify their language in real time, as well as investigated the adaptability of LLMs to user-reading grade level.

Materials and methods

FAQ and institutional response curation

The Frequently Asked Questions (FAQ) pages of the American Society for Metabolic and Bariatric Surgery (ASMBS), the top 10 hospitals listed in the U.S. News Best Hospitals for Gastroenterology & GI Surgery [25], and the top 10 hospitals listed in the U.S. News 2022–2023 Best Hospitals Honor Roll [26] were reviewed. Questions were curated, screened, and approved by three authors (N.S., J.S., N.R.) to determine their eligibility for inclusion in the study. Only questions related to bariatric surgery or weight loss surgery were included. Questions that were promotional (e.g., “Why should I consider weight loss surgery at [X institution]?”) or related to logistics (e.g., “How do I make an appointment for bariatric surgery [at X institution]?”) were excluded (Fig. 1). Vague questions were rephrased or grammatically modified to eliminate ambiguity (Fig. 1). Duplicate FAQs (i.e., the same FAQ asked by multiple institutions) were preserved in order to analyze the readability of each institution’s individual answer.

Fig. 1

Flow chart illustrating the curation, screening, and selection of bariatric surgery frequently asked questions

ChatGPT and Bard

ChatGPT and Bard are LLMs that have been trained on extensive datasets from various sources, including websites, literature, and articles [27, 28]. Training data for ChatGPT are limited to information available up to September 2021 [27, 29], while Bard does not have a fixed knowledge cutoff [28]. GPT-3.5 was released in November 2022; its successor, GPT-4, was released in March 2023 and is believed to have superior performance across multiple metrics [30]. Bard was also released in March 2023 [28]. When prompted with inquiries, these models can provide well-formulated, conversational, and easy-to-understand responses. The models were refined using reinforcement learning from human feedback (RLHF) to adhere to a wide range of commands and written instructions, with human preferences serving as a reward signal to fine-tune their responses [31, 32]. These models were also trained to align with user intentions and to minimize biased or harmful responses. The specific sources of information used to train ChatGPT and Bard are not entirely known.

LLM response generation

To generate responses, each FAQ was submitted as a prompt to GPT-3.5 and GPT-4 (May 24, 2023 version), as well as Bard (June 7, 2023 version). Each question was entered once using the “new chat” function. After the model generated a response, we prompted it to simplify that response by asking “Can you explain that in simpler terms?” in the same chat. Thus, each FAQ received two recorded responses from each LLM: an initial response and a simplified response.
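
As an illustration only, the sketch below shows how this two-turn workflow (initial question, then a simplification request within the same conversation) could be reproduced programmatically via the OpenAI Python client. In this study, questions were entered manually through the ChatGPT and Bard web interfaces; the model name and helper function shown here are assumptions for illustration.

```python
# Illustrative sketch only: in this study, FAQs were entered manually into the
# ChatGPT and Bard web interfaces. This shows an equivalent two-turn workflow
# using the OpenAI Python client; the model name is an assumption.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def initial_and_simplified(faq: str, model: str = "gpt-4") -> tuple[str, str]:
    """Return the initial response and its simplified follow-up for one FAQ."""
    messages = [{"role": "user", "content": faq}]
    first = client.chat.completions.create(model=model, messages=messages)
    initial_text = first.choices[0].message.content

    # Ask for simplification in the same conversation, mirroring the study protocol.
    messages += [
        {"role": "assistant", "content": initial_text},
        {"role": "user", "content": "Can you explain that in simpler terms?"},
    ]
    second = client.chat.completions.create(model=model, messages=messages)
    return initial_text, second.choices[0].message.content
```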

Question grading

To grade the readability of responses, we used a freely available online readability scoring tool (https://readabilityformulas.com/) that has been previously utilized in several studies [33,34,35,36,37]. This tool analyzes text using seven established and validated readability scoring systems: the Flesch Reading Ease Formula (FRE), Gunning Fog Index (GFI), Flesch–Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook Index (SMOG), Coleman–Liau Index (CLI), Automated Readability Index (ARI), and Linsear Write Formula (LWF).
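
As a point of reference, the same seven metrics can also be computed programmatically; the sketch below uses the open-source textstat Python package rather than the online tool used in this study, so minor numerical differences from the reported scores are possible.

```python
# Programmatic alternative (not the tool used in this study): the open-source
# textstat package implements the same seven readability formulas.
import textstat


def readability_profile(text: str) -> dict:
    """Compute the seven readability metrics reported in this study for one response."""
    return {
        "FRE": textstat.flesch_reading_ease(text),
        "GFI": textstat.gunning_fog(text),
        "FKGL": textstat.flesch_kincaid_grade(text),
        "SMOG": textstat.smog_index(text),
        "CLI": textstat.coleman_liau_index(text),
        "ARI": textstat.automated_readability_index(text),
        "LWF": textstat.linsear_write_formula(text),
    }
```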

These scoring systems use a variety of parameters (e.g., sentence length, number of syllables, number of letters, number of words) to grade the readability of a given text. The FRE generates a score from 0 to 100, with higher scores indicating easier reading [38] (Supplementary Table 1), while the GFI [39], FKGL [40], SMOG [41], CLI [42], ARI [43], and LWF [44] each generate a score corresponding to the U.S. school grade level at which an average student can effectively read and understand the material [45].
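
For reference, the two most widely cited of these formulas are defined below; these are the standard published definitions [38, 40], not values specific to the scoring tool used here.

```latex
\[
\text{FRE} = 206.835 \;-\; 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right)
\;-\; 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)
\]
\[
\text{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right)
\;+\; 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) \;-\; 15.59
\]
```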

Across all responses, punctuation (e.g., periods, commas, exclamation points, etc.) and characters indicating that information is being presented in a list (e.g., bullet points, asterisks, dashes, numbers, etc.) were included in readability score calculations. Formatting characters, such as “**” (indicating bolded text), were excluded. For responses that contained information presented in tables, only the text and punctuation from these tables were included.
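
A minimal sketch of this preprocessing rule is shown below, assuming markdown-style bold markers in the raw LLM output; the helper name is hypothetical.

```python
import re


def clean_for_scoring(response: str) -> str:
    """Strip markdown bold markers ("**") while preserving punctuation and
    list characters (bullets, dashes, numbers), per the rules described above."""
    return re.sub(r"\*\*", "", response)
```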

Accuracy and comprehensiveness

A board-certified, fellowship-trained academic bariatric surgeon in active practice (K.S.) compared the accuracy and comprehensiveness of the initial and simplified responses from each LLM using the following scales.

When comparing accuracy:

1. The Simplified Response is more accurate than the Initial Response.
2. The Simplified Response is equal in accuracy to the Initial Response.
3. The Simplified Response is less accurate than the Initial Response.

When comparing comprehensiveness:

1. The Simplified Response is more comprehensive than the Initial Response.
2. The Simplified Response is equal in comprehensiveness to the Initial Response.
3. The Simplified Response is less comprehensive than the Initial Response.

Statistical analysis

Descriptive analyses are presented as means and standard deviations (SD). Readability scores for answers to FAQs across institutions and LLMs were compared using Student’s t test. A p value less than 0.05 was considered statistically significant for all analyses. All analyses were conducted by author N.S. using Microsoft Excel (version 16.75), with statistical expertise provided by author Y.Y.
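
For illustration only, an equivalent two-sample Student’s t test could be run as shown below; the score arrays are placeholders, not study data.

```python
# Illustrative only: analyses in this study were run in Microsoft Excel.
# Equivalent two-sample Student's t test in Python on hypothetical FRE scores.
from scipy import stats

institutional_fre = [48.2, 51.0, 39.5, 60.1, 44.7]    # placeholder values
gpt4_simplified_fre = [72.3, 75.8, 70.1, 78.0, 74.2]  # placeholder values

# ttest_ind with default settings assumes equal variances (Student's t test).
t_stat, p_value = stats.ttest_ind(institutional_fre, gpt4_simplified_fre)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```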

Institutional Review Board approval and written consent were not required for this study.

Results

Five institutions with bariatric surgery FAQ pages on their websites were included in our study. In combination with the ASMBS website [46], we gathered a total of 69 FAQs; three questions were excluded, leaving 66 FAQs in the study (Fig. 1, Supplementary Tables 2, 3 and 4). Individual readability scores for each institution (anonymized) and the ASMBS are presented in Supplementary Table 5.

Institutional and LLM responses

The mean FRE score of institutional responses was 48.1 (SD = 19.0), which corresponded to “difficult to read,” while initial responses from GPT-3.5, GPT-4, and Bard achieved mean scores of 31.4 (SD = 11.4), 42.7 (SD = 9.7), and 56.3 (SD = 11.6), which corresponded to “difficult to read,” “difficult to read,” and “fairly difficult to read,” respectively (Table 1, Fig. 2). When examined by grade level, institutional response readability ranged from 10th grade to college sophomore (Table 2, Fig. 3). In contrast, the readability of initial LLM responses ranged from college freshman to college graduate for GPT-3.5, 12th grade to college senior for GPT-4, and 9th grade to college freshman for Bard (Table 2, Fig. 3).

Table 1 Comparison of Flesch Reading Ease Scores between institutional, initial LLM, and simplified LLM responses to bariatric surgery frequently asked questions
Fig. 2

Illustration of Flesch Reading Ease scores for institutional, initial LLM, and simplified LLM responses to bariatric surgery frequently asked questions. Box and whisker plots were constructed for institutional, initial LLM, and simplified LLM readability scores. The horizontal line inside each box represents the median, and the X represents the mean. LLM large language model; *p < 0.05 when compared to institutional scores; #p < 0.05 when comparing initial and simplified responses of the same LLM

Table 2 Comparison of reading grade levels between institutional, initial LLM and simplified LLM responses to bariatric surgery frequently asked questions
Fig. 3

Illustration of reading grade levels, measured by the Gunning Fog Index, Flesch–Kincaid Grade Level, Coleman–Liau Index, SMOG Index, Automated Readability Index, and Linsear Write Formula, for institutional, initial LLM, and simplified LLM responses to bariatric surgery frequently asked questions. Box and whisker plots were constructed for institutional, initial LLM, and simplified LLM readability scores. The horizontal line inside each box represents the median, and the X represents the mean. LLM large language model; *p < 0.05 when compared to institutional scores; #p < 0.05 when comparing initial and simplified responses of the same LLM; “---”: American Medical Association-recommended grade level for patient education materials (6th grade)

Simplified responses from GPT-3.5, GPT-4, and Bard achieved mean FRE scores of 53.2 (SD = 10.7), 74.0 (SD = 7.2), and 62.8 (SD = 11.1), which corresponded to “fairly difficult to read,” “fairly easy to read,” and “plain English,” respectively (Table 1, Fig. 2). When examined by grade level, simplified response readability ranged from 10th grade to college freshman for GPT-3.5, 6th to 9th grade for GPT-4, and 8th to 12th grade for Bard (Table 2, Fig. 3).

Comparisons between institutions and LLMs

Institutions vs GPT-3.5

Initial responses from GPT-3.5 received a lower FRE score than institutional responses (p < 0.05) (Table 1, Fig. 2), as well as higher grade levels across all instruments (p < 0.05) (Table 2, Fig. 3). Simplified responses, however, received an FRE score similar to that of institutions (p = 0.059) (Table 1, Fig. 2, Supplementary Table 6). When examining grade levels, GPT-3.5 provided simplified responses with readability similar to that of institutions across most instruments, except for FKGL (p < 0.05) and LWF (p < 0.05) (Table 2, Fig. 3).

Institutions vs. GPT-4

Initial responses from GPT-4 received a lower FRE score than institutional responses (p < 0.05) (Table 1, Fig. 2). When examining grade levels, GPT-4 provided responses with lower readability than institutional responses across most instruments (p < 0.05), except for FKGL (p = 0.142), ARI (p = 0.105), and LWF (p = 0.265) (Table 2, Fig. 3, Supplementary Table 6). Simplified responses from GPT-4 received a higher FRE score than institutional responses (p < 0.05) (Table 1, Fig. 2), as well as lower grade levels across all instruments (p < 0.05) (Table 2, Fig. 3).

Institutions vs. Bard

Initial responses from Bard received a higher FRE score than institutional responses (p < 0.05) (Table 1, Fig. 2). When examining grade levels, Bard produced responses with higher readability than institutional responses across most instruments (p < 0.05), except for GFI (p = 0.160), SMOG (p = 0.285), and LWF (p = 0.543) (Table 2, Fig. 3, Supplementary Table 6). Simplified responses from Bard received a higher FRE score than institutional responses (Table 1, Fig. 2), as well as lower grade levels across all instruments (Table 2, Fig. 3) (p < 0.05).

Comparisons between LLMs

Simplified responses from GPT-3.5, GPT-4, and Bard received higher FRE scores, as well as lower grade levels across all instruments, compared with their respective initial responses (p < 0.05) (Table 3, Supplementary Table 7). Both initial and simplified responses from GPT-4 received higher FRE scores, as well as lower grade levels across all instruments, compared with the corresponding responses from GPT-3.5 (p < 0.05) and from Bard (p < 0.05) (Table 3, Supplementary Table 7).

Table 3 Comparison of readability scores between initial LLM and simplified LLM responses to bariatric surgery frequently asked questions

Accuracy and comprehensiveness of LLM responses

The majority of simplified LLM responses were rated as equal in accuracy to initial responses. The majority of simplified GPT-3.5 and GPT-4 responses (92.4% and 92.4%, respectively) were rated as equal in comprehensiveness to initial responses. However, 34.8% of simplified Bard responses were rated as less comprehensive than initial responses (Supplementary Table 8).

Discussion

Access to high-quality and easy-to-understand PEMs may better serve bariatric surgery patients and the public. We evaluated the readability of bariatric surgery PEMs from medical institutions compared with those generated by LLMs. We then evaluated the ability of LLMs to rephrase their responses and improve their readability when prompted to do so. Finally, we compared the accuracy and comprehensiveness of initial and simplified LLM responses to determine the impact of simplification on content quality. Our analysis shows poor readability among all institutional and initial LLM responses, with average reading levels ranging from 9th grade to college graduate. When prompted to explain their initial responses in simpler terms, all LLMs generated significantly more readable text than their initial responses. Among the LLMs, GPT-4 provided the most readable simplified responses, with reading levels ranging from 6th to 9th grade. Additionally, while GPT-4 and GPT-3.5 maintained high levels of accuracy and comprehensiveness with simplification, Bard showed a notable decrease in comprehensiveness in 34.8% of its simplified responses while maintaining accuracy. Our study demonstrates the ability of LLMs, especially GPT-4, to increase the readability of their output on demand, highlighting their potential to enhance access to easy-to-understand PEMs for all patients considering and undergoing bariatric surgery. We also note variability in LLM performance in maintaining accuracy and comprehensiveness when simplifying PEMs, with GPT-3.5 and GPT-4 outperforming Bard.

Our analysis shows that institutional websites’ PEMs remain too complex for the public, falling short of the AMA recommendation that PEMs be written at a 6th grade level or below [13]. These findings echo the results of previous studies that showed poor readability of online bariatric surgery PEMs [47, 48]. Furthermore, initial responses from LLMs were also found to have poor readability and, in some instances, worse readability than institutional PEMs. These findings are concerning, as multiple studies have found an association between lower health literacy and reduced short-term and long-term weight loss after bariatric surgery [7,8,9]. Other studies have demonstrated an association between lower health literacy and reduced medical appointment follow-up 1 year after surgery [49], as well as a diminished likelihood of eventually undergoing the surgery itself [50].

Considering this, we also examined the ability of LLMs to rephrase their initial responses in simpler terms. GPT-4, when prompted to simplify its responses, demonstrated superior adaptability to reader grade level by generating responses with greater readability than institutional, simplified GPT-3.5, and simplified Bard responses. Furthermore, simplified GPT-4 responses met the AMA recommendation [13] on 2 of the 6 grade-level instruments and were rated “fairly easy to read” on the FRE scale (Table 2). Our findings demonstrate the ability of LLMs, particularly GPT-4, to simplify language in real time when prompted to do so. This may be valuable both for patients seeking information directly from LLMs and for healthcare providers, who may use this technology to improve the readability of their existing PEMs. The superior performance of GPT-4 over GPT-3.5 also highlights the rapid improvement in model performance between iterations over a short period of time, further bolstering the potential of these models in the future.

The notable decrease in the comprehensiveness of simplified Bard responses highlights a critical issue regarding the balance between readability and content quality in LLM output, especially in the context of PEMs. While enhancing the readability of health information is an important goal, it is critical to consider how oversimplification may inadvertently degrade PEM quality. Encouragingly, GPT-4 and GPT-3.5 maintained both accuracy and comprehensiveness, highlighting a potential area of improvement for Bard. Given the discrepancies in performance found in our analysis, we recommend that future studies further evaluate the accuracy and comprehensiveness of rephrased or simplified PEMs. Future iterations of LLMs should ensure that increased readability does not compromise the quality of PEMs delivered to patients.

The discrepancy in performance across the LLMs evaluated in our study also underscores the need for a comprehensive discourse on the ability of LLMs to generate easy-to-understand materials for patients in the healthcare sector. This point is especially salient in light of the rapid evolution and rollout of new LLMs (e.g., Llama 2, Med-PaLM 2), highlighting the urgency of establishing readability standards. The use of LLMs in healthcare requires a balance between sophisticated clinical vernacular and personalized, patient-centered delivery of information. Models that generate language beyond the comprehension of the average patient may engender confusion, exacerbate existing health literacy disparities, and potentially compromise the quality of healthcare. Thus, it is essential for these advanced models to optimize their output for readability and comprehension, thereby elevating the standard of patient-centered care and harnessing the full potential of artificial intelligence in medicine.

Limitations and future directions

The readability assessment tools we selected for our study are widely recognized and utilized [33,34,35,36,37]. However, they have inherent limitations, focusing predominantly on quantifiable aspects of text complexity, such as sentence length and syllable count, rather than qualitative aspects, such as subject familiarity, conceptual difficulty, and context. These tools also do not consider the familiarity of certain words and phrases, which can significantly affect the readability of a given text. For example, while the Gunning Fog Index treats words of three or more syllables as complex, not all multisyllabic words are inherently difficult if they are familiar to the reader (e.g., “responsibility” has six syllables) [51]. The formulas also do not evaluate the organizational structure or layout of a text, which can significantly affect its navigability. Furthermore, our study revealed that the assigned grade level for a given text varies with the assessment tool used, which may limit the reliability of any single instrument. Overall, while these formulas offer a standardized approach to assessing text readability, they do not account for the entire spectrum of factors that contribute to ease of comprehension [52]. Looking forward, we encourage multifaceted approaches to readability studies and hope that more sophisticated tools measuring all aspects of text comprehension are developed in the near future.

The LLMs also have limitations that are currently under investigation. The sources of datasets used to train ChatGPT and Bard are largely unknown. Both OpenAI and Google acknowledge that the current versions of their respective LLMs may produce inaccurate information but hope to improve their performance via user feedback and model adjustments with future iterations. We hope that these constraints will diminish with ongoing enhancements to these models, resulting in even more accurate and consistent responses over time.

Conclusion

Our study highlights the potential of large language models, particularly GPT-4, to enhance the readability of patient education materials related to bariatric surgery, aligning more closely with recommended readability standards. The ability of LLMs to adapt and simplify language in real time underscores their potential to democratize access to high-quality, easy-to-understand medical information. Our study also revealed that simplification of PEMs by LLMs may affect their quality: while all LLMs significantly improved the readability of PEMs, the comprehensiveness of simplified responses varied, underscoring the importance of evaluating both the readability and the quality of PEMs generated by LLMs. The rapid evolution of these models, as evidenced by the superior performance of GPT-4 over GPT-3.5, emphasizes the urgency of harnessing their full potential in the healthcare sector. We recommend future investigation of the integration of artificial intelligence into patient-centered care, which will pave the way for more accessible and personalized approaches to medicine.