FormalPara Key Summary Points

Why carry out this study?

Since their introduction in 2022, artificial intelligence dialogue platforms have influenced many different aspects of healthcare, including efforts to improve health literacy among patients.

The present study sought to use artificial intelligence dialogue platforms to rewrite patient education materials on aortic stenosis from leading academic institutions to meet recommended reading skill levels for the public.

What was learned from the study?

Two artificial intelligence chatbots, ChatGPT and Bard, successfully improved the readability of 21 patient education materials on aortic stenosis, although neither platform consistently reached the recommended 6th-grade reading skill level across all readability measures.

Virtual assistants powered by recent artificial intelligence platforms demonstrate potential to improve healthcare communication by rendering existing educational resources more accessible and understandable for patients.

Introduction

The advent of generative artificial intelligence (AI) platforms has dramatically influenced many aspects of modern medicine. Large language models (LLMs), the generative AI models that understand and generate human language and power these platforms, have recently been deployed for knowledge retrieval, clinical decision support, and documentation in health systems worldwide [1, 2]. AI-powered LLMs have also been proposed as potential tools to aid ongoing efforts in improving patient education materials (PEMs) and health literacy [3]. The American Medical Association (AMA) and National Institutes of Health (NIH) recommend that PEMs be written at or below a 6th-grade reading skill level [4, 5]. Previous research on the readability of online patient resources for various cardiovascular pathologies and procedures has demonstrated that most online PEMs fail to adhere to these recommendations [6,7,8]. In fact, most online medical literature available to the general public is written at a high school or college reading level, raising serious concerns related to patient decision-making and education [9, 10].

Aortic stenosis (AS) is a prevalent valvular disease in the elderly population, with the yearly incidence of AS among patients in the USA and Europe estimated to be 4.4% [11]. Left untreated, AS can be fatal within a few years of symptom onset, which highlights the importance of improving inadequate health literacy among patients with cardiac health conditions [12]. As such, the aim of the present study was to ascertain whether two freely available, widely used AI dialogue platforms can rewrite existing PEMs on AS to adhere to reading skill level recommendations while retaining accuracy of medical content.

Methods

Adapting methods previously described by Kirchner et al., we collected online PEMs pertaining to AS on September 10, 2023, in a private browsing window in Google Chrome (version 116.0.5845.179) via web searches of the top 20 leading academic cardiology, heart, and vascular surgery institutions in the USA per the US News and World Report (USNWR) 2023 hospital rankings [13, 14]. Additionally, PEMs on AS were gathered from available online patient resources provided by a professional cardiothoracic surgical society. Browser history and cache were cleared prior to web searches. PEMs that were video-based or already written at or below the 6th-grade reading skill level according to at least two of the four readability measures used were excluded from analysis. The present study was exempt from institutional review as it did not involve human subjects research. This article does not contain any new studies with human participants or animals performed by any of the authors.

Once collected, online PEMs were reviewed for patient education descriptions of AS through group discussion by multiple investigators. Once identified, these patient education descriptions were assessed for readability via an online application (https://readable.com, Added Bytes Ltd., Brighton, England) using four validated readability measures: Flesch Reading Ease (FRE) score, Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), and Simple Measure of Gobbledygook Index (SMOGI) [15,16,17,18]. These measures are the most commonly used readability tests in the health literacy literature, and are calculated using the following formulas [9]:

$${\text{FRE score}}= 206.835-1.015\left(\frac{{\text{total number of words}}}{{\text{total number of sentences}}}\right)-84.6\left(\frac{{\text{total number of syllables}}}{{\text{total number of words}}}\right)$$
$${\text{FKGL}}= 0.39\left(\frac{{\text{total number of words}}}{{\text{total number of sentences}}}\right)+11.8\left(\frac{{\text{total number of syllables}}}{{\text{total number of words}}}\right)-15.59$$
$${\text{GFI}}= 0.4 \left[\left(\frac{{\text{total number of words}}}{{\text{total number of sentences}}}\right)+100\left(\frac{{\text{total number of complex words}}}{{\text{total number of words}}}\right)\right]$$
$${\text{SMOGI}}= 3+ \sqrt{{\text{polysyllable word count}}}$$

The FRE score ranges from 0 to 100 and corresponds to an American educational level, with higher scores indicating easier reading material (100–90, very easy to read/5th-grade level; 90–80, easy to read or conversational English/6th-grade level; 80–70, fairly easy to read/7th-grade level; 70–60, plain English/8th- to 9th-grade level; 60–50, fairly difficult/10th- to 12th-grade level; 50–30, difficult/college level; 30–10, very difficult/college graduate level; 10–0, extremely difficult/professional level) [15]. FKGL, GFI, and SMOGI range from 0 to 18, 0 to 20, and 5 to 18, respectively, and each score indicates the number of years of education necessary to understand the assessed reading material [16,17,18]. Thus, PEMs with FKGL, GFI, and SMOGI scores of approximately 7 or lower, and FRE scores greater than 80, are generally considered in congruence with AMA and NIH readability recommendations [4, 5]. The FRE and SMOGI measures have previously been highlighted for their utility in patient education and health literacy research [9, 19].
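
For illustration, these formulas can be applied directly to basic text counts. The following is a minimal Python sketch of the four measures as written above; it is not the readable.com implementation used in the study, and its inputs (word, sentence, syllable, and complex-word counts) are assumed to be precomputed.

```python
# Minimal sketch of the four readability formulas given above. Illustrative
# only: the study used https://readable.com, whose tokenization and syllable
# counting may differ. All counts are assumed to be precomputed.

def fre(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier reading."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade required."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gfi(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog Index: complex words have three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

def smogi(polysyllable_words: int) -> float:
    """Simplified SMOG Index, per the formula above."""
    return 3 + polysyllable_words ** 0.5

# Hypothetical counts for a short PEM excerpt:
print(round(fre(words=250, sentences=15, syllables=400), 1))   # ≈ 54.6
print(round(fkgl(words=250, sentences=15, syllables=400), 1))  # ≈ 9.8
```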

Next, patient education descriptions of AS from each PEM were entered into the freely available LLMs ChatGPT-3.5 (https://chat.openai.com/chat; Version August 3, 2023, OpenAI, San Francisco, CA, USA) and Bard (https://bard.google.com; Version July 13, 2023, Google, Mountain View, CA, USA), preceded by the prompt “translate to 5th-grade reading level”. These AI dialogue platforms were prompted to translate text to a lower reading level than recommended by the AMA and NIH to account for potential variability in the interpretation of reading skill levels by the LLMs. The AI-generated text material from ChatGPT-3.5 and Bard was then reevaluated for readability and accuracy.
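
Although the study used the public web interfaces of ChatGPT-3.5 and Bard, the same prompting step could be scripted; the sketch below is a hypothetical example using the OpenAI Python client and the gpt-3.5-turbo model, neither of which was part of the study's methodology.

```python
# Hypothetical sketch of the prompting step. The study pasted each PEM into
# the ChatGPT-3.5 and Bard web interfaces; this analogous API call assumes
# the OpenAI Python client (>= 1.0) and the gpt-3.5-turbo model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rewrite_pem(pem_text: str) -> str:
    """Prepend the study's prompt and return the model's rewritten PEM."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"translate to 5th-grade reading level\n\n{pem_text}",
        }],
    )
    return response.choices[0].message.content
```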

Our primary endpoint was the absolute difference in FRE, FKGL, GFI, and SMOGI scores of each PEM before and after conversion by both AI platforms. To evaluate conversion consistency, four independent conversions of PEMs by each AI dialogue platform were performed [13]. Mean readability scores were recorded and used for subsequent analysis. Percentage change in readability scores before and after AI conversion was calculated. The accuracy of medical content from converted PEMs was secondarily assessed through independent review by multiple investigators. Time (seconds) elapsed between original prompting of each AI platform and completion of text generation was also recorded for each PEM.
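
As a simple illustration of this aggregation step, the sketch below averages the scores of four independent conversions of one PEM and computes the percentage change from the original score; the numeric values are hypothetical, not study data.

```python
# Sketch of the aggregation step: average the readability scores of four
# independent AI conversions of one PEM, then compute the percentage change
# relative to the original score. Values below are hypothetical.
from statistics import mean

def percent_change(original: float, converted: float) -> float:
    """Percentage change in a readability score after AI conversion."""
    return 100 * (converted - original) / original

original_fkgl = 9.4                          # original PEM score
converted_fkgl = mean([6.1, 5.8, 6.0, 5.7])  # four independent conversions
print(round(percent_change(original_fkgl, converted_fkgl), 1))  # ≈ -37.2
```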

Continuous variables were presented as means ± standard deviation (SD) or, if non-normally distributed, as medians with interquartile range. The normality of variables was assessed using quantile–quantile plots and Shapiro–Wilk testing. Student’s t tests or Wilcoxon rank-sum tests were subsequently used, as appropriate, to compare text characteristics; FRE, FKGL, GFI, and SMOGI scores before and after conversion; and percentage change in readability scores by each AI platform. A two-sided p value ≤ 0.05 was considered statistically significant. Statistical analyses were performed using Stata version 18.0 (StataCorp, College Station, TX, USA).
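
The study performed these analyses in Stata; a rough Python analogue using SciPy is sketched below to show the test-selection logic, with randomly generated stand-in scores rather than study data.

```python
# Rough analogue of the statistical workflow (the study used Stata 18).
# Scores are randomly generated stand-ins, not study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
original = rng.normal(52, 12, size=21)    # stand-in pre-conversion FRE scores
converted = rng.normal(77, 1.2, size=21)  # stand-in post-conversion FRE scores

# Assess normality, then choose the comparison test accordingly.
normal = (stats.shapiro(original).pvalue > 0.05
          and stats.shapiro(converted).pvalue > 0.05)

if normal:
    stat, p = stats.ttest_ind(original, converted)  # Student's t test
else:
    stat, p = stats.ranksums(original, converted)   # Wilcoxon rank-sum test

print(f"p = {p:.4f}, significant at the 0.05 level: {p <= 0.05}")
```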

Results

Overall, 21 online PEMs on AS were gathered from USNWR-ranked institutions and a professional cardiothoracic surgical society. All 21 original PEMs were written above the 6th-grade reading level and were included for analysis. Mean (SD) FRE, FKGL, GFI, and SMOGI scores of the original PEMs were 51.9 (11.6), 9.4 (2.0), 11.3 (2.1), and 10.1 (1.2), respectively, indicating fairly difficult readability at the 10th- to 12th-grade reading level (Table 1). Each PEM was subsequently converted by the two AI dialogue platforms without system error or the need for additional direction by the end user (Table 2).

Table 1 Readability scores of original patient education materials on aortic stenosis prior to conversion by artificial intelligence platforms
Table 2 Samples of artificial intelligence-generated patient education materials on aortic stenosis

Conversion of PEMs by ChatGPT-3.5 demonstrated significantly improved mean readability scores across all four validated measures: FRE (76.9 ± 1.2 vs 51.9 ± 11.6, p < 0.001), FKGL (5.9 ± 0.8 vs 9.4 ± 2.0, p < 0.001), GFI (7.7 ± 0.9 vs 11.3 ± 2.1, p < 0.001), and SMOGI (8.8 ± 0.6 vs 10.1 ± 1.2, p < 0.001) scores, indicating fairly easy readability at the approximately 6th–7th grade reading level.

Conversion of PEMs by Bard resulted in significantly improved mean readability scores among three of four validated measures: FRE (66.5 ± 5.2 vs 51.9 ± 11.6, p < 0.001), FKGL (6.9 ± 0.8 vs 9.4 ± 2.0, p < 0.001), and GFI (9.3 ± 1.0 vs 11.3 ± 2.1, p < 0.001) scores, indicating plain English readability at the approximately 8th–9th grade reading level. However, conversion by Bard did not show significant improvement in mean SMOGI (10.0 ± 0.8 vs 10.1 ± 1.2, p = 0.729) scores.

Compared to Bard, ChatGPT-3.5 demonstrated significantly easier readability after conversion as per FRE (76.9 ± 1.2 [ChatGPT-3.5] vs 66.5 ± 5.2 [Bard], p < 0.001), FKGL (5.9 ± 0.8 [ChatGPT-3.5] vs 6.9 ± 0.8 [Bard], p = 0.001), GFI (7.7 ± 0.9 [ChatGPT-3.5] vs 9.3 ± 1.0 [Bard], p < 0.001), and SMOGI (8.8 ± 0.6 [ChatGPT-3.5] vs 10.0 ± 0.8 [Bard], p < 0.001) scores (Fig. 1). Additionally, ChatGPT-3.5 showed significantly greater mean percentage change across all four readability measures compared to Bard (Table 3). ChatGPT-3.5 also had a significantly lower mean time to generate converted PEMs than Bard (5.38 ± 0.67 s vs 8.57 ± 1.29 s, p < 0.001).

Fig. 1

Violin plots of (i) Flesch Reading Ease scores, (ii) Flesch-Kincaid Grade Level scores, (iii) Gunning Fog Index scores, and (iv) Simple Measure of Gobbledygook Index scores before and after conversion by each artificial intelligence platform. *Indicates statistical significance (p < 0.05)

Table 3 Percentage change in readability scores by each artificial intelligence platform

Independent review of each original PEM and the corresponding AI-generated material by multiple investigators identified no factual errors or inaccuracies.

Discussion

Since their introduction in November 2022, LLMs such as ChatGPT-3.5 and Bard have been utilized throughout medical education, research, and practice, with over 50 recent studies describing the great potential of these emerging platforms in healthcare [20]. In cardiology and cardiothoracic surgery, others have described the potential benefits of AI as an adjunct to clinical decision-making and diagnosis [21,22,23]. To our knowledge, no previous studies have evaluated the utility of AI dialogue platforms in improving the readability of existing PEMs on AS. In this pilot study, we assessed the ability of ChatGPT-3.5 and Bard to rewrite PEMs to meet recommended reading skill levels for patients with AS and found varied performance between the two AI platforms. While ChatGPT-3.5 improved readability across all four validated measures, it reached recommended reading skill levels only on the FRE and FKGL measures. Bard improved readability in all measures except SMOGI but reached recommended reading skill levels only on the FKGL measure. Across post-conversion readability scores, overall percentage change in readability scores, and conversion time, ChatGPT-3.5 consistently demonstrated greater utility than Bard in improving the readability of PEMs.

Previous studies have similarly evaluated the utility of AI chatbots in improving the readability of cardiac PEMs [24]. Moons et al. prompted ChatGPT-3.5 and Bard to reduce the reading levels of cardiac PEMs from the Journal of the American Medical Association, Cochrane Library, and the European Journal of Cardiovascular Nursing [24]. In congruence with our findings, researchers found that while both ChatGPT-3.5 and Bard significantly improved the readability of selected PEMs, neither platform was able to reach the recommended 6th-grade reading skill level [24]. It is not surprising that these chatbots encountered difficulty in reaching lower reading levels, as the platforms themselves are designed to generate outputs written at the reading level of an average high school student. Nevertheless, this may represent an inherent limitation of the present version of these freely available LLMs, although rapid updates and advancements in this AI technology may soon ameliorate the issue.

Despite the recommendations of the AMA and NIH, many previous studies in cardiology, cardiothoracic surgery, and vascular surgery have found that most available PEMs are incomprehensible to much of the public [4,5,6,7,8, 25, 26]. Similarly, the present study found all original PEMs on AS to be written above the recommended 6th-grade reading skill level, with most written for patients with at least a high school level of education. As low health literacy has been reported in up to 40% of patients with cardiac disease and is associated with lower rates of medication adherence and higher rates of rehospitalization, it is essential that healthcare providers prioritize the development and distribution of PEMs that are accessible and understandable to members of the public, including those with lower health literacy [12, 27, 28].

The complex medical and surgical management of AS renders the task of accurately and comprehensibly conveying online patient literature uniquely difficult, a challenge that extends across many treatment domains in twenty-first-century healthcare. In a paradigm that supports shared decision-making, appropriate efforts should be taken to ensure optimal participation within this patient population as the degree of understanding and education required continues to become more sophisticated [29]. Generative AI dialogue platforms may help facilitate this process through the systematic development of comprehensible PEMs on a variety of cardiac pathologies and treatments.

While our analysis did not identify any factual errors in the AI-generated medical content, the variability in quality and accuracy of medical information produced by ChatGPT-3.5 and other platforms has been well documented [30, 31]. A recent survey-based study presented 20 experts in congenital heart disease with AI-generated clinical vignettes and found the medical content produced by LLMs to often be incomplete and misleading [30]. Others have similarly questioned the trustworthiness and overall appropriateness of AI-generated responses to medical questions posed by patients [31, 32]. Ultimately, the potential use of AI dialogue platforms by patients to supplement medical decision-making underscores a pressing need for the involvement of regulators and healthcare professionals in establishing minimum quality standards and raising awareness among patients about the limitations of emerging AI chatbots.

The present study was not without limitations. First, the use of the USNWR rankings to select the top 20 academic cardiology, heart, and vascular surgery institutions may limit the generalizability of our results, as PEMs from these institutions may vary in quality and readability from those of other institutions throughout the USA. However, these major academic institutions were selected to capture a representative sample of PEMs on AS from leading centers. Second, the AI dialogue platforms received only one prompt in this study. More effective prompts for improving PEM readability may exist, although our approach was adapted from previously published literature [13]. Third, the present study utilized the freely available versions of ChatGPT-3.5 and Bard, whereas paid versions of these AI platforms may offer more advanced capabilities. Although subscription-based versions may generate different results than those reported here, ChatGPT-3.5 and Bard were ultimately chosen to evaluate the freely accessible AI platforms most likely to be utilized by various stakeholders in patient engagement and health literacy [33]. Additionally, PEMs from countries outside the USA were not analyzed, which may also limit the generalizability of our results. Despite these limitations, the use of two popular AI dialogue platforms and the application of four validated readability measures to 21 PEMs from diverse sources strengthened the rigor of the present study.

Conclusion

AI dialogue platforms such as ChatGPT-3.5 and Bard demonstrate strong performance in improving the readability of PEMs on AS, although the generated texts fall slightly short of reading skill level recommendations. On the basis of these findings, ChatGPT-3.5 may serve as a more useful tool than Bard for improving the readability of PEMs. As AI dialogue platforms rapidly advance and receive growing interest, virtual assistants may become increasingly capable of helping facilitate improvements in cardiovascular health literacy among members of the public.